Why Computers Need the Floating-Point Number Format

The answer lies in the amount of data that the CPU can process at once: the word.

CPUs cannot handle all the memory at once; they work in chunks of data. RAM also stores data in chunks rather than as one long stream of 0s and 1s.

This limited chunk of data is the key.

Think of this chunk of data as a small piece of paper. A CPU can only process one piece of paper at a time.

A chunk of data: Word

Current computers have a lot of RAM; more than one gigabyte is the norm. Does that mean that the computer can process all that information at once? No.

The CPU is the part of the computer that processes data. It works in cycles; the CPU can only process a small amount of data in each cycle. This smaller unit is called a word.

RAM works in a similar fashion. It stores words, not just a stream of 0s and 1s. Think of the RAM as a shelf where we have direct access to all the books. Yet, RAM allows us to retrieve only one book at a time, so if we need data stored in two books, we need two trips to the shelf.
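To get a feel for the word size of your own machine, here is a minimal Python sketch. It assumes that the size of a native pointer matches the word size, which holds on most mainstream desktop and server platforms but is not guaranteed everywhere.

```python
import struct

# Size of a native pointer in bits. We use it as a stand-in for the
# word size; that equivalence is an assumption, not a guarantee.
pointer_bits = struct.calcsize("P") * 8

print(f"pointer size: {pointer_bits} bits")
```

On a typical modern laptop this reports 64 bits.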

Representation of numbers

How do we store numbers in these binary chunks of data we call words? How do you think that our computers store a number like 1992273.14159?

Let’s start with something simple: storing 5dec, which in binary is 101bin, is easy; we store “101”. Yet, it’s not always so simple.
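As a quick check of the integer case, this short Python sketch converts 5 to binary and back using only built-in functions. The padding to eight bits is just an illustration of how the number would sit in a small word.

```python
number = 5

bits = format(number, "b")      # binary digits without any prefix
print(bits)                     # 101

print(format(number, "08b"))    # 00000101 (padded to eight bits)

print(int("101", 2))            # 5 (parsing the binary digits back)
```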

It sounds obvious, but computers really only work with 0s and 1s. Numbers, however, rely on more symbols than that. Take 0.25dec, which is 0.01bin: it is not just 0s and 1s, it also has a binary point “.”.

So, how do we represent a binary point with 0s and 1s? We cannot, at least not directly. We need a convention that tells us which bits come before the binary point and which come after.

A simple number format

Let’s try representing 0.25dec (0.01bin) in eight bits.

What do we do with the binary point? For example, we can set the rule that the first four bits hold the integer part of the number and the last four hold the digits after the binary point.

In this representation, we first convert the number to binary and then fit it into our format. Following these rules, 5dec (101bin) becomes 0101 0000, 0.25dec (0.01bin) becomes 0000 0100, and 0.2dec becomes 0000 0011.
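The four-plus-four-bit rule can be sketched in Python as follows. The function name `to_fixed_4_4` is made up for this illustration, and the sketch assumes that bits which do not fit after the binary point are simply dropped (truncated) and that only non-negative numbers are supported.

```python
def to_fixed_4_4(value: float) -> str:
    """Encode a number as 4 integer bits + 4 fractional bits (truncating)."""
    if not 0 <= value < 16:
        raise ValueError("value must be at least 0 and below 16")
    # Multiplying by 16 moves the binary point four places to the right;
    # int() then drops every bit that is still behind the point.
    scaled = int(value * 16)
    bits = format(scaled, "08b")
    return bits[:4] + " " + bits[4:]


for number in (5, 0.25, 0.2, 15.9375):
    print(number, "->", to_fixed_4_4(number))
# 5 -> 0101 0000
# 0.25 -> 0000 0100
# 0.2 -> 0000 0011
# 15.9375 -> 1111 1111
```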

Precision

Notice that we had to cut 0.2dec short. In binary, the number is 0.00110011…bin, with “0011” repeating forever. Keeping only four bits after the binary point leaves 0.0011bin, which is 0.1875dec, so we lost precision.
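To see the repetition and the size of the error, here is a small Python sketch that uses exact fractions to avoid any rounding of its own. The helper name `binary_fraction_digits` is hypothetical.

```python
from fractions import Fraction


def binary_fraction_digits(value: Fraction, count: int) -> str:
    """Return the first `count` binary digits after the point (0 <= value < 1)."""
    digits = []
    for _ in range(count):
        value *= 2
        digit = int(value)        # 0 or 1: the next binary digit
        digits.append(str(digit))
        value -= digit
    return "".join(digits)


one_fifth = Fraction(1, 5)        # exactly 0.2
print(binary_fraction_digits(one_fifth, 16))   # 0011001100110011

kept = binary_fraction_digits(one_fifth, 4)    # the four bits we can store
stored = Fraction(int(kept, 2), 16)
print(kept, "=", float(stored))                # 0011 = 0.1875
print("error:", float(one_fifth - stored))     # error: 0.0125
```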

Range

Our format cannot handle numbers of sixteen or more, because sixteen (10000bin) needs five bits for the integer part and we only have four. Therefore, it has a low range of possible numbers to represent.
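The limits of the eight-bit format follow directly from the bit counts. A minimal sketch, assuming the same four-plus-four split and non-negative numbers only:

```python
INTEGER_BITS = 4
FRACTION_BITS = 4

# Largest integer part that fits in four bits: 1111bin
largest_integer = 2**INTEGER_BITS - 1
print(largest_integer)            # 15

# Largest value overall: 1111.1111bin
largest_value = (2 ** (INTEGER_BITS + FRACTION_BITS) - 1) / 2**FRACTION_BITS
print(largest_value)              # 15.9375

# Smallest step between two neighbouring values: 0000.0001bin
smallest_step = 1 / 2**FRACTION_BITS
print(smallest_step)              # 0.0625

# Sixteen needs one bit more than we have for the integer part.
print((16).bit_length())          # 5
```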

Finding the right balance of precision and range is key to a good format.

A word of 64 bits

Luckily, most modern computers use a word of 64 bits, which allows for much more range and precision than our previous eight-bit word.

The IEEE is the institution that publishes the standards manufacturers follow. One of them, IEEE 754, defines the Double-Precision Floating-Point Format, which uses 64 bits (one word in most modern computers) to store a number.
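As an illustration, Python's standard `struct` module can pack a number into this format, so we can look at the 64 bits ourselves. The number is the example from earlier in the article; the variable names are only for this sketch.

```python
import struct

number = 1992273.14159

# ">d" packs the value as a big-endian IEEE 754 double.
packed = struct.pack(">d", number)
bits = "".join(format(byte, "08b") for byte in packed)

print(len(packed), "bytes =", len(bits), "bits")   # 8 bytes = 64 bits
print(bits)
```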

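It is based on scientific notation: instead of storing the digits as written, it stores the parts of the notation separately, namely one sign bit, an 11-bit exponent e, and a 52-bit fraction f. Without going much into the details, this is how the format turns those parts back into a number: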

$$(-1)^{\text{sign}} \times 1.f \times 2^{e - 1023}$$
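The formula can be tried out with a short Python sketch that splits a number into its sign, exponent e, and fraction f, and then rebuilds it. The function names `decode_double` and `rebuild` are hypothetical, and the sketch only covers ordinary ("normal") numbers; zero, subnormal numbers, infinities, and NaN follow special rules that are left out here.

```python
import struct


def decode_double(value: float) -> tuple[int, int, int]:
    """Split a double into (sign, exponent e, fraction f) as raw integers."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = raw >> 63                  # 1 bit
    exponent = (raw >> 52) & 0x7FF    # 11 bits, stored with a bias of 1023
    fraction = raw & ((1 << 52) - 1)  # 52 bits after the implicit "1."
    return sign, exponent, fraction


def rebuild(sign: int, exponent: int, fraction: int) -> float:
    """Apply (-1)^sign x 1.f x 2^(e - 1023); valid for normal numbers only."""
    return (-1) ** sign * (1 + fraction / 2**52) * 2.0 ** (exponent - 1023)


print(decode_double(0.25))             # (0, 1021, 0)
print(rebuild(*decode_double(0.25)))   # 0.25
```

Here 0.25 is stored as 1.0 x 2^(-2), so the fraction is all zeros and e is 1021, since 1021 - 1023 = -2.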

Using scientific notation is a smart choice: the exponent lets the same 64 bits cover both huge and tiny numbers without extra space. The price is that the numbers the format can represent get further apart as they grow.
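A rough way to see both the range and its cost is to ask Python about its own `float` type. This sketch assumes that `float` is an IEEE 754 double, which is the case on virtually all current platforms.

```python
import sys

# Largest and smallest positive normal values a double can hold.
print(sys.float_info.max)        # roughly 1.8e308
print(sys.float_info.min)        # roughly 2.2e-308

# Bits of precision: 52 stored fraction bits plus the implicit leading 1.
print(sys.float_info.mant_dig)   # 53

# Beyond 2**53, neighbouring doubles are more than 1 apart,
# so adding 1 no longer changes the value.
big = float(2**53)
print(big + 1 == big)            # True
```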

Just the perfect convention

In the end, the Double-Precision Floating-Point Format is just a convention. Computer manufacturers and software companies agree on using this format, and it has become the norm. But if words (the chunks of data used by CPUs and RAM) suddenly became smaller or larger, a new format could become more popular.

The key is in the chunk

The Double-Precision Floating-Point Format fits neatly into the hardware of most modern computers, which use words of 64 bits. It provides a large range and a lot of precision while keeping each number within the size of one word, so a number can be retrieved with one trip to the RAM and handled by the CPU as a single word.

GIMTEC is the newsletter I wish I had earlier in my software engineering career.
