2.1 Floating-Point Numbers (IEEE 754)

If you opened up a computer and looked for a “number,” you wouldn’t find one. Not a single digit. Not a decimal point. Not even a minus sign.

Computers do not store numbers the way humans think of numbers. Instead, they store patterns of bits—tiny sequences of 0s and 1s—and interpret those patterns according to a set of rules called IEEE 754 floating-point arithmetic. This standard quietly governs almost every calculation in machine learning, simulation, graphics, optimization, and scientific computing.

It is one of the most important engineering standards ever created, and yet most developers never study it in detail. They use it every day without knowing they are using it.


The Core Idea: Scientific Notation for Machines

Floating-point numbers are really just a computerized version of scientific notation. Humans write large or small numbers like this:

  1.23 × 10⁴

Computers do the same thing, except:

  • they use base 2 instead of base 10,
  • they store only a finite number of bits,
  • they split the number into three components:
    • sign bit: positive or negative
    • exponent: the power of two
    • mantissa (fraction): the digits of the number

In other words, a floating-point value is essentially:

value = (−1)^sign × (1.fraction bits) × 2^(exponent − bias)

This simple structure is the foundation for every operation your model performs. Matrix multiplies, convolutions, eigenvalue calculations, optimizers—everything is built on top of these three fields:

  • 1 bit for the sign
  • 8 bits (float32) or 11 bits (float64) for the exponent
  • 23 bits (float32) or 52 bits (float64) for the fraction

No matter how sophisticated your algorithm is, this is the world it lives in. A world with limits.
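
To make this concrete, here is a minimal Python sketch (standard library only, using the struct module) that pulls those three fields out of a real value and then reassembles it with the formula above. The helper name float32_fields is just for illustration, and the reassembly only applies to normal numbers.

  import struct

  def float32_fields(x):
      # Unpack the raw 32-bit pattern of x as a float32, then split it into
      # the three IEEE 754 fields. (Illustrative helper; normal numbers only.)
      (bits,) = struct.unpack(">I", struct.pack(">f", x))
      return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

  sign, exponent, fraction = float32_fields(-6.25)
  print(sign, exponent, fraction)   # 1 129 4718592

  # Reassemble with the formula above (bias = 127 for float32).
  value = (-1) ** sign * (1 + fraction / 2**23) * 2 ** (exponent - 127)
  print(value)                      # -6.25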


An Illustration: The Finite World of Float32

Let’s examine the most common format in machine learning: float32. It uses:

  • 1 bit for the sign
  • 8 bits for the exponent
  • 23 bits for the mantissa

This means it can represent roughly:

  • 4.3 billion distinct patterns
  • ~7 decimal digits of precision
  • numbers up to about 3.4 × 10³⁸
  • numbers as small as 1.4 × 10⁻⁴⁵
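
None of these figures need to be taken on faith. Assuming NumPy is available, they can be read straight from np.finfo, as in this short sketch:

  import numpy as np

  f32 = np.finfo(np.float32)
  print(f32.bits, f32.nexp, f32.nmant)  # 32 bits total: 8 exponent, 23 fraction (plus 1 sign)
  print(f32.max)                        # 3.4028235e+38, the largest finite float32
  print(f32.tiny)                       # 1.1754944e-38, the smallest *normal* float32
  print(float(np.float32(1.4e-45)))     # 1.401298464324817e-45, the smallest subnormal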

That may sound enormous, but it isn’t enough for many numerical tasks. Seven digits of precision disappear surprisingly quickly during ML training or matrix manipulations.

When two numbers differ in magnitude by more than about seven orders of magnitude, the smaller one simply vanishes in addition. When you subtract two nearly identical numbers, most of the meaningful digits cancel out. When you chain multiplications, the errors snowball.

These behaviors are not bugs. They are consequences of float32’s limitations.
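
A short sketch of the first two effects, written with explicit np.float32 values (assuming NumPy) so the losses are visible:

  import numpy as np

  # Absorption: 2**24 is where float32 stops being able to count by ones.
  big = np.float32(16_777_216.0)
  print(big + np.float32(1.0) == big)   # True: the +1 is rounded away entirely

  # Catastrophic cancellation: the inputs' own rounding dominates the difference.
  a = np.float32(1.0000001)
  b = np.float32(1.0)
  print(a - b)                          # 1.1920929e-07, not the 1e-07 you wrote down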


A Bit Pattern Becomes a Number

Consider the 32-bit pattern:

0 10000010 01100000000000000000000

Break it down:

  • Sign bit = 0 → positive
  • Exponent = 10000010₂ → 130 decimal
  • Mantissa = 011000...0

IEEE 754 applies a bias of 127 to float32 exponents, so the actual exponent is:

130 − 127 = 3

With the implicit leading 1, the mantissa is 1.011₂ = 1.375 in decimal. So the value is:

1.375 × 2³ = 11
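
You can check this decoding yourself. The sketch below (standard-library Python, using struct) packs exactly that bit pattern into four bytes and asks the machine to read it back as a float32.

  import struct

  bits = int("0" "10000010" "01100000000000000000000", 2)   # sign, exponent, mantissa
  value = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
  print(value)   # 11.0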

Every floating-point value you’ve ever used is produced by this kind of decoding. Your neural network weights? Floating point. Your gradients? Floating point. Your loss values? Floating point.

Nothing is exact.

And this is why numerical algorithms must be designed with care: every calculation happens in an environment where precision is scarce.


The Invisible Constraint: Density of Representable Numbers

One of the strangest features of floating-point arithmetic is that numbers are not evenly spaced.

Near zero, representable numbers are extremely dense. Far from zero, they become extremely sparse.

This means:

  • You can represent tiny changes near zero.
  • You cannot represent tiny changes in very large numbers.

A 1-unit increase in a number around 1 billion might not be representable at all. The machine simply jumps over it.
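
Assuming NumPy, np.spacing makes the jump visible: it reports the gap between a value and its nearest representable neighbour.

  import numpy as np

  print(np.spacing(np.float32(1.0)))   # 1.1920929e-07: neighbours near 1 are packed tightly
  print(np.spacing(np.float32(1e9)))   # 64.0: near a billion, neighbours sit 64 apart
  print(np.float32(1e9) + np.float32(1.0) == np.float32(1e9))   # True: the +1 falls in the gap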

This uneven spacing shapes how errors accumulate and how algorithms behave. It is one reason why “subtract two large, close numbers” can be fatal—there may be only a handful of representable values in that region.


Special Values: Inf, −Inf, NaN

The IEEE 754 standard defines several special values:

  • +∞ — too large to represent
  • −∞ — too negative to represent
  • NaN — not a number (invalid operations)

These special values are not errors; they are signals.

  • Division of a nonzero number by zero? → ±∞
  • 0/0 or sqrt(−1)? → NaN
  • Overflow during training? → ±∞
  • An invalid gradient? → NaN
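
These signals are easy to reproduce. A minimal NumPy sketch; the errstate block only silences the warnings NumPy would otherwise print alongside the results:

  import numpy as np

  with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
      print(np.float32(1.0) / np.float32(0.0))    # inf
      print(np.float32(0.0) / np.float32(0.0))    # nan
      print(np.sqrt(np.float32(-1.0)))            # nan
      print(np.float32(3e38) * np.float32(10.0))  # inf: overflow past ~3.4e38

  print(np.nan == np.nan)   # False: a NaN is not equal even to itself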

ML practitioners often encounter NaNs during training and assume a bug. In reality, NaNs are often floating-point’s way of saying:

“Your numbers escaped the representable universe.”


Denormal (Subnormal) Numbers

At magnitudes close to zero, the IEEE 754 standard provides a “soft landing zone” called denormal numbers. These represent extremely tiny values, bridging the gap between zero and the smallest normal number.

They are represented with a modified format (no implicit leading 1 in the mantissa) and are less precise and slower to compute.

Many processors handle denormals inefficiently, drastically increasing latency. Some ML frameworks even flush them to zero for performance.
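
A quick way to see the subnormal zone, assuming NumPy: step from zero to the very first representable positive float32 and compare it with the smallest normal value.

  import numpy as np

  smallest_normal = np.finfo(np.float32).tiny                   # smallest normal float32
  first_step = np.nextafter(np.float32(0.0), np.float32(1.0))   # first value above zero

  print(smallest_normal)                # 1.1754944e-38
  print(float(first_step))              # 1.401298464324817e-45, a subnormal (2**-149)
  print(first_step < smallest_normal)   # True: it lives in the gap below the normal range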

Once again, the machine bends mathematics to fit reality.


Floating-Point Arithmetic: Correct but Not Exact

The IEEE 754 standard guarantees that each basic operation (addition, subtraction, multiplication, division, square root) returns the representable value closest to the exact result.

This is called correct rounding, and it is a remarkable achievement—a consistent global rule for arithmetic across all hardware.

But “closest representable” does not mean “exact.” Every operation may introduce a tiny rounding error. Chain a billion such operations and “tiny” becomes “significant.”
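
Here is one way to watch that happen, assuming NumPy: np.cumsum adds one rounded float32 at a time, while NumPy's pairwise sum keeps far more of the precision.

  import numpy as np

  n = 1_000_000
  xs = np.full(n, 0.1, dtype=np.float32)

  sequential = np.cumsum(xs)[-1]   # one correctly rounded float32 addition after another
  pairwise = xs.sum()              # NumPy's pairwise summation: far fewer sequential roundings
  print(sequential)                # drifts noticeably away from 100000
  print(pairwise)                  # stays much closer to 100000
  print(n * 0.1)                   # 100000.00000000001, the float64 reference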

This is why floating-point arithmetic must be respected. Algorithms that ignore its limits tend to fail.


Why This Matters for Linear Algebra

Every decomposition—LU, QR, SVD—relies on a long sequence of floating-point operations. And each step introduces rounding.

Stable algorithms (like QR via Householder reflections) minimize the propagation of these errors. Unstable algorithms (like naive Gaussian elimination without pivoting) amplify them.
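
The difference shows up even in a 2×2 system. The sketch below (assuming NumPy) runs textbook elimination without pivoting on a matrix whose first pivot is tiny, then compares the result with np.linalg.solve, which pivots.

  import numpy as np

  # A perfectly well-behaved system whose first pivot happens to be tiny.
  A = np.array([[1e-20, 1.0],
                [1.0,   1.0]])
  b = np.array([1.0, 2.0])
  # The exact solution is extremely close to x = [1, 1].

  # Naive elimination without pivoting:
  m = A[1, 0] / A[0, 0]                  # multiplier of 1e20
  a22 = A[1, 1] - m * A[0, 1]            # 1 - 1e20 rounds to -1e20: the original 1 is gone
  b2 = b[1] - m * b[0]                   # 2 - 1e20 also rounds to -1e20
  x2 = b2 / a22                          # 1.0, still plausible
  x1 = (b[0] - A[0, 1] * x2) / A[0, 0]   # (1 - 1) / 1e-20 = 0.0, completely wrong

  print(x1, x2)                   # 0.0 1.0   <- x1 should be about 1
  print(np.linalg.solve(A, b))    # [1. 1.], partial pivoting keeps it stable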

To build reliable numerical systems, you must understand:

  • where numbers lose precision,
  • why underflow and overflow happen,
  • how rounding biases creep in,
  • how many bits of meaningful information remain after an operation.

This is what separates software that “usually works” from systems that always work.


Where We Go From Here

Now that we understand what floating-point numbers are—finite, discrete, unevenly spaced, and full of subtle behavior—we can go one level deeper.

If floating-point formats define what a number is, then machine epsilon, rounding rules, and ULPs define how close two numbers can be and how errors travel through computation.

In the next section, we explore:

  • machine epsilon — the gap between 1 and the next representable number,
  • rounding modes — how values are forced into representable form,
  • ULPs (units in the last place) — the true “distance” between floating-point values.

This is where floating-point arithmetic reveals both its strengths and its fragility. Let’s step into 2.2 Machine epsilon, rounding, and ULPs.

2025-09-08

Shohei Shimoda

This section is my attempt to organize and write down what I have learned.