1.1 What Breaks Real AI Systems

Most AI failures do not come from “AI problems.” They come from numerical problems.

When we look at cutting-edge models—LLMs, diffusion models, optimization pipelines—it’s easy to imagine that failures happen because the algorithms are too complex, or the architecture is wrong, or the data is insufficient.

In reality, many failures start much deeper. They begin with the smallest units of computation: numbers, matrices, and the operations we perform on them.

The Illusion of Perfect Math

On paper, every equation behaves beautifully. Well-posed systems of linear equations have clean solutions. Matrix factorizations always succeed. Gradient descent moves you steadily toward the minimum.

Inside a machine, none of that is guaranteed.

Computers do not use real numbers. They use floating-point numbers, an approximation of real values stored in binary form. This means:

  • Some numbers cannot be represented exactly.
  • Rounding happens constantly.
  • Small errors compound over time.

A model that looks perfect on paper can collapse in practice because the numbers inside it slowly drift off the path.
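You can watch this drift happen in a few lines of Python. The sketch below is illustrative rather than canonical: it uses NumPy, and float32 instead of float64 simply to make the compounding visible quickly.

    import numpy as np

    # 0.1 has no exact binary representation, so rounding starts immediately.
    print(f"{0.1:.20f}")     # 0.10000000000000000555
    print(0.1 + 0.2 == 0.3)  # False

    # Small errors compound. Accumulating 0.1 ten million times in float32
    # drifts far from the exact answer of 1,000,000.
    total = np.float32(0.0)
    for _ in range(10_000_000):
        total += np.float32(0.1)
    print(total)             # roughly 1,087,937 -- off by almost 9%

No single operation here is a bug; each addition is individually as accurate as float32 allows. The error is purely the accumulation of rounding.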

The Hidden Fragility of AI Pipelines

To understand what breaks real AI systems, it helps to look at the patterns that appear again and again across organizations, languages, and problem domains.

1. Ill-conditioned problems

A problem is ill-conditioned when a tiny change in input causes a huge change in output. In floating-point arithmetic, tiny changes happen constantly—so ill-conditioning turns microscopic noise into catastrophic error.

Common triggers include:

  • Nearly dependent features
  • Correlated embeddings
  • Extremely small or large values mixed together

An ill-conditioned matrix doesn’t “almost” work. It actively destroys stability.
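A deliberately contrived NumPy sketch makes this concrete. The 2x2 matrix below is nearly singular; perturbing the right-hand side in the tenth decimal place, which is noise at the level of routine rounding, changes the solution completely:

    import numpy as np

    # Two almost-parallel rows: the system is nearly singular.
    A = np.array([[1.0, 1.0],
                  [1.0, 1.0 + 1e-10]])
    b = np.array([2.0, 2.0])

    print(np.linalg.cond(A))  # on the order of 1e10

    x1 = np.linalg.solve(A, b)
    x2 = np.linalg.solve(A, b + np.array([0.0, 1e-10]))

    print(x1)  # approximately [2., 0.]
    print(x2)  # approximately [1., 1.] -- a completely different answer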

2. Naturally unstable algorithms

Some algorithms amplify numerical noise by design. A few famous examples:

  • Naive Gaussian elimination (without pivoting)
  • Gram–Schmidt orthogonalization (classic version)
  • Normal equations for least squares

Engineers often implement these because they look simple in a textbook—only to discover that the implementation behaves nothing like the theory.
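The gap is easy to demonstrate. Here is a minimal sketch comparing the textbook classical Gram–Schmidt against NumPy's Householder-based np.linalg.qr, run on a 10x10 Hilbert matrix, a standard ill-conditioned test case:

    import numpy as np

    def classical_gram_schmidt(A):
        # Textbook version: project each column against components computed
        # from the *original* column, which lets rounding errors accumulate.
        m, n = A.shape
        Q = np.zeros((m, n))
        for j in range(n):
            v = A[:, j].copy()
            for i in range(j):
                v -= (Q[:, i] @ A[:, j]) * Q[:, i]
            Q[:, j] = v / np.linalg.norm(v)
        return Q

    # 10x10 Hilbert matrix: famously ill-conditioned (cond ~ 1e13).
    n = 10
    H = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

    Q_cgs = classical_gram_schmidt(H)
    Q_hh, _ = np.linalg.qr(H)  # LAPACK's Householder-based QR

    # Departure from orthogonality: stable methods keep this near machine eps.
    print(np.linalg.norm(Q_cgs.T @ Q_cgs - np.eye(n)))  # large -- orthogonality lost
    print(np.linalg.norm(Q_hh.T @ Q_hh - np.eye(n)))    # ~1e-15

Same mathematics, same input, wildly different behavior. The only difference is the order in which rounding errors are allowed to interact.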

3. Loss of significance

This happens when subtracting two nearly identical numbers causes the meaningful digits to cancel out, leaving only numerical noise. It’s subtle and almost impossible to detect without understanding the underlying arithmetic.

Loss of significance is the silent killer of simulations, ML training loops, and financial models.
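A classic illustration is computing 1 - cos(x) for small x. The naive form subtracts two nearly equal numbers; the mathematically identical half-angle form 2*sin^2(x/2) avoids the subtraction entirely:

    import math

    x = 1e-8
    naive = 1.0 - math.cos(x)              # cos(x) rounds to exactly 1.0
    stable = 2.0 * math.sin(x / 2.0) ** 2  # same value, no cancellation

    print(naive)   # 0.0 -- every significant digit cancelled away
    print(stable)  # 5e-17, the correct answer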

4. Overflow and underflow

When values become too large, they overflow to infinity. When they become too small, they underflow to zero.

Softmax instability? Losses that suddenly turn into inf or NaN? Exploding and vanishing gradients? All of them trace back to this fundamental limitation.
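The standard softmax fix is a one-line illustration of the problem. Below is a plain-NumPy sketch of both versions; production frameworks rely on the same max-subtraction trick internally:

    import numpy as np

    def softmax_naive(z):
        e = np.exp(z)  # exp(1000) overflows to inf
        return e / e.sum()

    def softmax_stable(z):
        # Subtracting the max changes nothing mathematically, but keeps
        # every exponent <= 0, so nothing can overflow.
        e = np.exp(z - z.max())
        return e / e.sum()

    logits = np.array([1000.0, 1001.0, 1002.0])
    print(softmax_naive(logits))   # [nan nan nan] (plus overflow warnings)
    print(softmax_stable(logits))  # approximately [0.09 0.24 0.67]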

5. Poorly chosen decompositions

Using LU decomposition on a problem that wants QR. Using QR on a problem that wants the SVD. Using normal equations when the problem requires a more stable method.

Choosing the wrong solver is like using a flathead screwdriver on a Phillips screw—technically possible, but only if nothing goes wrong.
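The least-squares case is the canonical example. In the synthetic NumPy sketch below, two columns are nearly identical; forming the normal equations squares the condition number (cond(A^T A) = cond(A)^2), while an SVD-based solver works on the original matrix:

    import numpy as np

    rng = np.random.default_rng(0)

    # Design matrix with two near-duplicate columns: cond(A) is ~1e7,
    # so cond(A^T A) is ~1e14 -- close to the limit of float64.
    t = rng.standard_normal(100)
    A = np.column_stack([t, t + 1e-7 * rng.standard_normal(100), np.ones_like(t)])
    x_true = np.array([1.0, 2.0, 3.0])
    b = A @ x_true

    # Route 1: normal equations -- solve (A^T A) x = A^T b.
    x_normal = np.linalg.solve(A.T @ A, A.T @ b)

    # Route 2: SVD-based least squares on A itself.
    x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(np.linalg.norm(x_normal - x_true))  # typically orders of magnitude worse
    print(np.linalg.norm(x_svd - x_true))     # small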

6. Scaling issues

When values vary across many orders of magnitude, floating-point precision is spent on the largest magnitudes while the smallest ones drown in rounding error. This is why ML pipelines include:

  • normalization
  • standardization
  • whitening
  • log transforms

Scaling is not just a preprocessing trick. It is a numerical survival mechanism.
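A quick sketch shows why, using made-up feature scales: compare the condition number of a raw data matrix against its standardized version.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two features on wildly different (hypothetical) scales.
    raw = np.column_stack([
        rng.standard_normal(1000) * 1e9,   # huge values
        rng.standard_normal(1000) * 1e-6,  # tiny values
    ])

    # Standardize: zero mean, unit variance per column.
    scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

    print(np.linalg.cond(raw))     # ~1e15: numerically almost unusable
    print(np.linalg.cond(scaled))  # ~1: well-conditioned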

Why These Problems Matter

What makes numerical issues especially dangerous is that they masquerade as something else. Models appear to fail for mysterious reasons:

  • “Training diverged.”
  • “Loss suddenly spiked.”
  • “The model is unstable.”
  • “Gradients exploded out of nowhere.”
  • “This algorithm works on paper but breaks in production.”

But underneath, the failures often stem from:

  • rounding error
  • cancellation
  • ill-conditioning
  • inappropriate algorithms
  • poor scaling

Modern AI systems are built on top of enormous matrix operations. If those operations become unstable, everything above them—attention layers, embeddings, optimizers, inference pipelines—starts to wobble.

A Simple Rule for Real-World Systems

The more sophisticated the AI system becomes, the more its stability depends on the quality of its numerical foundations.

This leads us to the heart of the issue:

Textbook mathematics assumes perfect numbers. Computers do not have perfect numbers.

Everything you think you “know” about linear algebra changes the moment you step into floating-point arithmetic.

To understand how AI systems truly behave, we must first understand how floating-point numbers really work.

That is the topic of the next section.

In 1.2, we'll peel back the abstraction and look at the computational model itself: how numbers are stored, how rounding works, and why the smallest implementation detail can completely change an algorithm's behavior. That gap between theory and practice is exactly why numerical linear algebra is not optional knowledge for modern engineers.

2025-09-03

Shohei Shimoda

Here I have organized and written down what I have learned and know.