1.4 A Brief Tour of Real-World Failures

To understand why numerical linear algebra matters, it helps to see what happens when it goes wrong. These failures are not obscure textbook anecdotes; they come from real systems—financial models, control systems, simulations, machine learning pipelines, and even large-scale AI models. In every case, the root cause was the same: a mismatch between mathematical expectation and computational reality.

Failure #1 — The Exploding Model

In machine learning training, it is common to see loss suddenly explode after hours of stable progress. Logs show nothing suspicious. Gradients look fine. Weights appear reasonable—until everything becomes NaN in a single step.

Often the cause is shockingly small: a single subtraction of two nearly equal numbers producing catastrophic cancellation, followed by a division that magnifies the resulting numerical noise until the model destabilizes.

The architecture wasn’t wrong. The math wasn’t wrong. The computational pathway was.
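
To make this concrete, here is a minimal sketch in Python with NumPy (an illustrative toy, not taken from any actual training run). Two float32 values agree to about seven digits, their subtraction keeps almost no correct digits, and a later division turns that rounding noise into a large error, the kind of step that can quietly destabilize a gradient.

    import numpy as np

    # Two float32 values that agree to about seven significant digits.
    a = np.float32(1.0000001)
    b = np.float32(1.0000000)

    # The true difference is 1e-7, but float32 can only resolve steps of
    # about 1.19e-7 near 1.0, so the subtraction is mostly rounding noise:
    # catastrophic cancellation.
    diff = a - b
    print(diff)                     # 1.1920929e-07, not 1e-07 (~19% relative error)

    # Dividing by the tiny, noisy difference magnifies that error, as in a
    # normalization or gradient step scaled by a near-zero gap.
    print(np.float32(1.0) / diff)   # 8388608.0 instead of the "true" 1e7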

Failure #2 — The “Correct” Algorithm That Never Converged

Many iterative solvers (like Conjugate Gradient or GMRES) work beautifully on paper—but when implemented, they sometimes refuse to converge. Developers try changing hyperparameters, rewriting loops, even switching hardware, only to discover the root cause later: the matrix was poorly conditioned.

The algorithm was correct. The problem itself was fragile.

A condition number of 10⁸ means roughly this: the problem itself amplifies small input and rounding errors by up to a factor of 10⁸, so you should expect to lose about eight digits of precision before any algorithm even starts.

No amount of clever coding will save a problem that fundamentally can’t be trusted.
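
Here is a minimal sketch of that limit in Python with NumPy (an illustrative construction, not from any particular production system): build a matrix whose condition number is about 10⁸, solve a system whose exact answer is known, and watch roughly eight digits disappear even with a high-quality direct solver.

    import numpy as np

    rng = np.random.default_rng(0)

    # Construct a symmetric matrix with condition number ~1e8 by choosing
    # its eigenvalues explicitly (a toy construction for illustration).
    n = 100
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    eigenvalues = np.logspace(0, -8, n)        # from 1 down to 1e-8
    A = (Q * eigenvalues) @ Q.T

    x_true = np.ones(n)
    b = A @ x_true
    x = np.linalg.solve(A, b)                  # LAPACK's carefully written solver

    print(f"cond(A)        ~ {np.linalg.cond(A):.1e}")
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(f"relative error ~ {rel_err:.1e}")
    # Expect roughly cond(A) * machine epsilon = 1e8 * 1e-16 = 1e-8:
    # about eight of the sixteen double-precision digits are gone.

The same cap applies to Conjugate Gradient or GMRES: the conditioning of the problem, not the cleverness of the code, sets the accuracy you can reach.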

Failure #3 — When Simulations Diverge “Randomly”

Physics simulations, fluid solvers, particle systems, and reinforcement learning environments will sometimes diverge even when the equations are perfectly valid. Two runs with the same inputs produce different outcomes. Tiny perturbations grow with each timestep until the system collapses.

The reason? Each timestep introduces small floating-point errors. A mathematically stable method may still be computationally unstable when accumulated error exceeds the system’s tolerance.

This is why “stable integrators” exist—not for beauty, but for survival.
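
The sketch below (plain Python, an illustrative toy rather than a real solver) integrates a frictionless harmonic oscillator, whose energy the exact equations conserve forever. Each step introduces a small error; with explicit Euler those errors reinforce one another until the system blows up, while the symplectic variant, one of those "stable integrators," keeps the accumulated error bounded over the same number of steps.

    # Frictionless harmonic oscillator: x'' = -x, with conserved
    # energy 0.5 * (x^2 + v^2).
    dt, steps = 0.01, 100_000
    x_e, v_e = 1.0, 0.0    # explicit Euler state
    x_s, v_s = 1.0, 0.0    # symplectic Euler state

    for _ in range(steps):
        # Explicit Euler: both updates use the old state; every step
        # slightly increases the energy, and the errors compound.
        x_e, v_e = x_e + dt * v_e, v_e - dt * x_e
        # Symplectic Euler: the position update uses the new velocity;
        # the per-step error stays bounded instead of accumulating.
        v_s = v_s - dt * x_s
        x_s = x_s + dt * v_s

    def energy(x, v):
        return 0.5 * (x * x + v * v)

    print(f"true energy:             {0.5:.3f}")
    print(f"explicit Euler energy:   {energy(x_e, v_e):.1f}")   # grows without bound
    print(f"symplectic Euler energy: {energy(x_s, v_s):.3f}")   # stays near 0.5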

Failure #4 — PCA and Embedding Drift

Principal Component Analysis (PCA) is mathematically simple—yet in production it sometimes produces inconsistent components across runs. Embeddings drift. Recommendations fluctuate.

This usually happens because the underlying SVD is being performed on large, nearly collinear matrices whose singular values cluster close together. When two singular values are nearly equal, the corresponding singular vectors are barely determined at all, so a tiny change in floating-point ordering can rotate them by a large angle. Mathematically equivalent outcomes become computationally incompatible.
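
A minimal sketch of the effect in Python with NumPy (a contrived matrix, built only to make the degeneracy obvious): the top two singular values are separated by about 1e-13, so a perturbation on the scale of accumulated round-off is enough to rotate the leading singular vectors, the same vectors PCA would report as its principal components.

    import numpy as np

    rng = np.random.default_rng(0)

    # Build a 50x50 matrix whose top two singular values are nearly tied.
    n = 50
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.linspace(0.5, 0.1, n)
    s[0], s[1] = 1.0, 1.0 - 1e-13          # an almost degenerate pair
    A = (U * s) @ V.T

    # The same matrix plus noise on the scale of accumulated round-off,
    # standing in for a different summation order or a different library.
    A_perturbed = A + 1e-14 * rng.standard_normal((n, n))

    _, _, Vt1 = np.linalg.svd(A)
    _, _, Vt2 = np.linalg.svd(A_perturbed)

    # Compare the leading right singular vectors of the two runs
    # (absolute value ignores harmless sign flips).
    overlap = abs(Vt1[0] @ Vt2[0])
    print(f"|cos(angle)| between leading components: {overlap:.3f}")
    # Often far below 1.0: mathematically "the same" decomposition,
    # numerically a different basis for the nearly degenerate subspace.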

Failure #5 — The Hidden Enemy in Vector Search

Modern vector search systems depend heavily on numerical linear algebra—distance computations, normalization, orthogonal projections, and SVD-based compression. But when vectors are high-dimensional and nearly collinear, distance metrics become unstable. Two vectors that should be distinct collapse toward each other. Nearest-neighbor decisions flip unexpectedly.

Search quality suffers, and the cause often goes unnoticed because the system “still runs.” But beneath the surface, numerical precision has quietly changed the meaning of the embeddings.
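
Here is one concrete mechanism, sketched in Python with NumPy (an illustrative example; real vector databases differ in the details). The squared distance between two nearly collinear unit vectors can be computed by subtracting first, or by the common batched shortcut ||a||^2 + ||b||^2 - 2*a.b. In float32 the shortcut subtracts two numbers close to 2.0, and most of the tiny true distance cancels away.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two nearly collinear unit vectors in a typical embedding dimension.
    d = 768
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)
    b = a + 1e-4 * rng.standard_normal(d)
    b /= np.linalg.norm(b)
    a32, b32 = a.astype(np.float32), b.astype(np.float32)

    # Route 1: subtract first, then square the norm of the difference.
    direct = np.linalg.norm(a32 - b32) ** 2

    # Route 2: the expansion ||a||^2 + ||b||^2 - 2*a.b, common in batched
    # nearest-neighbor kernels. It subtracts numbers near 2.0, so most of
    # the tiny true distance is lost to cancellation.
    expanded = a32 @ a32 + b32 @ b32 - 2 * (a32 @ b32)

    reference = np.linalg.norm(a - b) ** 2     # float64, subtract-first
    print(f"float64 reference: {reference:.3e}")
    print(f"float32 direct   : {direct:.3e}")
    print(f"float32 expanded : {expanded:.3e}")  # large relative error; can even go negative

When many candidates sit at distances smaller than that cancellation error, their ranking is effectively decided by rounding noise, which is how "nearest" neighbors flip.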

Failure #6 — The AI Model That Only Broke in Production

Some LLM-based applications work flawlessly on a developer’s machine, only to behave unpredictably in production. Investigations reveal that the same matrix operations were being executed on slightly different hardware or using different BLAS/LAPACK implementations with different precision trade-offs.

The model wasn’t wrong. The computational ecosystem changed.
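
A small sketch in Python with NumPy of the underlying mechanism (illustrative only; it merely mimics what different BLAS kernels, thread counts, or GPUs do): floating-point addition is not associative, so the "same" reduction performed in a different order returns a slightly different result.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    # Three mathematically identical sums, three accumulation orders.
    chunked = np.float32(0.0)
    for chunk in np.array_split(x, 1000):      # blocked, left-to-right
        chunked += chunk.sum(dtype=np.float32)

    pairwise = x.sum(dtype=np.float32)         # NumPy's pairwise order
    reversed_order = x[::-1].sum(dtype=np.float32)

    print(chunked, pairwise, reversed_order)   # typically differ in the last bits
    # One matrix multiply performs millions of such reductions. Change the
    # library or the hardware and the low-order bits change everywhere;
    # chain enough of them and the divergence can become visible.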

What These Failures Have in Common

Across all of these examples, the pattern is identical:

  • The math looked correct.
  • The code looked correct.
  • The system still failed.

Because between math and code lies a third layer: numerical computation—a world with its own rules.

Why This Matters for the Rest of This Book

This final section of Chapter 1 is meant to serve as a wake-up call. Real systems fail not because engineers don’t know linear algebra, but because they don’t know how it behaves inside a machine.

That is what the rest of this book will cover—how numbers are represented, how operations really execute, how precision is lost, how algorithms actually survive (or fail) inside computing hardware.

To understand these failures at a deeper level, we need to look directly at the environment where all numerical algorithms live: the machine itself.

Chapter 2 — The Computational Model begins with floating-point numbers, rounding behavior, ULPs (Units in the Last Place), and the practical realities of how numbers are stored and manipulated inside modern hardware. Once you see the limits of computation clearly, every algorithm—LU, QR, SVD, iterative methods, even neural network training—will suddenly make much more sense.

2025-09-06

Shohei Shimoda

Here I have organized and written down what I have learned and know.