1.4 A Brief Tour of Real-World Failures
To understand why numerical linear algebra matters, it helps to see what happens when it goes wrong. These failures are not obscure textbook anecdotes; they come from real systems—financial models, control systems, simulations, machine learning pipelines, and even large-scale AI models. In every case, the root cause was the same: a mismatch between mathematical expectation and computational reality.
Failure #1 — The Exploding Model
In machine learning training, it is common to see the loss suddenly explode after hours of stable progress. Logs show nothing suspicious. Gradients look fine. Weights appear reasonable—until everything becomes NaN in a single step.
Often the cause is shockingly small: a single subtraction of two nearly equal numbers producing catastrophic cancellation, followed by a division that magnifies the resulting numerical noise until the model destabilizes.
The architecture wasn’t wrong. The math wasn’t wrong. The computational pathway was.
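Here is a minimal NumPy sketch of this failure mode. The formula and the data are illustrative, not taken from any particular model: the "textbook" variance identity E[x²] − E[x]² subtracts two nearly equal numbers in float32, and the cancellation noise can even drive the result negative, so the square root that follows returns NaN.

```python
import numpy as np

# Illustrative sketch: the "textbook" variance formula in float32.
# The data values are hypothetical; only the failure mode matters.
rng = np.random.default_rng(0)
x = (10_000.0 + rng.standard_normal(10_000)).astype(np.float32)

mean_of_squares = np.mean(x * x)        # ~1e8; only ~7 digits survive in float32
square_of_mean = np.mean(x) ** 2        # also ~1e8, nearly identical

var = mean_of_squares - square_of_mean  # catastrophic cancellation
print(var)                              # noise: may be wrong, or even negative
print(np.sqrt(var))                     # NaN whenever cancellation drove var negative
```

The true variance here is about 1, but at a magnitude of 1e8 the spacing between adjacent float32 values is 8, so the subtraction returns rounding noise rather than the answer.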
Failure #2 — The “Correct” Algorithm That Never Converged
Many iterative solvers (like Conjugate Gradient or GMRES) work beautifully on paper—but when implemented, they sometimes refuse to converge. Developers try changing hyperparameters, rewriting loops, even switching hardware, only to discover the root cause later: the matrix was poorly conditioned.
The algorithm was correct. The problem itself was fragile.
A condition number of 10⁸ means roughly this: you can lose up to eight digits of accuracy simply by solving the problem, no matter how carefully the algorithm is implemented.
No amount of clever coding will save a problem that fundamentally can’t be trusted.
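To see this concretely, here is a short NumPy sketch using the Hilbert matrix, a standard ill-conditioned example (the size, and the exact error you see, are illustrative). The solver is a correct LAPACK routine, yet the answer loses far more digits than machine precision alone would explain.

```python
import numpy as np

# Sketch: solving Ax = b with a correct algorithm on a fragile problem.
n = 10
i = np.arange(n)
A = 1.0 / (i[:, None] + i[None, :] + 1)   # Hilbert matrix: cond(A) ~ 1e13

x_true = np.ones(n)
b = A @ x_true

x = np.linalg.solve(A, b)                 # LU with partial pivoting (LAPACK)
print(np.linalg.cond(A))                  # ~1e13: expect roughly 13 lost digits
print(np.max(np.abs(x - x_true)))         # orders of magnitude above 1e-16
```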
Failure #3 — When Simulations Diverge “Randomly”
Physics simulations, fluid solvers, particle systems, and reinforcement learning environments sometimes diverge even when the underlying equations are perfectly valid. Two runs with the same inputs can produce different outcomes. Tiny perturbations grow with each timestep until the system collapses.
The reason? Each timestep introduces small floating-point errors. A mathematically stable method may still be computationally unstable when accumulated error exceeds the system’s tolerance.
This is why “stable integrators” exist—not for beauty, but for survival.
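A tiny sketch of pure rounding accumulation (the step size and step count are arbitrary): simply advancing a simulation clock in float32 drifts measurably, because 0.001 is not exactly representable in binary and every addition rounds again.

```python
import numpy as np

# Sketch: accumulating the simulation clock itself in float32.
dt = np.float32(0.001)          # not exactly representable in binary
t = np.float32(0.0)
for _ in range(1_000_000):
    t = t + dt                  # each addition rounds the result again

print(t)                        # drifts noticeably away from 1000.0
print(abs(float(t) - 1000.0))   # the accumulated rounding error
```

In a real solver this drift does not stay in the clock; it feeds back into the state at every step, which is exactly how accumulated error exceeds the system's tolerance.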
Failure #4 — PCA and Embedding Drift
Principal Component Analysis (PCA) is mathematically simple—yet in production it sometimes produces inconsistent components across runs. Embeddings drift. Recommendations fluctuate.
This usually happens because the underlying SVD is computed on large, nearly collinear matrices. When singular values nearly coincide, the corresponding singular vectors are ill-determined: a change in floating-point ordering as small as rounding error can produce a large rotation in them. Mathematically equivalent outcomes become computationally incompatible.
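A small NumPy sketch of the effect (the matrix size, spectrum, and perturbation scale are made up): two singular values that differ by 10⁻¹², plus a perturbation at rounding level, and the "first principal component" of two runs can point in visibly different directions.

```python
import numpy as np

# Sketch: ill-determined singular vectors near a degenerate spectrum.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))   # random orthogonal basis
s = np.array([1.0, 1.0 + 1e-12] + [0.1] * 48)        # two nearly equal singular values
A = (Q * s) @ Q.T                                    # matrix with spectrum s

E = 1e-10 * rng.standard_normal((50, 50))            # rounding-scale perturbation
U1 = np.linalg.svd(A)[0]
U2 = np.linalg.svd(A + E)[0]

# Alignment of the two leading singular vectors; 1.0 would mean identical.
print(abs(U1[:, 0] @ U2[:, 0]))                      # often far below 1.0
```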
Failure #5 — The Hidden Enemy in Vector Search
Modern vector search systems depend heavily on numerical linear algebra—distance computations, normalization, orthogonal projections, and SVD-based compression. But when vectors are high-dimensional and nearly collinear, distance metrics become unstable. Two vectors that should be distinct collapse toward each other. Nearest-neighbor decisions flip unexpectedly.
Search quality suffers, and the cause often goes unnoticed because the system “still runs.” But beneath the surface, numerical precision has quietly changed the meaning of the embeddings.
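A short sketch of that quiet precision loss (the dimension, noise scale, and seed are made up; 1536 is just a typical embedding width): in float64 the cosine distance between two nearly collinear vectors is tiny but meaningful; recomputed in float32, the precision many vector stores use, it can collapse to zero or even come out slightly negative.

```python
import numpy as np

# Sketch: cosine distance between two nearly collinear embeddings.
rng = np.random.default_rng(1)
base = rng.standard_normal(1536)                  # hypothetical embedding
a = base + 1e-4 * rng.standard_normal(1536)       # two slight variations of it
b = base + 1e-4 * rng.standard_normal(1536)

def cosine_distance(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(a, b))                      # small but real, around 1e-8
print(cosine_distance(a.astype(np.float32),
                      b.astype(np.float32)))      # noise: may be 0 or negative
```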
Failure #6 — The AI Model That Only Broke in Production
Some LLM-based applications work flawlessly on a developer’s machine, only to behave unpredictably in production. Investigations reveal that the same matrix operations were being executed on slightly different hardware or using different BLAS/LAPACK implementations with different precision trade-offs.
The model wasn’t wrong. The computational ecosystem changed.
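The root mechanism is easy to demonstrate without any special hardware (the array size and seed below are arbitrary): floating-point addition is not associative, so the same reduction performed in a different order, as a different BLAS build, thread count, or GPU may do, produces a different result.

```python
import numpy as np

# Sketch: the same sum, two summation orders, two answers.
rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000).astype(np.float32)

s1 = np.sum(x)           # pairwise summation over the given order
s2 = np.sum(np.sort(x))  # identical numbers, different order

print(s1, s2)            # typically differ in the trailing digits
print(s1 - s2)           # a nonzero difference from ordering alone
```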
What These Failures Have in Common
Across all of these examples, the pattern is identical:
- The math looked correct.
- The code looked correct.
- The system still failed.
Because between math and code lies a third layer: numerical computation—a world with its own rules.
Why This Matters for the Rest of This Book
This final section of Chapter 1 is meant to serve as a wake-up call. Real systems fail not because engineers don’t know linear algebra, but because they don’t know how it behaves inside a machine.
That is what the rest of this book will cover—how numbers are represented, how operations really execute, how precision is lost, how algorithms actually survive (or fail) inside computing hardware.
To understand these failures at a deeper level, we need to look directly at the environment where all numerical algorithms live: the machine itself.
Chapter 2 — The Computational Model begins with floating-point numbers, rounding behavior, ULPs (Units in the Last Place), and the practical realities of how numbers are stored and manipulated inside modern hardware. Once you see the limits of computation clearly, every algorithm—LU, QR, SVD, iterative methods, even neural network training—will suddenly make much more sense.