1.4 A Brief Tour of Real-World Failures

To understand why numerical linear algebra matters, it helps to see what happens when it goes wrong. These failures are not obscure textbook anecdotes; they come from real systems—financial models, control systems, simulations, machine learning pipelines, and even large-scale AI models. In every case, the root cause was the same: a mismatch between mathematical expectation and computational reality.

Failure #1 — The Exploding Model

In machine learning training, it is common to see loss suddenly explode after hours of stable progress. Logs show nothing suspicious. Gradients look fine. Weights appear reasonable—until everything becomes NaN in a single step.

Often the cause is shockingly small: a single subtraction of two nearly equal numbers producing catastrophic cancellation, followed by a division that magnifies the resulting numerical noise until the model destabilizes.

The architecture wasn’t wrong. The math wasn’t wrong. The computational pathway was.
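
To make this concrete, here is a minimal sketch in Python with NumPy (an illustrative toy, not taken from any actual training run). Two float32 values agree to about seven digits, their subtraction keeps almost no correct digits, and a later division turns that rounding noise into a large error, the kind of step that can quietly destabilize a gradient.

    import numpy as np

    # Two float32 values that agree to about seven significant digits.
    a = np.float32(1.0000001)
    b = np.float32(1.0000000)

    # The true difference is 1e-7, but float32 can only resolve steps of
    # about 1.19e-7 near 1.0, so the subtraction is mostly rounding noise:
    # catastrophic cancellation.
    diff = a - b
    print(diff)                     # 1.1920929e-07, not 1e-07 (~19% relative error)

    # Dividing by the tiny, noisy difference magnifies that error, as in a
    # normalization or gradient step scaled by a near-zero gap.
    print(np.float32(1.0) / diff)   # 8388608.0 instead of the "true" 1e7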

Failure #2 — The “Correct” Algorithm That Never Converged

Many iterative solvers (like Conjugate Gradient or GMRES) work beautifully on paper—but when implemented, they sometimes refuse to converge. Developers try changing hyperparameters, rewriting loops, even switching hardware, only to discover the root cause later: the matrix was poorly conditioned.

The algorithm was correct. The problem itself was fragile.

A condition number of 10⁸ means roughly this: the problem itself amplifies small input and rounding errors by up to a factor of 10⁸, so you should expect to lose about eight digits of precision before any algorithm even starts.

No amount of clever coding will save a problem that fundamentally can’t be trusted.
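
Here is a minimal sketch of that limit in Python with NumPy (an illustrative construction, not from any particular production system): build a matrix whose condition number is about 10⁸, solve a system whose exact answer is known, and watch roughly eight digits disappear even with a high-quality direct solver.

    import numpy as np

    rng = np.random.default_rng(0)

    # Construct a symmetric matrix with condition number ~1e8 by choosing
    # its eigenvalues explicitly (a toy construction for illustration).
    n = 100
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    eigenvalues = np.logspace(0, -8, n)        # from 1 down to 1e-8
    A = (Q * eigenvalues) @ Q.T

    x_true = np.ones(n)
    b = A @ x_true
    x = np.linalg.solve(A, b)                  # LAPACK's carefully written solver

    print(f"cond(A)        ~ {np.linalg.cond(A):.1e}")
    rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(f"relative error ~ {rel_err:.1e}")
    # Expect roughly cond(A) * machine epsilon = 1e8 * 1e-16 = 1e-8:
    # about eight of the sixteen double-precision digits are gone.

The same cap applies to Conjugate Gradient or GMRES: the conditioning of the problem, not the cleverness of the code, sets the accuracy you can reach.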

Failure #3 — When Simulations Diverge “Randomly”

Physics simulations, fluid solvers, particle systems, and reinforcement learning environments will sometimes diverge even when the equations are perfectly valid. Two runs with the same inputs produce different outcomes. Tiny perturbations grow with each timestep until the system collapses.

The reason? Each timestep introduces small floating-point errors. A mathematically stable method may still be computationally unstable when accumulated error exceeds the system’s tolerance.

This is why “stable integrators” exist—not for beauty, but for survival.
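
The sketch below (plain Python, an illustrative toy rather than a real solver) integrates a frictionless harmonic oscillator, whose energy the exact equations conserve forever. Each step introduces a small error; with explicit Euler those errors reinforce one another until the system blows up, while the symplectic variant, one of those "stable integrators," keeps the accumulated error bounded over the same number of steps.

    # Frictionless harmonic oscillator: x'' = -x, with conserved
    # energy 0.5 * (x^2 + v^2).
    dt, steps = 0.01, 100_000
    x_e, v_e = 1.0, 0.0    # explicit Euler state
    x_s, v_s = 1.0, 0.0    # symplectic Euler state

    for _ in range(steps):
        # Explicit Euler: both updates use the old state; every step
        # slightly increases the energy, and the errors compound.
        x_e, v_e = x_e + dt * v_e, v_e - dt * x_e
        # Symplectic Euler: the position update uses the new velocity;
        # the per-step error stays bounded instead of accumulating.
        v_s = v_s - dt * x_s
        x_s = x_s + dt * v_s

    def energy(x, v):
        return 0.5 * (x * x + v * v)

    print(f"true energy:             {0.5:.3f}")
    print(f"explicit Euler energy:   {energy(x_e, v_e):.1f}")   # grows without bound
    print(f"symplectic Euler energy: {energy(x_s, v_s):.3f}")   # stays near 0.5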

Failure #4 — PCA and Embedding Drift

Principal Component Analysis (PCA) is mathematically simple—yet in production it sometimes produces inconsistent components across runs. Embeddings drift. Recommendations fluctuate.

This usually happens because the underlying SVD is being performed on large, nearly collinear matrices whose singular values cluster close together. When two singular values are nearly equal, the corresponding singular vectors are barely determined at all, so a tiny change in floating-point ordering can rotate them by a large angle. Mathematically equivalent outcomes become computationally incompatible.
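
A minimal sketch of the effect in Python with NumPy (a contrived matrix, built only to make the degeneracy obvious): the top two singular values are separated by about 1e-13, so a perturbation on the scale of accumulated round-off is enough to rotate the leading singular vectors, the same vectors PCA would report as its principal components.

    import numpy as np

    rng = np.random.default_rng(0)

    # Build a 50x50 matrix whose top two singular values are nearly tied.
    n = 50
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.linspace(0.5, 0.1, n)
    s[0], s[1] = 1.0, 1.0 - 1e-13          # an almost degenerate pair
    A = (U * s) @ V.T

    # The same matrix plus noise on the scale of accumulated round-off,
    # standing in for a different summation order or a different library.
    A_perturbed = A + 1e-14 * rng.standard_normal((n, n))

    _, _, Vt1 = np.linalg.svd(A)
    _, _, Vt2 = np.linalg.svd(A_perturbed)

    # Compare the leading right singular vectors of the two runs
    # (absolute value ignores harmless sign flips).
    overlap = abs(Vt1[0] @ Vt2[0])
    print(f"|cos(angle)| between leading components: {overlap:.3f}")
    # Often far below 1.0: mathematically "the same" decomposition,
    # numerically a different basis for the nearly degenerate subspace.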

Failure #5 — The Hidden Enemy in Vector Search

Modern vector search systems depend heavily on numerical linear algebra—distance computations, normalization, orthogonal projections, and SVD-based compression. But when vectors are high-dimensional and nearly collinear, distance metrics become unstable. Two vectors that should be distinct collapse toward each other. Nearest-neighbor decisions flip unexpectedly.

Search quality suffers, and the cause often goes unnoticed because the system “still runs.” But beneath the surface, numerical precision has quietly changed the meaning of the embeddings.
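
Here is one concrete mechanism, sketched in Python with NumPy (an illustrative example; real vector databases differ in the details). The squared distance between two nearly collinear unit vectors can be computed by subtracting first, or by the common batched shortcut ||a||^2 + ||b||^2 - 2*a.b. In float32 the shortcut subtracts two numbers close to 2.0, and most of the tiny true distance cancels away.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two nearly collinear unit vectors in a typical embedding dimension.
    d = 768
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)
    b = a + 1e-4 * rng.standard_normal(d)
    b /= np.linalg.norm(b)
    a32, b32 = a.astype(np.float32), b.astype(np.float32)

    # Route 1: subtract first, then square the norm of the difference.
    direct = np.linalg.norm(a32 - b32) ** 2

    # Route 2: the expansion ||a||^2 + ||b||^2 - 2*a.b, common in batched
    # nearest-neighbor kernels. It subtracts numbers near 2.0, so most of
    # the tiny true distance is lost to cancellation.
    expanded = a32 @ a32 + b32 @ b32 - 2 * (a32 @ b32)

    reference = np.linalg.norm(a - b) ** 2     # float64, subtract-first
    print(f"float64 reference: {reference:.3e}")
    print(f"float32 direct   : {direct:.3e}")
    print(f"float32 expanded : {expanded:.3e}")  # large relative error; can even go negative

When many candidates sit at distances smaller than that cancellation error, their ranking is effectively decided by rounding noise, which is how "nearest" neighbors flip.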

Failure #6 — The AI Model That Only Broke in Production

Some LLM-based applications work flawlessly on a developer’s machine, only to behave unpredictably in production. Investigations reveal that the same matrix operations were being executed on slightly different hardware or using different BLAS/LAPACK implementations with different precision trade-offs.

The model wasn’t wrong. The computational ecosystem changed.
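
A small sketch in Python with NumPy of the underlying mechanism (illustrative only; it merely mimics what different BLAS kernels, thread counts, or GPUs do): floating-point addition is not associative, so the "same" reduction performed in a different order returns a slightly different result.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    # Three mathematically identical sums, three accumulation orders.
    chunked = np.float32(0.0)
    for chunk in np.array_split(x, 1000):      # blocked, left-to-right
        chunked += chunk.sum(dtype=np.float32)

    pairwise = x.sum(dtype=np.float32)         # NumPy's pairwise order
    reversed_order = x[::-1].sum(dtype=np.float32)

    print(chunked, pairwise, reversed_order)   # typically differ in the last bits
    # One matrix multiply performs millions of such reductions. Change the
    # library or the hardware and the low-order bits change everywhere;
    # chain enough of them and the divergence can become visible.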

What These Failures Have in Common

Across all of these examples, the pattern is identical:

  • The math looked correct.
  • The code looked correct.
  • The system still failed.

Because between math and code lies a third layer: numerical computation—a world with its own rules.

Why This Matters for the Rest of This Book

This final section of Chapter 1 is meant to serve as a wake-up call. Real systems fail not because engineers don’t know linear algebra, but because they don’t know how it behaves inside a machine.

That is what the rest of this book will cover—how numbers are represented, how operations really execute, how precision is lost, how algorithms actually survive (or fail) inside computing hardware.

To understand these failures at a deeper level, we need to look directly at the environment where all numerical algorithms live: the machine itself.

Chapter 2 — The Computational Model begins with floating-point numbers, rounding behavior, ULPs (Units in the Last Place), and the practical realities of how numbers are stored and manipulated inside modern hardware. Once you see the limits of computation clearly, every algorithm—LU, QR, SVD, iterative methods, even neural network training—will suddenly make much more sense.

2025-09-06

Shohei Shimoda

Here I have organized and written down what I have learned and know.