7.3 Least Squares Problems
Least squares problems appear everywhere in applied mathematics, data science, and machine learning. Any time we try to fit a model, estimate parameters, denoise signals, or approximate relationships, we are (knowingly or unknowingly) solving a least squares problem.
Formally, given a matrix A and a vector b, we wish to find x that minimizes:
‖Ax − b‖₂
This is the geometric version of the problem. Algebraically, it expands to minimizing the sum of squared residuals. Practically, it means: find the best possible solution when the system has no exact solution.
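As a concrete illustration, here is a minimal sketch in NumPy (the line-fit data is made up for this example): an overdetermined system A x ≈ b for fitting y ≈ c0 + c1·t, solved by minimizing the residual norm.

```python
# Minimal sketch: fit a line y ≈ c0 + c1*t to noisy points.
# The system A x ≈ b has more equations than unknowns, so there is no exact
# solution; we minimize ||Ax - b||_2 instead.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
b = 2.0 + 3.0 * t + 0.05 * rng.standard_normal(t.size)  # noisy observations

A = np.column_stack([np.ones_like(t), t])        # 20x2 design matrix
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

print("estimated [c0, c1]:", x)
print("residual norm ||Ax - b||_2:", np.linalg.norm(A @ x - b))
```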
Least squares is the mathematical backbone of:
- Linear regression
- Signal reconstruction
- Time-series smoothing
- Computer vision calibration
- Machine learning (as an inner subproblem)
- Overdetermined physical measurement systems
Because of its universality, the way we solve least squares matters enormously. A “good-enough” method in theory can become dangerously unstable in floating-point arithmetic.
The geometry of least squares
Least squares has an incredibly clean geometric interpretation:
Solve Ax ≈ b by projecting b onto the column space of A.
Imagine the columns of A as vectors forming a subspace. The vector b lives somewhere in the ambient space. The goal is to find the point in the column space that is closest to b. That point is A x̂, where x̂ is the least squares solution.
This geometric viewpoint immediately reveals something important:
Solving least squares is fundamentally about orthogonality.
If we can build an orthonormal basis for the column space of A, everything becomes easy and numerically stable.
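To make the orthogonality concrete, here is a small sketch (NumPy, with random data assumed for illustration): the residual of the least squares solution is orthogonal to every column of A, up to rounding error.

```python
# Sketch of the projection view: A @ x_hat is the orthogonal projection of b
# onto the column space of A, so the residual b - A @ x_hat is orthogonal to
# every column of A.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))
b = rng.standard_normal(50)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = b - A @ x_hat

# A^T (b - A x_hat) should vanish up to floating-point rounding.
print("||A^T r||:", np.linalg.norm(A.T @ residual))
```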
The normal equations — and their pitfalls
The most “textbook” approach to solving least squares is using the normal equations:
Aᵀ A x = Aᵀ b
This looks elegant, but it hides a dangerous problem.
The matrix Aᵀ A squares the condition number of A.
That means:
- Ill-conditioned systems become catastrophically ill-conditioned.
- Small errors in A or b become amplified.
- The system becomes unstable even for moderately sized datasets.
Moreover, forming Aᵀ A explicitly often introduces unnecessary rounding errors. Floating-point summation is not exact, especially for large-scale data.
So while the normal equations are mathematically valid, in practice they often lead to:
- Poor accuracy
- Unstable estimates
- Incorrect solutions under floating-point arithmetic
This is why numerical linear algebra textbooks, libraries, and ML frameworks strongly recommend against using them except for small, well-behaved systems.
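A small experiment makes the danger visible. The sketch below (NumPy, with an ill-conditioned Vandermonde matrix chosen as the test case for this example) shows cond(Aᵀ A) growing to roughly cond(A)², and the normal equations losing far more accuracy than an orthogonalization-based solver.

```python
# Sketch: polynomial-fitting design matrix (Vandermonde) is ill-conditioned.
# Solving A^T A x = A^T b squares the condition number and loses accuracy;
# an orthogonalization-based solver (lstsq) does much better.
import numpy as np

t = np.linspace(0.0, 1.0, 100)
A = np.vander(t, 10, increasing=True)        # 100x10, ill-conditioned
x_true = np.ones(10)
b = A @ x_true                               # exact right-hand side

print("cond(A)     :", np.linalg.cond(A))
print("cond(A^T A) :", np.linalg.cond(A.T @ A))   # roughly cond(A)**2

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # orthogonalization-based

print("normal equations error:", np.linalg.norm(x_normal - x_true))
print("lstsq error           :", np.linalg.norm(x_lstsq - x_true))
```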
QR decomposition to the rescue
If we decompose A = QR where Q is orthogonal and R is upper triangular, the least squares problem becomes:
‖QRx − b‖₂
Because Q preserves norms, this simplifies neatly to:
‖Rx − Qᵀ b‖₂
Now R is triangular, so solving the least squares system becomes:
R x = Qᵀ b
This is stable, efficient, and avoids forming Aᵀ A. This method is the foundation of robust least-squares solvers everywhere.
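A minimal sketch of this route (NumPy plus SciPy's solve_triangular, with random data assumed): factor A, apply Qᵀ to b, then back-substitute on the triangular system.

```python
# Sketch of the QR route: factor A = QR, form Q^T b, then solve the
# triangular system R x = Q^T b by back substitution.
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 5))
b = rng.standard_normal(200)

Q, R = np.linalg.qr(A)                 # reduced QR: Q is 200x5, R is 5x5
x = solve_triangular(R, Q.T @ b)       # back substitution on the 5x5 system

# Agrees with the library least squares solver up to rounding error.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print("difference from lstsq:", np.linalg.norm(x - x_ref))
```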
Two crucial advantages:
- Orthogonality preserves numerical accuracy
- The triangular system is easy to solve and well-behaved
Because QR (especially with column pivoting) copes with nearly dependent columns far better than the normal equations do, it serves as the standard approach in ML, statistics, and scientific computing.
SVD: the ultimate least-squares solver
Although QR is excellent, the most robust solver for least squares is the Singular Value Decomposition (SVD).
SVD directly reveals:
- Rank deficiencies
- Near-linear dependence
- Conditioning and numerical sensitivity
It computes the pseudoinverse in a stable way:
x = A⁺ b
This is the gold standard solver, especially for ill-conditioned systems. The trade-off is cost: SVD is slower than QR.
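Here is a sketch of the SVD route on a deliberately rank-deficient matrix (NumPy, with synthetic data assumed): the least squares solution is no longer unique, and the pseudoinverse returns the minimum-norm one.

```python
# Sketch: the third column duplicates the first, so A has rank 2 instead of 3.
# The SVD exposes the near-zero singular value, and the pseudoinverse
# (which truncates tiny singular values) picks the minimum-norm solution.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 3))
A[:, 2] = A[:, 0]                      # force rank deficiency
b = rng.standard_normal(30)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", s)           # one of them is ~0

x = np.linalg.pinv(A) @ b              # x = A^+ b
print("min-norm solution:", x)
print("residual norm:", np.linalg.norm(A @ x - b))
```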
Thus, numerical libraries follow a rule of thumb:
- QR for most problems
- SVD when accuracy outweighs performance
Least squares in the real world
Least squares lives inside countless real systems:
- Fitting linear models
- Kalman filters
- Camera calibration
- Gaussian processes
- Matrix factorization in recommender systems
- Gradient descent initialization
In machine learning, least squares even hides in the background of nonlinear models: linearized steps in optimizers, trust-region methods, and small subproblems in training often reduce to solving a least squares system.
Thus, accuracy in least squares does not just improve a single computation — it strengthens entire pipelines.
Where QR enters the picture
Whether through classical Gram–Schmidt (rarely used in practice), modified Gram–Schmidt (better), or Householder reflections (preferred), QR decomposition serves as the workhorse for stable least-squares solutions.
Which brings us to the key question:
Why do modern numerical libraries overwhelmingly choose QR as the default solver for least squares?
To answer that, we now step into a comparison that reveals the strengths — and limits — of QR.