7.3 Least Squares Problems

Least squares problems appear everywhere in applied mathematics, data science, and machine learning. Any time we try to fit a model, estimate parameters, denoise signals, or approximate relationships, we are (knowingly or unknowingly) solving a least squares problem.

Formally, given an m × n matrix A (typically with more rows than columns, so the system is overdetermined) and a vector b, we wish to find the x that minimizes:

‖Ax − b‖₂

This is the geometric version of the problem. Algebraically, it expands to minimizing the sum of squared residuals. Practically, it means: find the best possible solution when the system has no exact solution.
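
As a minimal illustration (the numbers here are made up), NumPy's built-in routine solves exactly this minimization:

    import numpy as np

    # A hypothetical overdetermined system: 5 equations, 2 unknowns.
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0],
                  [1.0, 5.0]])
    b = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

    # np.linalg.lstsq minimizes ‖Ax − b‖₂ and also reports the squared residual
    # norm, the numerical rank of A, and the singular values of A.
    x_hat, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)
    print(x_hat)  # best-fit intercept and slope for the line y ≈ x₀ + x₁·t

No x satisfies all five equations exactly; lstsq returns the x̂ with the smallest residual norm.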

Least squares is the mathematical backbone of:

  • Linear regression
  • Signal reconstruction
  • Time-series smoothing
  • Computer vision calibration
  • Machine learning (as an inner subproblem)
  • Overdetermined physical measurement systems

Because of its universality, the way we solve least squares matters enormously. A method that looks good enough on paper can become dangerously unstable in floating-point arithmetic.


The geometry of least squares

Least squares has an incredibly clean geometric interpretation:

Solve Ax ≈ b by projecting b onto the column space of A.

Imagine the columns of A as vectors spanning a subspace. The vector b lives somewhere in the ambient space. The goal is to find the point in the column space that is closest to b. That point is A x̂, where x̂ is the least squares solution.

This geometric viewpoint immediately reveals something important:

Solving least squares is fundamentally about orthogonality.

If we can build an orthonormal basis for the column space of A, everything becomes easy and numerically stable.
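
A quick numerical check of this picture, on an arbitrary random example: the residual b − A x̂ should be orthogonal to every column of A.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 3))   # arbitrary full-rank example
    b = rng.standard_normal(6)

    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]
    residual = b - A @ x_hat

    # The residual is orthogonal to the column space: Aᵀ(b − A x̂) ≈ 0.
    print(np.allclose(A.T @ residual, 0.0, atol=1e-12))  # True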


The normal equations — and their pitfalls

The most “textbook” approach to solving least squares is using the normal equations:

Aᵀ A x = Aᵀ b

This looks elegant, but it hides a dangerous problem.

The matrix Aᵀ A squares the condition number of A.

That means:

  • Ill-conditioned systems become catastrophically ill-conditioned.
  • Small errors in A or b become amplified.
  • The computed solution can lose accuracy even for moderately ill-conditioned datasets.

Moreover, forming Aᵀ A explicitly often introduces unnecessary rounding errors. Floating-point summation is not exact, especially for large-scale data.
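
The squaring effect is easy to observe; the Vandermonde matrix below is just one convenient, made-up source of ill-conditioning:

    import numpy as np

    # A Vandermonde-style design matrix (polynomial fitting) is a classic
    # example of an ill-conditioned least squares problem.
    t = np.linspace(0.0, 1.0, 20)
    A = np.vander(t, 6)                    # 20 × 6

    kappa_A = np.linalg.cond(A)
    kappa_AtA = np.linalg.cond(A.T @ A)
    print(kappa_A, kappa_AtA)              # kappa_AtA ≈ kappa_A ** 2

Once κ(A)² approaches 1/ε ≈ 10¹⁶ in double precision, the normal equations can lose essentially all accurate digits, even when A itself is still well within reach of a stable solver.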

So while the normal equations are mathematically valid, in practice they often lead to:

  • Poor accuracy
  • Unstable estimates
  • Incorrect solutions under floating-point arithmetic

This is why numerical linear algebra textbooks, libraries, and ML frameworks strongly recommend against using them except for small, well-behaved systems.


QR decomposition to the rescue

If we decompose A = QR where Q is orthogonal and R is upper triangular, the least squares problem becomes:

‖QRx − b‖₂

Because Q preserves norms, this simplifies neatly to:

‖Rx − Qᵀ b‖₂

Now R is triangular, so solving the least squares system becomes:

R x = Qᵀ b

This is stable, efficient, and avoids forming Aᵀ A. This method is the foundation of robust least-squares solvers everywhere.
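
A minimal sketch of this pipeline, assuming A has full column rank and using NumPy's reduced (thin) QR (the helper name lstsq_qr is just for illustration):

    import numpy as np
    from scipy.linalg import solve_triangular

    def lstsq_qr(A, b):
        """Solve min ‖Ax − b‖₂ via a thin QR factorization (full-rank A assumed)."""
        Q, R = np.linalg.qr(A)               # reduced QR: Q is m×n, R is n×n
        return solve_triangular(R, Q.T @ b)  # back substitution on R x = Qᵀ b

    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 4))
    b = rng.standard_normal(100)

    print(np.allclose(lstsq_qr(A, b), np.linalg.lstsq(A, b, rcond=None)[0]))  # True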

Two crucial advantages:

  • Orthogonality preserves numerical accuracy
  • The triangular system is easy to solve and well-behaved

Because QR (especially with column pivoting) handles nearly dependent or rank-deficient columns far more gracefully than the normal equations, it serves as the standard approach in ML, statistics, and scientific computing.
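
For example, SciPy's column-pivoted QR exposes near-dependence through the diagonal of R; the synthetic data below is only for illustration:

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(2)
    A = rng.standard_normal((50, 4))
    # Append a column that is almost a copy of the first one.
    A = np.column_stack([A, A[:, 0] + 1e-10 * rng.standard_normal(50)])

    # With column pivoting, the magnitudes on the diagonal of R are non-increasing,
    # so a tiny trailing entry flags the near-dependence.
    Q, R, piv = qr(A, pivoting=True, mode='economic')
    print(np.abs(np.diag(R)))  # the last entry is tiny, suggesting numerical rank 4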


SVD: the ultimate least-squares solver

Although QR is excellent, the most robust solver for least squares is the Singular Value Decomposition (SVD).

SVD directly reveals:

  • Rank deficiencies
  • Near-linear dependence
  • Conditioning and numerical sensitivity

Writing A = U Σ Vᵀ, it computes the pseudoinverse A⁺ = V Σ⁺ Uᵀ in a stable way, truncating negligible singular values, and yields:

x = A⁺ b

This is the gold standard solver, especially for ill-conditioned systems. The trade-off is cost: SVD is slower than QR.
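
A sketch of the SVD route, with the small-singular-value cutoff that makes it robust; the helper and its tolerance are illustrative choices, not a library API:

    import numpy as np

    def lstsq_svd(A, b, rcond=None):
        """Minimum-norm least squares solution x = A⁺ b via the SVD."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        if rcond is None:
            rcond = max(A.shape) * np.finfo(A.dtype).eps
        keep = s > rcond * s[0]                 # truncate negligible singular values
        return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

    rng = np.random.default_rng(3)
    A = rng.standard_normal((30, 5))
    b = rng.standard_normal(30)
    print(np.allclose(lstsq_svd(A, b), np.linalg.pinv(A) @ b))  # True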

Thus, numerical libraries follow a rule of thumb:

  • QR for most problems
  • SVD when accuracy outweighs performance


Least squares in the real world

Least squares lives inside countless real systems:

  • Fitting linear models
  • Kalman filters
  • Camera calibration
  • Gaussian processes
  • Matrix factorization in recommender systems
  • Gradient descent initialization

In machine learning, least squares even hides in the background of nonlinear models: linearized steps in optimizers, trust-region methods, and small subproblems in training often reduce to solving a least squares system.
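
As a toy example of such an inner subproblem (model and data invented for illustration), each Gauss–Newton iteration for a nonlinear fit solves a small linear least squares system:

    import numpy as np

    # Toy nonlinear model y ≈ exp(a·t); fit the single parameter a.
    t = np.linspace(0.0, 1.0, 20)
    y = np.exp(0.7 * t)

    def residual(a):
        return np.exp(a * t) - y

    def jacobian(a):
        return (t * np.exp(a * t)).reshape(-1, 1)   # ∂residual/∂a as a 20×1 matrix

    a = 0.0
    for _ in range(10):
        # Each Gauss–Newton step minimizes ‖J·d + r‖₂, a linear least squares problem.
        d = np.linalg.lstsq(jacobian(a), -residual(a), rcond=None)[0]
        a += d[0]
    print(a)  # ≈ 0.7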

Thus, accuracy in least squares does not just improve a single computation — it strengthens entire pipelines.


Where QR enters the picture

Whether through classical Gram–Schmidt (rarely used in practice), modified Gram–Schmidt (better), or Householder reflections (the usual choice), QR decomposition serves as the workhorse for stable least-squares solutions.

Which brings us to the key question:

Why do modern numerical libraries overwhelmingly choose QR as the default solver for least squares?

To answer that, we now step into a comparison that reveals the strengths — and limits — of QR.

2025-10-04

Shohei Shimoda

I have organized and written up here what I have learned and know.