6.2 Memory Advantages

When people first learn Cholesky decomposition, the part that stands out is usually its elegance: a symmetric positive definite matrix A becomes A = LLᵀ, where L is lower triangular. But as you build real numerical systems, something much more practical emerges—something that has little to do with beauty and everything to do with the limits of real hardware:

Cholesky decomposition needs roughly half the memory footprint of LU decomposition.

That simple fact is not an academic curiosity. It is one of the major reasons Cholesky is used throughout machine learning, physics, robotics, econometrics, and large-scale statistics. Memory is not an infinite abstraction—it is a physical constraint that shapes algorithmic design. And in the modern era of large models and massive datasets, memory is often the first resource to run out.


Why Half the Memory Matters

Suppose you have an n × n matrix. A dense LU decomposition stores:

  • The L matrix (full lower triangular with unit diagonal)
  • The U matrix (full upper triangular)
  • Potentially pivot information

If L and U are kept as separate dense matrices, the factors together approach 2n² entries. Even the standard in-place implementation, which packs L and U into a single n × n array, still needs the full n² entries of the original matrix, plus pivot bookkeeping. That is an O(n²) memory cost with no structural savings.

By contrast, Cholesky stores:

  • Only one triangular factor (either lower or upper), about n(n+1)/2 entries
  • Nothing else: the other factor is simply its transpose, so the single triangle implicitly represents the entire decomposition

This cuts memory usage nearly in half. For a 10,000 × 10,000 matrix in double precision (8 bytes per entry), the difference is dramatic:

  • LU → roughly 800 MB
  • Cholesky → roughly 400 MB

And this estimate leaves out temporary buffers, workspace, and alignment overhead that grow with problem size.
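
As a quick sanity check, the gap follows directly from counting entries. Here is a minimal back-of-the-envelope sketch in Python (it assumes double precision and ignores the workspace and overhead mentioned above):

  # Storage comparison for an n x n matrix, double precision (8 bytes per entry).
  n = 10_000
  bytes_per_entry = 8

  lu_entries = n * n                    # in-place LU: L and U packed into one n x n array
  cholesky_entries = n * (n + 1) // 2   # packed Cholesky: one triangle only

  print(f"LU:       {lu_entries * bytes_per_entry / 1e6:.0f} MB")        # ~800 MB
  print(f"Cholesky: {cholesky_entries * bytes_per_entry / 1e6:.0f} MB")  # ~400 MB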

On a modern GPU, this difference can determine whether your computation fits in memory at all. On a CPU, it can decide whether your decomposition runs in cache or spills into slower memory tiers. And in distributed systems, it may change how many nodes your job needs.


Memory Locality: The Hidden Accelerator

It is not just about raw memory footprint—memory locality plays a critical role. Cholesky’s triangular access pattern tends to fit more easily into:

  • L1/L2 CPU caches
  • GPU shared memory
  • Vectorized memory paths

That means:

  • fewer cache misses
  • fewer memory stalls
  • more contiguous memory operations
  • better BLAS-3 performance (large matrix-matrix ops)

As a result, even on hardware where memory capacity is not a problem, Cholesky typically beats LU comfortably: it performs roughly half the floating-point work (about n³/3 versus 2n³/3 operations), and better locality can stretch the observed gap to several times faster in practice.

This is one reason libraries like NumPy, SciPy, PyTorch, JAX, TensorFlow, and LAPACK treat Cholesky as a first-class operation: it aligns naturally with the strengths of modern hardware.
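
As an illustration, here is a minimal benchmark sketch using SciPy (the matrix is made SPD by construction; absolute timings depend heavily on hardware and BLAS build, so treat any numbers as indicative only):

  import time
  import numpy as np
  from scipy.linalg import cholesky, lu_factor

  n = 4000
  rng = np.random.default_rng(0)
  M = rng.standard_normal((n, n))
  A = M @ M.T + n * np.eye(n)     # symmetric positive definite by construction

  t0 = time.perf_counter()
  L = cholesky(A, lower=True)     # one triangular factor, no pivoting
  t1 = time.perf_counter()
  lu, piv = lu_factor(A)          # LU with partial pivoting
  t2 = time.perf_counter()

  print(f"Cholesky: {t1 - t0:.3f} s   LU: {t2 - t1:.3f} s")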


Why Symmetry Makes Memory Efficiency Possible

You might wonder: why can’t we simply optimize LU decomposition the same way?

The answer lies in symmetry.

An arbitrary matrix offers no symmetry across the diagonal to exploit, so LU must keep every element needed to reconstruct A. But an SPD matrix is symmetric:

Aᵢⱼ = Aⱼᵢ

That means:

  • the upper triangle contains no new information
  • storing the lower triangle alone is sufficient
  • the upper factor of the decomposition is simply the transpose of the lower one

This structure is what makes Cholesky both possible and efficient.
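
A tiny sketch makes this concrete (using scipy.linalg.cholesky on a matrix that is SPD by construction):

  import numpy as np
  from scipy.linalg import cholesky

  rng = np.random.default_rng(1)
  M = rng.standard_normal((5, 5))
  A = M @ M.T + 5 * np.eye(5)      # SPD by construction, so A[i, j] == A[j, i]

  L = cholesky(A, lower=True)      # the single stored factor
  print(np.allclose(L @ L.T, A))   # True: the upper factor is just the transpose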

In numerical computation, structure is leverage. The more structure you can exploit, the more efficient the algorithm becomes.


Memory Efficiency = Speed

In large computational workloads, memory access—not CPU arithmetic—is often the bottleneck. Modern CPUs and GPUs can perform billions or even trillions of floating-point operations per second, but they cannot always feed data to the compute units fast enough. This mismatch is called the “memory wall”.

By halving memory movement, Cholesky:

  • reduces bandwidth pressure
  • reduces cache thrashing
  • reduces PCIe and NVLink traffic
  • reduces time spent waiting on memory

Cholesky also performs fewer arithmetic operations than LU, and on SPD matrices it is numerically stable without pivoting, so there are fewer opportunities for rounding error to accumulate. As a result, Cholesky tends to be:

  • faster
  • more stable
  • more numerically predictable

This makes a noticeable difference in iterative algorithms, where the decomposition may be used repeatedly—sometimes thousands of times per training epoch.
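
A common pattern looks like the sketch below (using SciPy's cho_factor/cho_solve; the sizes and loop count are arbitrary placeholders): factor once, then reuse the factor across many solves.

  import numpy as np
  from scipy.linalg import cho_factor, cho_solve

  n = 500
  rng = np.random.default_rng(2)
  M = rng.standard_normal((n, n))
  A = M @ M.T + n * np.eye(n)      # SPD system matrix

  factor = cho_factor(A)           # factor once: O(n^3) work, one triangle of storage
  for _ in range(1000):            # inner loop of an iterative algorithm
      b = rng.standard_normal(n)
      x = cho_solve(factor, b)     # each solve reuses the factor: only O(n^2) work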


Scaling to High Dimensions

The higher the dimension, the more dramatic the impact. In machine learning, covariance matrices, kernel matrices, or Hessians can easily reach sizes like:

  • n = 10,000 → 100 million entries
  • n = 50,000 → 2.5 billion entries
  • n = 100,000 → 10 billion entries

At this scale:

  • dense LU decomposition becomes impractical: at n = 100,000 the matrix alone is 80 GB in double precision, before factors or workspace
  • memory blow-up slows the computation or halts it outright
  • distributed compute quickly becomes mandatory

Cholesky, by contrast, stores only one triangle and can factor it in place, so its footprint stays manageable for noticeably larger problems.

In practice, Cholesky often determines the upper limit of what “fits” on a single GPU or machine. Many ML algorithms rely on it specifically because it brings problems back within the range of feasible computation.
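
As a rough way to see where that limit sits, here is a small sketch (the helper is illustrative, not from any particular library) that estimates factor footprints for the sizes listed above:

  def factor_footprint_gb(n, bytes_per_entry=8, packed_triangle=True):
      # Rough size of a dense factor for an n x n matrix, in gigabytes.
      entries = n * (n + 1) // 2 if packed_triangle else n * n
      return entries * bytes_per_entry / 1e9

  for n in (10_000, 50_000, 100_000):
      lu_gb = factor_footprint_gb(n, packed_triangle=False)   # full n x n storage
      chol_gb = factor_footprint_gb(n, packed_triangle=True)  # one triangle only
      print(f"n = {n:>7,}: LU ~{lu_gb:5.1f} GB   Cholesky ~{chol_gb:5.1f} GB")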


Simplicity as an Advantage

Another subtle advantage is that Cholesky’s code path is short. Because SPD matrices need no pivoting, it has fewer branches, fewer conditional checks, and fewer operations than LU. This is why:

  • numerical libraries can optimize it more aggressively
  • GPU kernels can be fused more easily
  • automatic differentiation is cleaner and cheaper
  • backprop through Cholesky tends to be more stable than through LU (see the sketch below)

When training large models, these details accumulate into significant real-world speedups.
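
As a sketch of the autodiff point above (assuming PyTorch; the loss is just an illustrative log-determinant term, not a real training objective):

  import torch

  M = torch.randn(50, 50, dtype=torch.float64)
  A = M @ M.T + 50 * torch.eye(50, dtype=torch.float64)   # SPD input
  A.requires_grad_(True)

  L = torch.linalg.cholesky(A)       # differentiable factorization
  loss = L.diagonal().log().sum()    # equals 0.5 * logdet(A), a common term in ML losses
  loss.backward()                    # gradients flow back through the Cholesky factor
  print(A.grad.shape)                # torch.Size([50, 50])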


Memory Efficiency Shapes Algorithm Choice

Many algorithms that could use LU decomposition instead choose Cholesky simply because of memory limitations. For example:

  • Gaussian process regression (sketched after this list)
  • Kalman filtering and smoothing
  • Kernel methods (SVMs, kernel ridge regression)
  • Graph Laplacian operations
  • Statistical covariance updates
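
To make the first item concrete, here is a minimal, noise-free Gaussian-process regression sketch (the kernel, data, and jitter value are illustrative assumptions):

  import numpy as np
  from scipy.linalg import cho_factor, cho_solve

  def rbf_kernel(X1, X2, lengthscale=1.0):
      # Squared-exponential kernel; its Gram matrix is (near-)SPD.
      d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
      return np.exp(-0.5 * d2 / lengthscale**2)

  rng = np.random.default_rng(3)
  X = rng.uniform(-3, 3, size=(200, 1))
  y = np.sin(X[:, 0])

  K = rbf_kernel(X, X) + 1e-6 * np.eye(len(X))   # jitter keeps K positive definite
  factor = cho_factor(K)
  alpha = cho_solve(factor, y)                   # solves K @ alpha = y without forming K^-1

  X_test = np.linspace(-3, 3, 50)[:, None]
  mean = rbf_kernel(X_test, X) @ alpha           # GP posterior mean at the test points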

In many systems, the choice is not “LU vs Cholesky”—the choice is:

Use Cholesky or the computation is impossible.

Understanding these memory dynamics is crucial to building reliable numerical systems.


Where We Go Next

Now that we’ve seen why Cholesky’s memory characteristics are so powerful, we can explore something even more important: how these properties make Cholesky indispensable across real applications.

Next: 6.3 Applications in ML, statistics, kernels

Here we’ll look at how Cholesky powers the algorithms behind Gaussian processes, kernel machines, Bayesian methods, probabilistic modeling, and large-scale numerical systems.

2025-09-29

Shohei Shimoda

Here I have organized and written up what I have learned and know.