2.4 Vector and Matrix Storage in Memory
When we think about vectors and matrices, we usually imagine them as neat, two-dimensional grids. Rows and columns. Boxes with numbers. A simple abstraction that helps us reason about algorithms.
But computers don’t see grids. They only see linear streams of bytes. Every matrix you have ever used—whether in NumPy, PyTorch, TensorFlow, MATLAB, LAPACK, or a custom C++ pipeline—is secretly just a long one-dimensional array.
How that long array is laid out in memory shapes everything: speed, cache behavior, the efficiency of BLAS routines, and even numerical stability. Understanding memory layout is the final piece of the computational model, and it lets us reason about performance and correctness at the same time.
Row-major vs Column-major
The first question any numerical system must answer is:
How do we map a 2D matrix to a 1D block of memory?
There are two dominant conventions:
- Row-major (C, C++, Python/NumPy by default, PyTorch)
- Column-major (Fortran, MATLAB, Julia, NumPy with order='F')
In row-major, rows are stored contiguously:
[ a11 a12 a13 | a21 a22 a23 | a31 a32 a33 ]
In column-major, columns are stored contiguously:
[ a11 a21 a31 | a12 a22 a32 | a13 a23 a33 ]
Mathematically, they represent the same object. Computationally, they behave very differently.
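A minimal NumPy sketch makes the difference concrete; the matrix values are arbitrary, and the only point is how the same matrix flattens under each convention:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Row-major (C order): rows sit next to each other in memory.
print(A.ravel(order='C'))   # [1 2 3 4 5 6 7 8 9]

# Column-major (Fortran order): columns sit next to each other.
print(A.ravel(order='F'))   # [1 4 7 2 5 8 3 6 9]

# The layout of the underlying buffer is recorded in the array flags.
print(np.asarray(A, order='C').flags['C_CONTIGUOUS'])  # True
print(np.asarray(A, order='F').flags['F_CONTIGUOUS'])  # True
```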
Why Layout Matters: Cache Locality
Modern CPUs rely heavily on cache locality—keeping nearby data close to the processor to avoid slow main memory access.
When a program accesses a single value, the CPU loads an entire cache line (typically 32–128 bytes). If your algorithm walks through memory in the layout’s natural order, it enjoys:
- fewer cache misses
- higher throughput
- faster BLAS routines
But if your algorithm reads memory in the “wrong direction” for the layout, you lose all those benefits. Walking rows in a column-major layout, or columns in a row-major layout, can easily make a program 10× slower.
This is one of the reasons algorithms that are theoretically equivalent can behave very differently in practice.
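You can observe this without leaving Python. The sketch below sums a large row-major NumPy array once row by row (with the grain of memory) and once column by column (against it); the matrix size and the exact speed ratio are illustrative and depend on your hardware:

```python
import numpy as np
import time

# A large matrix in NumPy's default row-major (C) order: rows are contiguous.
A = np.random.rand(5000, 5000)

def sum_by_rows(M):
    # Walks memory in its natural order: every row slice is contiguous.
    return sum(M[i, :].sum() for i in range(M.shape[0]))

def sum_by_cols(M):
    # Walks memory against the grain: every column slice is strided.
    return sum(M[:, j].sum() for j in range(M.shape[1]))

for fn in (sum_by_rows, sum_by_cols):
    start = time.perf_counter()
    fn(A)
    print(fn.__name__, f"{time.perf_counter() - start:.3f} s")
```

On most machines the column-wise version is noticeably slower, even though both compute exactly the same result.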
Stride: The Hidden Parameter That Explains Everything
Every vector and matrix has a hidden property:
the stride — how many memory steps you must take to move by one unit in a dimension.
For a 2D matrix, there are two:
- row stride: bytes skipped to move down one row
- column stride: bytes skipped to move across one column
In row-major layout:
- row stride = number of columns
- column stride = 1
In column-major layout:
- row stride = 1
- column stride = number of rows
This matters because BLAS libraries rely on stride values to decide:
- which memory access pattern to use,
- whether vectorization is possible,
- whether tiled computation will be cache-friendly.
This is why optimized libraries often require contiguous matrices: they want predictable strides.
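NumPy exposes strides directly. Note that it reports them in bytes rather than in elements, so the element counts above get multiplied by the item size (8 bytes for float64):

```python
import numpy as np

A = np.zeros((3, 4))            # float64, row-major by default

# Moving down one row skips 4 columns * 8 bytes = 32 bytes;
# moving across one column skips 8 bytes.
print(A.strides)                # (32, 8)

# A Fortran-ordered copy flips the pattern: 8 bytes per row step,
# 3 rows * 8 bytes = 24 bytes per column step.
F = np.asfortranarray(A)
print(F.strides)                # (8, 24)

# Transposing swaps the strides without touching the data.
print(A.T.strides)              # (8, 32)
```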
Contiguous vs Non-contiguous Memory
Operations such as slicing, transposing, or selecting a submatrix rarely copy data. Instead, they adjust strides while pointing to the same underlying buffer.
This makes these operations fast—but it also means:
Your matrix might not actually be contiguous, even if it looks like one.
A transposed matrix is the classic example: its rows become columns, and columns become rows. This changes stride values dramatically.
Some libraries handle non-contiguous cases seamlessly. Others silently copy the matrix, leading to hidden performance hits.
This is why working efficiently with large data requires understanding when:
- an operation changes layout,
- a copy occurs,
- a BLAS call will be fast or slow.
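A quick way to build this intuition is to ask NumPy directly whether an array is contiguous and whether it owns its data; the sizes here are arbitrary:

```python
import numpy as np

A = np.random.rand(1000, 1000)
print(A.flags['C_CONTIGUOUS'])      # True: freshly allocated, row-major

# The transpose is a view with swapped strides, not a new buffer.
B = A.T
print(B.flags['C_CONTIGUOUS'])      # False
print(B.base is A)                  # True: same underlying memory

# Slicing every other column also breaks contiguity.
C = A[:, ::2]
print(C.flags['C_CONTIGUOUS'])      # False

# Forcing a contiguous layout makes the copy explicit instead of hidden.
D = np.ascontiguousarray(B)
print(D.flags['C_CONTIGUOUS'])      # True
```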
Block Storage and Tiling
High-performance libraries (MKL, OpenBLAS, cuBLAS, LAPACK) rarely operate on entire rows or columns. Instead, they use tiled algorithms, breaking matrices into smaller blocks that fit into cache.
This technique:
- reduces cache misses,
- increases parallelism,
- matches CPU and GPU architecture.
Understanding how matrices lie in memory helps explain why tiled algorithms are stable and fast—and why naïve implementations are slow.
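The following is a deliberately simplified blocked matrix multiply in NumPy, just to show the shape of a tiled algorithm; real BLAS kernels implement the same idea in hand-tuned C and assembly, and the block size of 256 is an arbitrary choice here:

```python
import numpy as np

def blocked_matmul(A, B, block=256):
    # Illustrative tiled multiply: C = A @ B, computed tile by tile so that
    # each partial product touches only three small blocks at a time.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```

Keeping the three active tiles small enough to fit in cache is the entire trick; the arithmetic is identical to the naïve triple loop.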
Alignment, Padding, and Memory Boundaries
Many CPUs require data to be aligned to specific boundaries (8 bytes, 16 bytes, 32 bytes). Optimized numerical libraries ensure alignment so SIMD instructions (SSE, AVX, AVX-512) can process multiple values in one operation.
This is why:
- arrays are often padded,
- BLAS routines handle odd-sized matrices carefully,
- GPU libraries tend to enforce strict alignment rules.
Alignment is invisible in high-level languages, but it is fundamental to performance and stability.
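You can still peek at an array's data pointer from Python; this is only a diagnostic sketch, since NumPy guarantees alignment suitable for the element type but not any particular SIMD boundary:

```python
import numpy as np

a = np.zeros(1000)                 # float64

# Address of the first element, checked against common SIMD boundaries
# (16 bytes for SSE, 32 for AVX, 64 for AVX-512).
addr = a.ctypes.data
for boundary in (16, 32, 64):
    print(f"{boundary}-byte aligned:", addr % boundary == 0)

# NumPy's own flag only asserts alignment for the element type itself.
print(a.flags['ALIGNED'])          # True
```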
Why Memory Layout Influences Numerical Accuracy
At first glance, layout seems like a performance issue—not a numerical one. But the two are deeply connected.
Memory layout influences how a library blocks, vectorizes, and parallelizes a computation, and that in turn determines the order in which floating-point operations are applied. Because floating-point addition is not associative, that order matters. A layout-driven change in access pattern can therefore lead to:
- slightly different rounding in accumulated results,
- results that differ between libraries or between runs,
- loss of determinism in parallel reductions.
In large-scale machine learning, the order in which operations occur can subtly influence results. Memory layout affects that order.
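A tiny experiment shows how sensitive results are to that order; the data is random, and exactly which digits differ will vary between runs and machines:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# The same mathematical sum, accumulated in three different orders.
s_forward  = np.sum(x)                                       # over the buffer as stored
s_reversed = np.sum(x[::-1])                                 # same values, reversed order
s_chunked  = sum(np.sum(c) for c in np.array_split(x, 8))    # a "parallel-style" reduction

print(s_forward, s_reversed, s_chunked)
# Typically three slightly different float32 values.
```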
This is yet another example of the core message of this book:
numerical reliability depends as much on engineering details as on mathematics.
Memory Layout Shapes Algorithms
Once you understand how matrices are stored in memory, many “mysteries” become clear:
- Why multiplying by a transposed matrix often triggers a hidden copy.
- Why vectorized operations are sometimes 10× faster.
- Why some BLAS functions require column-major inputs.
- Why libraries disagree about default memory format.
- Why slicing can silently degrade performance.
Every algorithm we will study—from LU to QR to SVD—interacts heavily with layout. Memory determines iteration patterns, cache behavior, and overall speed.
The Road Ahead
With memory layout in place, we now have the full computational model: floating-point arithmetic, numerical limits, and memory layout. Together, they define the world in which numerical algorithms actually live.
Understanding how memory organizes vectors and matrices prepares us for the next chapter, where we stop looking at floating-point numbers and start examining the floating-point structures we build from them.
In the next chapter, we turn to the beating heart of numerical computation: vectors and matrices as they behave in practice—their norms, errors, conditioning, and the difference between algebraic definitions and implemented reality.
Let's step forward to:
3.0 Vectors and Matrices in Practice.