2.4 Vector and Matrix Storage in Memory
When we think about vectors and matrices, we usually imagine them as neat, two-dimensional grids. Rows and columns. Boxes with numbers. A simple abstraction that helps us reason about algorithms.
But computers don’t see grids. They only see linear streams of bytes. Every matrix you have ever used—whether in NumPy, PyTorch, TensorFlow, MATLAB, LAPACK, or a custom C++ pipeline—is secretly just a long one-dimensional array.
How that long array is laid out in memory shapes everything: speed, cache behavior, the efficiency of BLAS routines, and even numerical stability. Understanding memory layout is the final piece of the computational model, and it lets us reason about performance and correctness at the same time.
Row-major vs Column-major
The first question any numerical system must answer is:
How do we map a 2D matrix to a 1D block of memory?
There are two dominant conventions:
- Row-major (C, C++, Python/NumPy by default, PyTorch)
- Column-major (Fortran, MATLAB, Julia, NumPy with order='F')
In row-major, rows are stored contiguously:
[ a11 a12 a13 | a21 a22 a23 | a31 a32 a33 ]
In column-major, columns are stored contiguously:
[ a11 a21 a31 | a12 a22 a32 | a13 a23 a33 ]
Mathematically, they represent the same object. Computationally, they behave very differently.
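A minimal NumPy sketch makes the difference concrete; the matrix values are arbitrary, and the only point is how the same matrix flattens under each convention:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Row-major (C order): rows sit next to each other in memory.
print(A.ravel(order='C'))   # [1 2 3 4 5 6 7 8 9]

# Column-major (Fortran order): columns sit next to each other.
print(A.ravel(order='F'))   # [1 4 7 2 5 8 3 6 9]

# The layout of the underlying buffer is recorded in the array flags.
print(np.asarray(A, order='C').flags['C_CONTIGUOUS'])  # True
print(np.asarray(A, order='F').flags['F_CONTIGUOUS'])  # True
```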
Why Layout Matters: Cache Locality
Modern CPUs rely heavily on cache locality—keeping nearby data close to the processor to avoid slow main memory access.
When a program accesses a single value, the CPU loads an entire cache line (typically 32–128 bytes). If your algorithm walks through memory in the layout’s natural order, it enjoys:
- fewer cache misses
- higher throughput
- faster BLAS routines
But if your algorithm reads memory in the “wrong direction” for the layout, you lose all those benefits. Walking rows in a column-major layout, or columns in a row-major layout, can easily make a program 10× slower.
This is one of the reasons algorithms that are theoretically equivalent can behave very differently in practice.
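You can observe this without leaving Python. The sketch below sums a large row-major NumPy array once row by row (with the grain of memory) and once column by column (against it); the matrix size and the exact speed ratio are illustrative and depend on your hardware:

```python
import numpy as np
import time

# A large matrix in NumPy's default row-major (C) order: rows are contiguous.
A = np.random.rand(5000, 5000)

def sum_by_rows(M):
    # Walks memory in its natural order: every row slice is contiguous.
    return sum(M[i, :].sum() for i in range(M.shape[0]))

def sum_by_cols(M):
    # Walks memory against the grain: every column slice is strided.
    return sum(M[:, j].sum() for j in range(M.shape[1]))

for fn in (sum_by_rows, sum_by_cols):
    start = time.perf_counter()
    fn(A)
    print(fn.__name__, f"{time.perf_counter() - start:.3f} s")
```

On most machines the column-wise version is noticeably slower, even though both compute exactly the same result.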
Stride: The Hidden Parameter That Explains Everything
Every vector and matrix has a hidden property:
the stride — how many memory steps you must take to move by one unit in a dimension.
For a 2D matrix, there are two:
- row stride: bytes skipped to move down one row
- column stride: bytes skipped to move across one column
In row-major layout:
- row stride = number of columns
- column stride = 1
In column-major layout:
- row stride = 1
- column stride = number of rows
This matters because BLAS libraries rely on stride values to decide:
- which memory access pattern to use,
- whether vectorization is possible,
- whether tiled computation will be cache-friendly.
This is why optimized libraries often require contiguous matrices: they want predictable strides.
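NumPy exposes strides directly. Note that it reports them in bytes rather than in elements, so the element counts above get multiplied by the item size (8 bytes for float64):

```python
import numpy as np

A = np.zeros((3, 4))            # float64, row-major by default

# Moving down one row skips 4 columns * 8 bytes = 32 bytes;
# moving across one column skips 8 bytes.
print(A.strides)                # (32, 8)

# A Fortran-ordered copy flips the pattern: 8 bytes per row step,
# 3 rows * 8 bytes = 24 bytes per column step.
F = np.asfortranarray(A)
print(F.strides)                # (8, 24)

# Transposing swaps the strides without touching the data.
print(A.T.strides)              # (8, 32)
```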
Contiguous vs Non-contiguous Memory
Operations such as slicing, transposing, or selecting a submatrix rarely copy data. Instead, they adjust strides while pointing to the same underlying buffer.
This makes these operations fast—but it also means:
Your matrix might not actually be contiguous, even if it looks like one.
A transposed matrix is the classic example: its rows become columns, and columns become rows. This changes stride values dramatically.
Some libraries handle non-contiguous cases seamlessly. Others silently copy the matrix, leading to hidden performance hits.
This is why working efficiently with large data requires understanding when:
- an operation changes layout,
- a copy occurs,
- a BLAS call will be fast or slow.
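A quick way to build this intuition is to ask NumPy directly whether an array is contiguous and whether it owns its data; the sizes here are arbitrary:

```python
import numpy as np

A = np.random.rand(1000, 1000)
print(A.flags['C_CONTIGUOUS'])      # True: freshly allocated, row-major

# The transpose is a view with swapped strides, not a new buffer.
B = A.T
print(B.flags['C_CONTIGUOUS'])      # False
print(B.base is A)                  # True: same underlying memory

# Slicing every other column also breaks contiguity.
C = A[:, ::2]
print(C.flags['C_CONTIGUOUS'])      # False

# Forcing a contiguous layout makes the copy explicit instead of hidden.
D = np.ascontiguousarray(B)
print(D.flags['C_CONTIGUOUS'])      # True
```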
Block Storage and Tiling
High-performance libraries (MKL, OpenBLAS, cuBLAS, LAPACK) rarely operate on entire rows or columns. Instead, they use tiled algorithms, breaking matrices into smaller blocks that fit into cache.
This technique:
- reduces cache misses,
- increases parallelism,
- matches CPU and GPU architecture.
Understanding how matrices lie in memory helps explain why tiled algorithms are stable and fast—and why naïve implementations are slow.
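The following is a deliberately simplified blocked matrix multiply in NumPy, just to show the shape of a tiled algorithm; real BLAS kernels implement the same idea in hand-tuned C and assembly, and the block size of 256 is an arbitrary choice here:

```python
import numpy as np

def blocked_matmul(A, B, block=256):
    # Illustrative tiled multiply: C = A @ B, computed tile by tile so that
    # each partial product touches only three small blocks at a time.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```

Keeping the three active tiles small enough to fit in cache is the entire trick; the arithmetic is identical to the naïve triple loop.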
Alignment, Padding, and Memory Boundaries
Many CPUs require data to be aligned to specific boundaries (8 bytes, 16 bytes, 32 bytes). Optimized numerical libraries ensure alignment so SIMD instructions (SSE, AVX, AVX-512) can process multiple values in one operation.
This is why:
- arrays are often padded,
- BLAS routines handle odd-sized matrices carefully,
- GPU libraries tend to enforce strict alignment rules.
Alignment is invisible in high-level languages, but it is fundamental to performance and stability.
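You can still peek at an array's data pointer from Python; this is only a diagnostic sketch, since NumPy guarantees alignment suitable for the element type but not any particular SIMD boundary:

```python
import numpy as np

a = np.zeros(1000)                 # float64

# Address of the first element, checked against common SIMD boundaries
# (16 bytes for SSE, 32 for AVX, 64 for AVX-512).
addr = a.ctypes.data
for boundary in (16, 32, 64):
    print(f"{boundary}-byte aligned:", addr % boundary == 0)

# NumPy's own flag only asserts alignment for the element type itself.
print(a.flags['ALIGNED'])          # True
```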
Why Memory Layout Influences Numerical Accuracy
At first glance, layout seems like a performance issue—not a numerical one. But the two are deeply connected.
Memory layout influences how a library blocks, vectorizes, and parallelizes a computation, and that in turn determines the order in which floating-point operations are applied. Because floating-point addition is not associative, that order matters. A layout-driven change in access pattern can therefore lead to:
- slightly different rounding in accumulated results,
- results that differ between libraries or between runs,
- loss of determinism in parallel reductions.
In large-scale machine learning, the order in which operations occur can subtly influence results. Memory layout affects that order.
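A tiny experiment shows how sensitive results are to that order; the data is random, and exactly which digits differ will vary between runs and machines:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# The same mathematical sum, accumulated in three different orders.
s_forward  = np.sum(x)                                       # over the buffer as stored
s_reversed = np.sum(x[::-1])                                 # same values, reversed order
s_chunked  = sum(np.sum(c) for c in np.array_split(x, 8))    # a "parallel-style" reduction

print(s_forward, s_reversed, s_chunked)
# Typically three slightly different float32 values.
```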
This is yet another example of the core message of this book:
numerical reliability depends as much on engineering details as on mathematics.
Memory Layout Shapes Algorithms
Once you understand how matrices are stored in memory, many “mysteries” become clear:
- Why multiplying by a transposed matrix often triggers a hidden copy.
- Why vectorized operations are sometimes 10× faster.
- Why some BLAS functions require column-major inputs.
- Why libraries disagree about default memory format.
- Why slicing can silently degrade performance.
Every algorithm we will study—from LU to QR to SVD—interacts heavily with layout. Memory determines iteration patterns, cache behavior, and overall speed.
The Road Ahead
With memory layout in place, we now have the full computational model: floating-point arithmetic, numerical limits, and memory layout. Together, they define the world in which numerical algorithms actually live.
Understanding how memory organizes vectors and matrices prepares us for the next chapter, where we stop looking at floating-point numbers and start examining the floating-point structures we build from them.
In the next chapter, we turn to the beating heart of numerical computation: vectors and matrices as they behave in practice—their norms, errors, conditioning, and the difference between algebraic definitions and implemented reality.
Let's step forward to:
3.0 Vectors and Matrices in Practice.