Cache Lines: The Invisible Boundary Your .NET Code Keeps Crossing

Senior Software Engineer specialising in cloud architecture, distributed systems, and modern .NET development, with over two decades of experience designing and delivering enterprise platforms in financial, insurance, and high-scale commercial environments. My focus is on building systems that are reliable, scalable, and maintainable over the long term. I’ve led modernisation initiatives moving legacy platforms to cloud-native Azure architectures, designed high-throughput streaming solutions to eliminate performance bottlenecks, and implemented secure microservices environments using container-based deployment models and event-driven integration patterns. From an architecture perspective, I have strong practical experience applying approaches such as Vertical Slice Architecture, Domain-Driven Design, Clean Architecture, and Hexagonal Architecture. I’m particularly interested in modular system design that balances delivery speed with long-term sustainability, and I enjoy solving complex problems involving distributed workflows, performance optimisation, and system reliability. I enjoy mentoring engineers, contributing to architectural decisions, and helping teams simplify complex systems into clear, maintainable designs. I’m always open to connecting with other engineers, architects, and technology leaders working on modern cloud and distributed system challenges.

You often find that a performance problem in .NET is not caused by one dramatic mistake. It usually comes from a small, ordinary decision repeated millions of times. One of those decisions is how your data sits in memory. You can write perfectly sensible C# and still end up making the CPU work harder than it needs to. Not because the algorithm is wrong. Not because the GC is broken. Not because you forgot to use Span<T>. It's because the data is laid out in a way the CPU doesn't like. That's where cache lines come in.

The basic idea

Modern CPUs do not normally fetch one byte or one integer from memory. They fetch a small block of memory at a time. That block is called a cache line. On most modern desktop and server CPUs, a cache line is commonly 64 bytes. You should treat that as a useful mental note rather than a law of nature. Hardware can vary. The important point is this: when your code reads one value, the CPU often pulls nearby values into cache as well. That's brilliant when your code walks through memory in order. It is much less brilliant when your code jumps around.

If you read items[0], there is a good chance items[1], items[2], and the next few values came along for the ride. That's why arrays are often fast. Their elements sit beside each other. It's also why object-heavy designs can become expensive in hot paths. The references may be beside each other, but the actual objects can live somewhere else entirely.
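To put rough numbers on "the next few values", here is a minimal sketch that works out how many elements of a given type fit in one cache line. It assumes a 64-byte line, which is common but not guaranteed, and the names CacheLineMath and ElementsPerLine are made up for this illustration.

```csharp
using System.Runtime.CompilerServices;

public static class CacheLineMath
{
    // Assumption: 64 bytes is typical for current desktop and server CPUs,
    // but it is not a guarantee. Treat this as a mental model only.
    public const int AssumedCacheLineBytes = 64;

    // How many elements of T fit in one assumed cache line.
    public static int ElementsPerLine<T>() where T : unmanaged
        => AssumedCacheLineBytes / Unsafe.SizeOf<T>();
}
```

Under that assumption, CacheLineMath.ElementsPerLine<int>() gives 16 and CacheLineMath.ElementsPerLine<long>() gives 8, so a sequential scan over an int[] gets up to 16 useful values from each line it pulls in.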

Why this matters in C#

C# does a good job of keeping you away from the raw details of memory, and most of the time that's exactly what you want. Normal business code shouldn't have to care about cache lines, object layout, or what the CPU happens to be pulling into cache. But once code gets hot enough, those details start to matter. If you're processing millions of messages, scanning large documents, running matching logic, pushing telemetry through a pipeline, or doing heavy batch work in the background, the shape of your data becomes part of the design.

At that point, performance isn't just about the code you write. It's also about how easy you make it for the CPU to move through the data.

Consider these two approaches:

public sealed class CustomerRecord
{
    public int Id { get; init; }
    public int Age { get; init; }
    public decimal Balance { get; init; }
    public bool IsActive { get; init; }
}

This is a normal object model. It is readable. It is easy to work with. It is the kind of thing most of us would write without thinking twice.

Now compare it with this:

public readonly struct CustomerRecord
{
    public readonly int Id;
    public readonly int Age;
    public readonly decimal Balance;
    public readonly bool IsActive;

    public CustomerRecord(int id, int age, decimal balance, bool isActive)
    {
        Id = id;
        Age = age;
        Balance = balance;
        IsActive = isActive;
    }
}

If you store the class version in an array, you get an array of references. The references are contiguous. The objects they point to may not be. If you store the struct version in an array, the records themselves are stored inline inside the array. That difference can be important.

References beside each other are not the same as data beside each other

This is the part that trips people up.

An array of classes gives you contiguous references.

CustomerRecord[] as classes

Array:
[ref][ref][ref][ref][ref][ref]

Heap:
 ref ──> object
 ref ─────────> object
 ref ─────> object
 ref ───────────────> object

An array of structs gives you contiguous records.

CustomerRecord[] as structs

Array:
[record][record][record][record][record][record]

That doesn't mean structs are always better. Large structs can be copied accidentally. Mutable structs can create bugs. Passing big structs around by value can make performance worse. The point is narrower than that. If your hot path repeatedly scans a lot of data, contiguous data often gives the CPU an easier job.
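If you want to see how much space each inline record takes, a small sketch like the one below can help. LayoutInspector and its methods are hypothetical names for this example, and the exact size reported depends on the runtime and platform because the runtime is free to add padding. The only thing you can rely on is that the struct holds at least 25 bytes of field data (4 + 4 + 16 + 1).

```csharp
using System.Runtime.CompilerServices;

public readonly struct CustomerRecord
{
    public readonly int Id;
    public readonly int Age;
    public readonly decimal Balance;
    public readonly bool IsActive;

    public CustomerRecord(int id, int age, decimal balance, bool isActive)
    {
        Id = id;
        Age = age;
        Balance = balance;
        IsActive = isActive;
    }
}

public static class LayoutInspector
{
    // Size of one record as stored inline in an array, including any
    // padding the runtime adds. The value is platform dependent.
    public static int BytesPerRecord() => Unsafe.SizeOf<CustomerRecord>();

    // Approximate bytes of element data in a CustomerRecord[] of the given
    // length. This ignores the array's own header.
    public static long ApproximatePayloadBytes(int count)
        => (long)count * BytesPerRecord();
}
```

Dividing an assumed 64-byte cache line by BytesPerRecord() tells you roughly how many whole records a sequential scan gets per line. The class version has no equivalent number, because each object's location is decided by the allocator and the GC.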

A small benchmark

Let’s make the idea concrete. Imagine we need to count active customers over and over again. Here's a simple class-based version:

public sealed class CustomerClass
{
    public int Id { get; init; }
    public int Age { get; init; }
    public decimal Balance { get; init; }
    public bool IsActive { get; init; }
}

And a struct-based version:

public readonly struct CustomerStruct
{
    public readonly int Id;
    public readonly int Age;
    public readonly decimal Balance;
    public readonly bool IsActive;

    public CustomerStruct(int id, int age, decimal balance, bool isActive)
    {
        Id = id;
        Age = age;
        Balance = balance;
        IsActive = isActive;
    }
}

Now benchmark a scan:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkRunner.Run<CacheLineBenchmarks>();

[MemoryDiagnoser]
public class CacheLineBenchmarks
{
    private CustomerClass[] _classes = null!;
    private CustomerStruct[] _structs = null!;

    [Params(10_000, 1_000_000)]
    public int Count { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _classes = new CustomerClass[Count];
        _structs = new CustomerStruct[Count];

        for (var i = 0; i < Count; i++)
        {
            var isActive = (i & 1) == 0;

            _classes[i] = new CustomerClass
            {
                Id = i,
                Age = 30 + (i % 40),
                Balance = i,
                IsActive = isActive
            };

            _structs[i] = new CustomerStruct(
                id: i,
                age: 30 + (i % 40),
                balance: i,
                isActive: isActive);
        }
    }

    [Benchmark]
    public int CountActiveClasses()
    {
        var total = 0;
        var customers = _classes;

        for (var i = 0; i < customers.Length; i++)
        {
            if (customers[i].IsActive)
            {
                total++;
            }
        }

        return total;
    }

    [Benchmark]
    public int CountActiveStructs()
    {
        var total = 0;
        var customers = _structs;

        for (var i = 0; i < customers.Length; i++)
        {
            if (customers[i].IsActive)
            {
                total++;
            }
        }

        return total;
    }
}

The exact numbers will depend on your machine, runtime version, CPU, and benchmark setup, so don’t get too attached to one result. What matters is the pattern the benchmark exposes.

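The class version walks through an array of references, then follows each reference to reach the actual object. That extra hop can mean more cache misses. The struct version scans through records stored inline in the array, which usually gives the CPU a much easier path through memory. Bear in mind that Setup allocates the objects back to back, so they are likely to start out close together on the heap. In a long-running application, where objects are allocated at different times, they tend to be more scattered and the gap is often wider.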

The CPU likes boring loops

A tight loop over contiguous memory is boring in the best possible way.

for (var i = 0; i < values.Length; i++)
{
    total += values[i];
}

The access pattern is predictable, the CPU can prefetch ahead, and each cache line gives you several useful values.

Now compare that with pointer chasing:

var node = head;

while (node is not null)
{
    total += node.Value;
    node = node.Next;
}

A linked list can look clean in code, but it’s often rough on the CPU. Each node points to the next one, and that next node could be sitting somewhere completely different in memory.

That makes the access pattern hard to predict. Instead of moving smoothly through nearby data, the CPU keeps waiting for the next piece to arrive from memory. That waiting is where a lot of performance disappears.

Cache locality beats cleverness more often than people expect

A lot of code is written as if the CPU cost is mostly about the number of operations. That's only part of the story.

This loop:

for (var i = 0; i < values.Length; i++)
{
    total += values[i];
}

can be extremely fast because the data access is predictable.

This code can be much slower, even if the operation looks similar:

foreach (var index in randomIndexes)
{
    total += values[index];
}

The second version jumps around. It may touch a new cache line for each read. Same array. Same kind of value. Very different memory behaviour. That's the lesson. Performance is not only about what you do, it's also about where the data is when you do it.
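The randomIndexes array above has to come from somewhere. The following is a rough sketch of one way to build it and compare the two loops. AccessPatternDemo and its methods are hypothetical names, the shuffle is a plain Fisher-Yates with a fixed seed, and Stopwatch timing is only a crude first look. For numbers you intend to trust, use BenchmarkDotNet.

```csharp
using System;
using System.Diagnostics;

public static class AccessPatternDemo
{
    // Builds 0..count-1, then shuffles it in place (Fisher-Yates).
    public static int[] BuildShuffledIndexes(int count, int seed)
    {
        var indexes = new int[count];
        for (var i = 0; i < count; i++) indexes[i] = i;

        var random = new Random(seed);
        for (var i = count - 1; i > 0; i--)
        {
            var j = random.Next(i + 1); // 0 <= j <= i
            (indexes[i], indexes[j]) = (indexes[j], indexes[i]);
        }
        return indexes;
    }

    // Walks the array in order: predictable, prefetch-friendly.
    public static long SumSequential(int[] values)
    {
        long total = 0;
        for (var i = 0; i < values.Length; i++) total += values[i];
        return total;
    }

    // Visits the same elements in the order given by indexes.
    public static long SumByIndex(int[] values, int[] indexes)
    {
        long total = 0;
        foreach (var index in indexes) total += values[index];
        return total;
    }

    // Crude timing of both loops over the same data.
    public static (TimeSpan Sequential, TimeSpan Shuffled) Measure(int count)
    {
        var values = new int[count];
        Array.Fill(values, 1);
        var indexes = BuildShuffledIndexes(count, seed: 42);

        var stopwatch = Stopwatch.StartNew();
        var sequentialSum = SumSequential(values);
        var sequential = stopwatch.Elapsed;

        stopwatch.Restart();
        var shuffledSum = SumByIndex(values, indexes);
        var shuffled = stopwatch.Elapsed;

        if (sequentialSum != shuffledSum)
            throw new InvalidOperationException("Both loops should produce the same sum.");
        return (sequential, shuffled);
    }
}
```

The shuffled loop also has to read the index array, so it does slightly more work than the sequential one. Pick a count large enough that the values array does not fit in cache, otherwise both loops may look much the same.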

False sharing: when two threads fight over one cache line

Cache lines become even more interesting when multiple threads are involved.

Imagine two counters:

public sealed class Counters
{
    public long CounterA;
    public long CounterB;
}

Two different threads update them:

// Thread 1
counters.CounterA++;

// Thread 2
counters.CounterB++;

At first glance, there is no shared state problem. The threads are updating different fields. But those fields may sit on the same cache line. When one core updates CounterA, the cache line containing both counters has to be coordinated with other cores. When another core updates CounterB, the same thing happens again. The threads are not logically fighting over the same value. At the hardware level, they may still be fighting over the same cache line.

That's false sharing.

One cache line

[ CounterA ][ CounterB ][ other bytes... ]
      ↑           ↑
   Core 1      Core 2

The fix is usually to separate frequently written values so they do not share the same cache line. That doesn't mean you should start padding every class in your application. It means you should know this problem exists when building high-contention structures.

Padding can help, but it is not free

You may see code like this in low-level libraries:

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

This padding spreads values out so independent counters are less likely to land on the same cache line. With Value at offset 64 inside a 128-byte struct, there are 64 bytes of padding before the value and 56 after it. The 128-byte size isn’t magic. It’s a defensive choice that gives the code some breathing room across different hardware and runtime layout details.

Padding still has a cost, though. If you create millions of padded items, you can burn a lot of memory and make the system worse overall. That’s the trade-off with low-level performance work. The trick can be valid, but it still has to pay its way.
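As a sketch of where a padded type like this might be used, the example below gives each worker its own PaddedCounter slot in an array and sums the slots at the end. PaddedCounterDemo and CountInParallel are hypothetical names for this illustration. Each slot has exactly one writer, so no Interlocked call is needed, and the 128-byte element size keeps consecutive Value fields 128 bytes apart.

```csharp
using System.Runtime.InteropServices;
using System.Threading.Tasks;

[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(64)]
    public long Value;
}

public static class PaddedCounterDemo
{
    // Each worker increments only its own slot, then the slots are summed
    // once all workers have finished.
    public static long CountInParallel(int workers, int incrementsPerWorker)
    {
        var counters = new PaddedCounter[workers];

        Parallel.For(0, workers, worker =>
        {
            for (var i = 0; i < incrementsPerWorker; i++)
            {
                // Single writer per slot: a plain increment is enough here.
                counters[worker].Value++;
            }
        });

        long total = 0;
        for (var w = 0; w < workers; w++)
        {
            total += counters[w].Value;
        }
        return total;
    }
}
```

To see whether the padding earns its keep on your hardware, you would benchmark this against the same loop over a plain long[] and compare. The memory cost is easy to state: 128 bytes per counter instead of 8.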

Data-oriented design in normal .NET code

You don't need to turn your application into a game engine to care about data layout. You can apply the idea in small, practical places.

Suppose you have a batch job that validates a million rows. In normal business code, it's natural to model each row as an object with all the fields the application might need. That's fine for readability, but if the hot path only checks two or three values, pulling the full object shape through memory can be wasteful.

A typical object model might look like this:

public sealed class ValidationItem
{
    public string Reference { get; init; } = "";
    public int SchemeId { get; init; }
    public int StatusId { get; init; }
    public DateTime CreatedAt { get; init; }
    public bool RequiresReview { get; init; }
}

That's probably fine for normal business logic. But suppose the hot path only checks StatusId and RequiresReview. Then Reference and CreatedAt come along on every iteration without ever being read.

A more data-oriented shape could split hot fields from cold fields:

public readonly struct ValidationHotFields
{
    public readonly int StatusId;
    public readonly bool RequiresReview;

    public ValidationHotFields(int statusId, bool requiresReview)
    {
        StatusId = statusId;
        RequiresReview = requiresReview;
    }
}

The full object model can still exist where it makes sense. You’re not replacing the domain model or turning the whole application inside out. You're just giving the performance-critical path a smaller shape to work with. It gets the fields it actually needs, laid out in a way that's easier for the CPU to scan.
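One way to connect the two shapes is to project the hot fields into an array once, then let the hot loop scan that array. The sketch below is only an illustration: ValidationHotPath, ExtractHotFields and CountNeedingReview are hypothetical names, and the rule being checked (a given status that also requires review) is invented for the example.

```csharp
using System;
using System.Collections.Generic;

public sealed class ValidationItem
{
    public string Reference { get; init; } = "";
    public int SchemeId { get; init; }
    public int StatusId { get; init; }
    public DateTime CreatedAt { get; init; }
    public bool RequiresReview { get; init; }
}

public readonly struct ValidationHotFields
{
    public readonly int StatusId;
    public readonly bool RequiresReview;

    public ValidationHotFields(int statusId, bool requiresReview)
    {
        StatusId = statusId;
        RequiresReview = requiresReview;
    }
}

public static class ValidationHotPath
{
    // One-off projection: hot[i] holds the hot fields of items[i], so an
    // index found in the hot array maps straight back to the full item.
    public static ValidationHotFields[] ExtractHotFields(IReadOnlyList<ValidationItem> items)
    {
        var hot = new ValidationHotFields[items.Count];
        for (var i = 0; i < items.Count; i++)
        {
            hot[i] = new ValidationHotFields(items[i].StatusId, items[i].RequiresReview);
        }
        return hot;
    }

    // Hypothetical rule: count rows with the given status that need review.
    public static int CountNeedingReview(ReadOnlySpan<ValidationHotFields> hot, int statusId)
    {
        var total = 0;
        for (var i = 0; i < hot.Length; i++)
        {
            if (hot[i].StatusId == statusId && hot[i].RequiresReview)
            {
                total++;
            }
        }
        return total;
    }
}
```

The projection has a cost of its own, so this only pays off when the hot array is scanned many times or the projection can be built as the data is loaded.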

Row layout versus column layout

Most application code stores records as rows.

Row layout

[Customer 1: Id, Age, Balance, IsActive]
[Customer 2: Id, Age, Balance, IsActive]
[Customer 3: Id, Age, Balance, IsActive]

That's natural when you usually work with one customer at a time.

But analytics-style code often wants one field across many records.

Column layout

Ids:      [1][2][3][4][5]
Ages:     [31][32][33][34][35]
Balances: [10][20][30][40][50]
Active:   [true][false][true][true][false]

If you only need to scan Active, the column layout is compact and cache-friendly.

Here is a tiny example:

public sealed class CustomerColumns
{
    public int[] Ids { get; }
    public int[] Ages { get; }
    public decimal[] Balances { get; }
    public bool[] ActiveFlags { get; }

    public CustomerColumns(int count)
    {
        Ids = new int[count];
        Ages = new int[count];
        Balances = new decimal[count];
        ActiveFlags = new bool[count];
    }
}

Counting active customers becomes:

public static int CountActive(CustomerColumns customers)
{
    var total = 0;
    var activeFlags = customers.ActiveFlags;

    for (var i = 0; i < activeFlags.Length; i++)
    {
        if (activeFlags[i])
        {
            total++;
        }
    }

    return total;
}

That loop touches only the data it needs. This is one reason column stores are powerful for analytical workloads. They avoid pulling irrelevant fields into cache.
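If the data arrives as rows, something has to fill the columns. Here is a minimal sketch of that step. CustomerRow and CustomerColumnsBuilder are hypothetical names introduced for this example, and CustomerColumns is repeated from above so the snippet stands on its own.

```csharp
using System.Collections.Generic;

// Hypothetical row shape used only to feed the column arrays.
public readonly struct CustomerRow
{
    public readonly int Id;
    public readonly int Age;
    public readonly decimal Balance;
    public readonly bool IsActive;

    public CustomerRow(int id, int age, decimal balance, bool isActive)
    {
        Id = id;
        Age = age;
        Balance = balance;
        IsActive = isActive;
    }
}

public sealed class CustomerColumns
{
    public int[] Ids { get; }
    public int[] Ages { get; }
    public decimal[] Balances { get; }
    public bool[] ActiveFlags { get; }

    public CustomerColumns(int count)
    {
        Ids = new int[count];
        Ages = new int[count];
        Balances = new decimal[count];
        ActiveFlags = new bool[count];
    }
}

public static class CustomerColumnsBuilder
{
    // Copies each field of row i into position i of its own array, so the
    // same index identifies one customer across all four columns.
    public static CustomerColumns FromRows(IReadOnlyList<CustomerRow> rows)
    {
        var columns = new CustomerColumns(rows.Count);
        for (var i = 0; i < rows.Count; i++)
        {
            columns.Ids[i] = rows[i].Id;
            columns.Ages[i] = rows[i].Age;
            columns.Balances[i] = rows[i].Balance;
            columns.ActiveFlags[i] = rows[i].IsActive;
        }
        return columns;
    }
}
```

The trade-off runs the other way when you need a whole customer: reading one record now touches four separate arrays, which is why row layout remains the sensible default for code that works with one customer at a time.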

The trap: making everything low-level

The dangerous version of this advice is to start treating every allocation, class, and object reference as a mistake. That's how you end up with code that looks clever in a benchmark and awful in a real application. Most .NET code should still be readable first. Request handlers, admin screens, normal CRUD flows, and code that runs a few times per request don't need to be bent around cache lines. You'll usually get more value from clear models, simple control flow, and code the next developer can safely change.

Cache line thinking belongs where the code is genuinely hot. If you’re writing a parser, serialiser, queue, search loop, batch processor, or telemetry pipeline, memory access can become a real part of the cost. That’s when it’s worth shaping the data around the work being done. The win isn’t making cold code clever. It’s keeping normal code normal, and being more deliberate when a hot path proves it needs a different shape.

How to investigate cache behaviour in .NET

BenchmarkDotNet is the right place to start. It won't tell you everything about the CPU cache, but it gives you a safe way to compare two designs without guessing. The important thing is to make the benchmark realistic. Use data sizes large enough that everything doesn't neatly fit in cache, run proper release builds, and keep allocations visible with [MemoryDiagnoser]. Tiny benchmarks can make bad designs look fine because the CPU never really gets stressed.

If you need deeper evidence, you can move into profiling tools. PerfView and Visual Studio profiling are useful on Windows, while perf on Linux can expose lower-level details like cache misses and CPU counters, depending on your environment and permissions. You won't always need to go that far. A simple benchmark is often enough to show the shape of the problem, especially when scattered object access is losing to a more contiguous layout.

Practical rules of thumb

The practical advice is simple enough: keep scanned data close together, keep hot data compact, and don’t drag rarely used fields through a loop that doesn’t need them. Arrays are often your friend here because they give the CPU a predictable path through memory. That doesn’t mean reaching for structs everywhere. Large structs can create their own problems, and linked structures can still be the right shape when the code isn’t performance-sensitive. The point is to be more deliberate when the path is genuinely hot.

False sharing is worth keeping in mind too, especially in high-contention multi-threaded code. Two values can be logically independent and still end up fighting over the same cache line.

Measure before and after. Cache line thinking helps you make a better guess about what to test, but it doesn’t replace measurement.

A cache line is only a small block of memory, but it explains a lot of performance behaviour that can look strange from normal C#. It helps explain why arrays are fast, why pointer chasing hurts, and why two threads updating different fields can still slow each other down. It also explains why a clean object model can be the wrong shape for a hot loop. The point isn’t that every .NET developer needs to write low-level code every day. Most of the time, you shouldn't. But when code gets hot enough, the CPU’s preferences start to matter, and your data layout becomes part of the design.