Post

CS Roadmap Part 10 — Synchronization Primitives: How Does a Mutex Let Only One Thread In?

CS Roadmap Part 10 — Synchronization Primitives: How Does a Mutex Let Only One Thread In?
Prerequisites — Read these first
TL;DR — Key Takeaways
  • The reason two threads touching the same variable die only sometimes is that read-modify-write is not atomic. A lock is the abstraction "only one thread in the critical section at a time," and at the bottom of that abstraction there is always a hardware atomic instruction (x86 LOCK CMPXCHG, ARM LDXR/STXR)
  • Mutex/Semaphore/RWLock/Spinlock/CondVar are just different policies on top of the same atomic. Linux puts the fast path in user space via futex and only enters the kernel on the slow path; Windows SRWLock and macOS os_unfair_lock follow the same idea
  • The real cost of a lock is not the instruction itself but cache line bouncing. Under MESI, when one core grabs the lock, the corresponding line in other cores is Invalidated, and cross-socket the cost is 10× higher. False sharing is the trap that causes this unintentionally
  • Unity's Job System builds a JobHandle dependency DAG that prevents read/write conflicts, and AtomicSafetyHandle catches races at Schedule() time (Editor/Development builds). Burst leaves normal NativeArray reads/writes as plain load/store and only emits hardware atomics for explicit Interlocked.* calls. DOTS analyzes ComponentSystem read/write to parallelize without locks
  • Unreal separates Game/Render/RHI/Audio Threads and queues commands via TaskGraph + FRenderCommandFence. ENQUEUE_RENDER_COMMAND has no visible lock — it enqueues commands that the Render Thread executes in FIFO order
Visitors

Hits

Introduction: What “Dies Only Sometimes” Really Means

When Stage 2 began, we asked one question.

Why does a program with two threads writing to the same variable die only sometimes?

After Part 7 — OS Architecture, Part 8 — Process and Thread, and Part 9 — Scheduling, we’ve reached half the answer. The OS swaps threads in invisible places, and those swap moments are unpredictable — that justifies the word “sometimes.”

This part covers the other half. A single line of code that reads, modifies, and writes is in fact not atomic, and we’ll see how the OS and CPU cooperate to manufacture the abstraction “only one at a time.”

What we’ll cover:

  • The nature of a race condition: why counter++ loses data
  • The lock family: Mutex / Semaphore / RWLock / Spinlock / CondVar / Monitor / Barrier
  • How locks are built: Peterson → Test-and-Set → CAS → hardware atomic
  • OS-specific primitives: Linux futex, Windows SRWLock, macOS os_unfair_lock
  • Hardware mechanism: CPU cache hierarchy, MESI, how atomic grabs cache line ownership, false sharing, memory barrier
  • Unity synchronization (deep dive): Main Thread model, Job System, NativeContainer, AtomicSafetyHandle, Burst, DOTS — how data flows between cores
  • Unreal synchronization: Game/Render/RHI Thread separation, TaskGraph, ENQUEUE_RENDER_COMMAND internals
  • Game engine patterns: lockless ring buffer, double buffer to avoid locks, frame-locked sync

It sounds long, but compressed to one sentence: a lock is an abstraction, that abstraction sits on atomic instructions, and atomic instructions sit on cache line ownership. We’ll work our way down from the top.

Part 1: The Nature of a Race Condition

A one-line mystery

Look at this code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
static int counter = 0;

void Worker() {
    for (int i = 0; i < 1'000'000; ++i) {
        counter++;
    }
}

int main() {
    std::thread t1(Worker);
    std::thread t2(Worker);
    t1.join(); t2.join();
    std::cout << counter << "\n";  /* expected: 2,000,000 */
}

Two threads each added 1 a million times, so the result should be two million. But run it and you get a different number every time, almost always less than two million. On a desktop you’ll commonly see somewhere between 1,200,000 and 1,800,000.

The reason values vanish is that counter++ is one line but three steps in machine code. The figure below traces, in time order, what happens when two threads touch the same counter at the same time. The moment one increment is lost becomes obvious.

counter++ — split into three steps, and the moment a race occurs
time →
Thread A (Core 0)
counter (memory)
Thread B (Core 1)
t₀
load eax ← [counter]
eax = 41
41
·
t₁
add eax, 1
eax = 42
41
load eax ← [counter]
eax = 41
t₂
store [counter] ← eax
counter = 42
42↑ A's +1
add eax, 1
eax = 42
t₃
·
42✗ B overwrote +1
store [counter] ← eax
counter = 42
Expected: counter = 43 (A and B each +1) · Actual: counter = 42 — one increment is gone
Cause: when B read 41 at t₁, A's result (42) wasn't in memory yet — read-modify-write is not atomic

race condition vs data race — not the same thing

Hold on, let’s settle this. Are race condition and data race the same thing?

They’re often used interchangeably but they differ academically.

  • Data race: two threads access the same memory without synchronization, and at least one is a write. This is explicitly defined in multiple language memory models. In C++/Rust it is undefined behavior, while Java/Go define limited behavior within their memory models — not UB, but the result of a program that is not “correctly synchronized” can violate intuition.
  • Race condition: a broader concept where the result depends on thread execution order. Even if two threads communicate safely through atomic variables, if business-logic outcomes depend on who arrives first, that’s a race condition.

So every data race causes a race condition, but not every race condition is a data race. Locks remove data races; race conditions are a design problem at a higher layer.

Atomicity, visibility, ordering — three guarantees

Synchronization primitives promise three things.

  1. Atomicity: an operation happens as a single indivisible unit, with no observable intermediate state
  2. Visibility: a value written by one thread is guaranteed to become visible to others
  3. Ordering: memory operations are seen by other threads in the order the program intended

counter++ broke because of atomicity. But visibility and ordering are separate problems too — CPUs reorder instructions, and caches don’t synchronize instantly. All three are tackled in depth in Part 12 (memory model and atomic operations), but they sit in the background throughout this part.

Hold on, let’s settle this. Can a plain read or plain write also be non-atomic?

Yes. On x86/ARM, naturally aligned 4-byte/8-byte reads and writes are generally atomic (separate from C++’s std::atomic guarantees). But misaligned access, 16-byte SIMD, or 64-bit values on a 32-bit CPU can split a single store into two and let another core observe a half-written value. That’s why in C++ we use std::atomic<T> instead of a plain variable — to tell the compiler “this really must be atomic.”

What a lock promises

The simplest way to keep counter++ from breaking:

1
2
3
4
5
6
std::mutex m;
/* ... */
{
    std::lock_guard<std::mutex> lk(m);
    counter++;
}

While lock_guard holds m, no other thread can enter on the same m. The result is always two million.

Two questions follow naturally.

  1. How does the OS manufacture the “one at a time” promise? — Part 3
  2. What does that promise cost? — Part 5

First let’s sort out the kinds of locks and how they differ.


Part 2: The Lock Family

Comparison at a glance

NameEssenceWho releases?Wait styleTypical use
Mutex1-slot locklocking threadsleepprotecting a critical section
Recursive Mutexre-entrant Mutexlocking threadsleepnested calls from same thread
Spinlock1-slot locklocking threadbusy-waitvery short critical sections, kernel
SemaphoreN-slot counterany threadsleepresource pools, producer-consumer
RWLockmany readers / one writerlocking threadsleepread-heavy data
Condition Variablewait + signalwaking threadsleepcondition-based synchronization
MonitorMutex + CondVar bundle(per object)sleepJava synchronized
Barrierwait for N threads to arriveonce all arrivesleepparallel-stage sync
Latchone-shot countdownwhen count hits 0sleepinit-complete signal

Quick notes on each.

Mutex (Mutual Exclusion)

The most basic. Holds state 0 or 1, and only the thread that locked it can unlock it. If someone else has it, the incoming thread joins the wait queue and sleeps.

1
2
3
4
std::mutex m;
m.lock();
// critical section
m.unlock();

In C++ you almost always use a RAII wrapper like std::lock_guard or std::unique_lock. They guarantee unlock even on exception.

Hold on, let’s settle this. reentrant, recursive, thread-safe — three terms that get confused.

  • Thread-safe: a function/object behaves correctly when called concurrently from multiple threads. A property observed from outside.
  • Reentrant: a thread can re-enter a function it’s already executing (via interrupt or signal) safely. Conditions include no global variable use, no static buffers, etc.
  • Recursive (lock): the same thread can acquire a lock it already holds. The counter goes up, and it must be released the same number of times.

Recursive mutex is convenient but usually a sign of poor lock design. If a deep function doesn’t know it already holds the lock and tries to acquire it again, that means the interface design left lock points unspecified.

Spinlock

Same state as Mutex but the failing thread does not sleep — it keeps retrying.

1
2
3
while (lock.test_and_set(std::memory_order_acquire)) {
    /* busy-wait */
}

When is Spinlock better than Mutex? When the critical section is very short and context-switch cost outweighs spin cost. A classic case: kernel interrupt handlers can’t sleep at all, so only spinlock will do.

Conversely, if the critical section is long or you have more threads than cores, spinlock is a disaster. You’d burn CPU cycles on time that should be slept away.

Hold on, let’s settle this. What’s the difference between a spinlock and plain busy-wait?

A spinlock is a lock built on atomic primitives, while a plain busy-wait is just polling a regular variable. Polling a plain variable lets the compiler hoist the loop (while(flag); becomes an infinite loop) or never see a change made by another core. A spinlock prevents both via atomic + memory ordering.

Semaphore

The oldest synchronization primitive, proposed by Dijkstra in 1965. A non-negative integer counter with two operations.

  • P (wait, acquire): sleep if counter is 0, otherwise decrement by 1
  • V (signal, release): increment by 1, wake one waiter if any

A Mutex is really “a binary semaphore initialized to 1.” But the semantics differ — a Mutex must be released by the locking thread, while a Semaphore can be released by anyone. That makes Semaphore suitable for resource pools (free slots in a connection pool) or expressing “remaining capacity / items” in producer-consumer queues.

Reader-Writer Lock

Many readers, one writer. Throughput improves when reads vastly outnumber writes.

1
2
3
4
5
6
7
8
9
std::shared_mutex m;
{
    std::shared_lock<std::shared_mutex> r(m);  /* multiple readers OK */
    // read
}
{
    std::unique_lock<std::shared_mutex> w(m);  /* exclusive writer */
    // write
}

Two pitfalls. First, the RWLock itself costs more than a plain Mutex. For very short critical sections, plain Mutex can be faster. Second, writer starvation — if readers keep arriving, a writer can wait forever. Most implementations provide a writer-preferring mode.

Condition Variable

Expresses “wait until a condition is satisfied.” Always used paired with a Mutex.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
std::mutex m;
std::condition_variable cv;
bool ready = false;

/* Waiter */
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, []{ return ready; });
    // woke up, ready == true
}

/* Notifier */
{
    std::lock_guard<std::mutex> lk(m);
    ready = true;
}
cv.notify_one();

The key is that cv.wait atomically does “release lock, sleep, on wake reacquire lock.” Without that, a notify arriving just before wait causes the lost wakeup problem.

Also, you pass the wait predicate as a lambda because of spurious wakeup. The standard permits CV to wake without the condition being satisfied, so you must re-check after waking.

Monitor

An abstraction that bundles Mutex + Condition Variable per object. Java’s synchronized keyword and wait/notify attached to every object are exactly this pattern. C#’s lock block is the same.

1
2
3
4
5
6
lock (gameState) {
    while (!gameState.Ready)
        Monitor.Wait(gameState);
    /* work */
    Monitor.PulseAll(gameState);
}

The monitor is embedded in the gameState object header so you don’t have to declare a separate mutex. Convenient, but the lock is tightly bound to one object, making lock granularity hard to adjust.

Barrier / Latch

Barrier: waits until N threads have all arrived at one point. Frame-locked parallel processing in games — “N physics update jobs must all finish before moving on” — is the canonical example. Reusable (cyclic barrier).

Latch: everyone waits until a counter reaches 0. Single-use. Useful for signaling once to all starting threads that initialization is done.

Summary: which lock when

The matrix below places locks on two axes — critical section length (horizontal) and read/write ratio (vertical) — and marks the suitable primitive. At a glance you can see how tools with the same atomic foundation but different policies find their place.

Suitable synchronization primitive by situation
read/write balance
Short (< 1μs)
Medium (1~100μs)
Long / with I/O
read ≈ write
Spinlock
Kernel/interrupt only. Mostly forbidden in userspace
Mutex
Default choice. futex/SRWLock/os_unfair_lock
Mutex + CondVar
Long waits: sleep + condition signal
read ≫ write
Mutex / atomic
RWLock overhead may be higher
RWLock (Shared)
Watch for writer starvation
RCU / Seqlock
Lock-free reads, protected writes
Resource pool / Queue
Semaphore
Slot count = counter
Semaphore + CondVar
Standard producer-consumer
Lock-free Queue
SPSC ring / Vyukov MPSC
Phase sync
atomic flag
Short phase signal
Barrier / Latch
N threads sync at once
JobHandle / Fence
Engine-level dependency
Suitable Possible but careful (overhead/forbidden) — Every cell sits on the same atomic CAS underneath. Only the policy changes.

That covers the interface layer that users see. Now we go one layer down and see how the OS and compiler manufacture these promises.


Part 3: How Locks Are Built

Attempt 1 — Can software alone do it?

Start with the naive attempt. Can we build a lock for two threads using just one variable?

1
2
3
4
5
6
int locked = 0;

void lock() {
    while (locked) ;       /* spin */
    locked = 1;            /* got it! */
}

Obviously broken. Two threads that pass while(locked) simultaneously execute locked = 1 at the same time and both enter the critical section. There’s an empty gap between read and write where a race happens.

Peterson’s algorithm (1981)

You can build a lock in software alone. For two threads, the most elegant solution is Gary Peterson’s algorithm from 1981.

Peterson's algorithm — two threads attempting to enter simultaneously
Shared variables flag[2] = {0, 0} turn = 0 flag[i] = "I want to enter", turn = "next to yield"
Thread A (self=0)
1flag[0] = 1declare intent to enter
2turn = 1yield to B
3while (flag[1] && turn == 1) ;wait if B also wants in and it's B's turn
4enter critical section
5flag[0] = 0unlock
Thread B (self=1)
1flag[1] = 1declare intent to enter
2turn = 0yield to A
3while (flag[0] && turn == 0) ;spinning ...
4(wait until A unlocks)
5then enter critical section
Why does only one pass? Even when both threads execute step 2 nearly simultaneously, turn can hold only a single value (the last write wins). If that value is 0, A has yielded so B passes; if 1, vice versa — either way, exactly one thread escapes the while at step 3.
But it doesn't work on real CPUs. From other cores' perspective, steps 1 and 2 may appear reordered (memory reorder), and store buffers prevent your own writes from being visible to other cores immediately. You ultimately need a memory barrier — which is a hardware instruction.

The principle is combining “I yield” intent with “whose turn is it” agreement. Even when both threads enter at once, turn can hold only a single value, so only one passes.

Does this actually work? In theory yes, on real CPUs no.

Two reasons.

  1. CPU instruction reordering: the order in which flag[self] = 1; turn = other; reaches memory can differ from code order. From another core, you might see turn change first while flag[self] still looks like 0.
  2. Store buffer: each core has a write buffer, so values you wrote aren’t immediately visible to other cores.

Solving both requires a memory barrier, which is essentially a hardware instruction. So software alone won’t do.

Attempt 2 — Hardware helps

What we fundamentally need is a single instruction that does “read-compare-write” atomically. CPU vendors provide special instructions for this. The three most common families, summarized in one figure:

Hardware atomic primitives — three families
Test-and-Set (TAS)
"write a value and get the old one back"
read
old = *X
write
*X = true
in one step — indivisible
x86XCHG · LOCK BTS
ARMLDXR / STXR pair
Compare-and-Swap (CAS)
"if it equals expected, swap to new"
read
cur = *X
compare
cur == exp ?
conditional write
*X = desired
returns success/failure boolean
x86LOCK CMPXCHG
ARMLDXR / STXR · CAS (v8.1+)
Fetch-and-Add (FAA)
"add and get the old value back"
read
old = *X
+
add and write
*X = old + δ
returns old
x86LOCK XADD
ARMLDADD (v8.1+)
↓ build a spinlock on top
One CAS makes one spinlock cycle
1
acquire attempt
CAS(lock, 0, 1)
✓ success
lock = 1
enter critical section
✗ fail (already 1)
PAUSE hint
retry
2
critical section
acquire-release
memory barrier included
3
release
store(lock, 0)

CAS is the most general and powerful — it’s the basic unit of lock-free data structures (covered in Part 13). TAS suffices for simple mutual exclusion like a spinlock, and FAA is the core operation for counter increments and ticket locks.

Hold on, let’s settle this. Are “CAS” and “atomic” the same?

No. Atomic operation is the general concept of “any operation that happens as a single indivisible step.” CAS is one kind of atomic operation, with siblings TAS, FAA, LL/SC, and atomic load/store. C++’s std::atomic<T> is an interface that abstracts these instructions, where compare_exchange_weak maps to CAS and fetch_add to FAA.

CAS-based spinlock implementation

Building a spinlock with CAS:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
struct Spinlock {
    std::atomic<bool> locked{false};

    void acquire() {
        bool expected = false;
        while (!locked.compare_exchange_weak(
                expected, true,
                std::memory_order_acquire)) {
            expected = false;       /* gets clobbered on CAS failure */
            while (locked.load(std::memory_order_relaxed))
                __builtin_ia32_pause();  /* x86 PAUSE hint */
        }
    }

    void release() {
        locked.store(false, std::memory_order_release);
    }
};

Two things worth noting.

  1. acquire / release memory order: ensures values written after acquiring the lock are visible to threads that later acquire the same lock. Details in Part 12.
  2. PAUSE hint: on x86, inserting PAUSE in spin loops reduces power consumption and avoids memory-order violation penalties. ARM uses YIELD for a similar purpose.

Spin or sleep — the decision rule

The same atomic underneath, just different policy, gives you spinlock or mutex. Which to use comes down to comparing critical section length against context-switch cost.

ConditionSpin winsSleep wins
Critical section length< 1μs> 10μs
Threads vs cores>
Environmentinterrupt handler, RT kernelnormal user-space code

Modern mutex implementations are therefore adaptive. They spin briefly first, then fall back to sleep if they still can’t acquire. Linux glibc’s NPTL does this; Windows SRWLock does too.


Part 4: OS-specific Primitives

Linux futex — fast userspace mutex (2002)

Does pthread_mutex_lock need a syscall every time? If the lock is uncontended, there’s no reason to call into the kernel. That insight gave rise to futex.

Hubertus Franke, Rusty Russell, and Matthew Kirkwood introduced it into Linux in 2002. The core idea:

The pseudo-code commonly cited is a 3-state mutex implementation example (0 unlocked, 1 locked-no-waiters, 2 locked-with-waiters). But that state encoding is a user-space library’s policy choice, not part of futex itself — futex is the kernel-provided general wait/wake building block, and abstractions like mutex/semaphore/condvar are built on top with their own state encodings. The implementation above needs two bits because unlock must know whether waiters exist to decide whether to call the wake syscall.

A single uncontended futex operation costs roughly 10–20ns, less than one tenth of a syscall (~hundreds of ns). That’s why most user-space blocking lock implementations on Linux (pthread_mutex, sem_t, glibc’s std::mutex, etc.) are built on top of futex.

futex — fast path vs slow path branching
state 0 unlocked 1 locked · no waiters 2 locked · waiters present
mutex_lock()
USER SPACE — fast path
CAS(m, 0, 1)
→ success: lock acquired, done
→ failure: someone else holds it
~10ns · one CAS · no syscall
KERNEL — slow path
atomic_xchg(m, 2) · mark "waiters present"
futex_wait(m, 2) syscall · enter kernel wait queue and sleep
~hundreds of ns + context switch · plus sleep time
mutex_unlock()
USER SPACE — fast path
fetch_sub(m, 1) → check previous value
→ was 1: no waiters, done
→ was 2: waiters present
~10ns · one atomic · no syscall
KERNEL — slow path
store(m, 0) · unlock
futex_wake(m, 1) syscall · wake one waiter
~hundreds of ns · woken thread retries lock
Key insight — relies on the statistical fact that the lock is uncontended 90%+ of the time. When uncontended, everything finishes with one atomic in user space — avoiding the cost of calling the kernel. Only on contention does it enter the kernel and register on the wait queue. Windows SRWLock and macOS os_unfair_lock are variations of the same idea.

Hold on, let’s settle this. How are “fast path” and “slow path” generally split?

In lock implementations, the fast path is the route that finishes as fast as possible when there’s no contention, and the slow path is the route that needs extra work when there is contention. Nearly every modern synchronization primitive — futex, SRWLock, os_unfair_lock, parking_lot — follows this pattern. Fast path: 1–2 CASes. Slow path: kernel entry or separate wait-queue management. The design relies on the statistical fact that locks are uncontended over 90% of the time.

Windows SRWLock and CRITICAL_SECTION

Windows has multiple generations of synchronization primitives, with different tradeoffs at each era.

Mutex (kernel object): the oldest. Handle-based, can be shared across processes. But every lock/unlock is a syscall, making it slow — ~hundreds of ns.

CRITICAL_SECTION: introduced in NT 4 (1996). Handles the fast path with user-space atomics and only sleeps via the WaitForSingleObject kernel event on contention. Same idea as futex, five years earlier. Drawbacks: a heavy struct (~40 bytes including internal counter and handle) and only works within a single process.

SRWLock (Slim Reader/Writer Lock): introduced in Windows Vista (2007). 8-byte pointer size, fast/slow path design almost identical to futex, with RWLock semantics. No init function needed (= SRWLOCK_INIT). For new code, SRWLock is almost always the answer.

1
2
3
4
5
6
7
8
9
SRWLOCK lock = SRWLOCK_INIT;

AcquireSRWLockExclusive(&lock);   /* writer lock */
/* ... */
ReleaseSRWLockExclusive(&lock);

AcquireSRWLockShared(&lock);      /* reader lock */
/* ... */
ReleaseSRWLockShared(&lock);

Internally, SRWLock packs the following into a single 8-byte atomic: locked bit, waiting bit, waking bit, reader count, and high bits of a wait-queue head pointer. Bit-packing is what makes a single CAS suffice for the fast path.

Condition Variable: CONDITION_VARIABLE, paired with SRWLock, was added in the same era.

Hold on, let’s settle this. Which to use: CRITICAL_SECTION or Mutex?

For new code within a single process, SRWLock is the answer. Lighter and supports RWLock. CRITICAL_SECTION is for compatibility with older code; Mutex (kernel object) is only when you need inter-process synchronization or a named mutex.

macOS os_unfair_lock and family

macOS is based on BSD pthread, but adds Mach port–based synchronization and Apple-specific locks.

pthread_mutex_t: the standard. Internally uses the Mach __ulock_wait / __ulock_wake syscalls (macOS’s futex equivalent). Same fast/slow path structure.

os_unfair_lock: introduced in macOS 10.12 / iOS 10 (2016). Created to replace OSSpinLock. OSSpinLock was vulnerable to priority inversion — if a low-priority thread held the lock and was demoted to E cores, a high-priority thread could spin forever.

1
2
3
4
5
os_unfair_lock lock = OS_UNFAIR_LOCK_INIT;

os_unfair_lock_lock(&lock);
/* ... */
os_unfair_lock_unlock(&lock);

The “unfair” in the name is intentional. It abandons FIFO fairness and instead records information about the holding thread in the lock. When priority inversion is detected, it temporarily boosts the holder’s priority (priority donation). This is enabled by encoding the owner’s thread ID in the 4-byte atomic.

Within os_unfair_lock’s 4 bytes is the owner thread’s mach thread port id. This lets the kernel know who holds the lock on the contended path and automatically perform QoS inheritance (boost). It connects directly to the QoS coverage in Part 9.

OSSpinLock: deprecated. Never use in new code.

NSLock / @synchronized: Objective-C object-level monitor. Every NSObject can potentially have a lock. Same idea as Java’s synchronized but costly.

Dispatch semaphore (dispatch_semaphore_t): GCD’s counting semaphore. P/V semantics.

The same idea across three OSes

OSuncontended (fast)contended (slow)recommended new-code lock
Linuxatomic CASfutex(WAIT/WAKE) syscallpthread_mutex, std::mutex
Windowsatomic CASNtWaitForAlertByThreadId (Win8+)SRWLock
macOSatomic CAS__ulock_wait/wake syscallos_unfair_lock

Three OSes provide different names and APIs, but share these traits:

  1. Fast path: 1–2 user-space atomics
  2. Slow path: kernel wait queue only
  3. For RWLock, add reader count via bit packing

However, priority inheritance/donation handling differs by OS. macOS os_unfair_lock encodes the owner thread ID in 4 bytes so the kernel can immediately know the boost target. Linux records the owner only in the separate PI_futex mode (enabled via PTHREAD_PRIO_INHERIT) to support priority inheritance for RT tasks. Windows SRWLock and CRITICAL_SECTION don’t guarantee priority inheritance by default — we revisit this in Part 11 (deadlock and priority inversion).

The C++ standard library, Rust parking_lot, and Java j.u.c.locks are all built on top of these OS APIs.


Part 5: How Hardware Implements Locks

So far we’ve said “CAS happens atomically.” But how is that actually possible? A mechanism inside the CPU must make only one core pass through when multiple cores touch the same address at the same time. That mechanism is cache coherence, and atomic instructions ride on top of it.

CPU cache hierarchy — why multiple levels?

Modern CPUs don’t access memory directly. Multiple cache levels sit between cores and DRAM.

LevelSize (per core / shared)Access timeOwner
Register~320 cyclecore only
L1 D-cache32–48KB4–5 cyclecore only
L1 I-cache32–48KB4–5 cyclecore only
L2256KB–1MB12–15 cyclecore only (usually)
L3 (LLC)4–64MB30–50 cycleshared among same-socket cores
DRAMGB100–300 cycle (200–400ns)everyone
DRAM on other socketGB200–600 cycleNUMA

The cache unit is the cache line, and on almost all modern x86/ARM systems it’s 64 bytes. When a CPU reads one byte from memory, it fetches all 64.

Hold on, let’s settle this. If each core has its own L1, can the same variable have different values across cores?

Solving exactly that problem is the cache coherence protocol. Multiple cores can hold copies of the same cache line, but the moment one core writes to that line, the others’ copies are invalidated or updated to prevent inconsistency. The most widely used protocols are MESI (Intel) and its MOESI variant (AMD).

MESI protocol

Each cache line is in one of four states.

StateMeaningOther cores
M (Modified)only this core has it, differs from DRAM (dirty)no copies
E (Exclusive)only this core has it, matches DRAM (clean)no copies
S (Shared)multiple cores have the same value (clean)same value present
I (Invalid)this core’s copy is invalid(it’s elsewhere)

There’s just one core rule.

If a cache line is in M state, only that core has it.

When a write happens, that line in all other cores drops to I. To make this possible, cores exchange coherence messages.

MessageMeaning
Read“give me this line (for reading)”
Read-for-Ownership (RFO)“give me this line (for writing)” — request to invalidate other cores’ copies
Invalidate“discard your copy of this line”
Read-Response“here’s that line”

One-line summary — a 4-state machine (M/E/S/I) that tracks who owns a cache line and whose copy is still valid. When one core writes to a line, the others drop to I, and this is the mechanism atomic instructions use to manufacture “one at a time.”

The next visualization traces a single cache line moving between two cores across 7 steps. It’s a deep appendix, so it’s collapsed — expand when you need it.

▸ MESI state transitions in detail — every step of a cache line moving between cores
A cache line's MESI transitions — when two cores access the same line
state M Modified E Exclusive S Shared I Invalid
Event
Core 0 line state
Core 1 line state
coherence message
t₀ Initial: nobody has it
I
I
t₁ Core 0 read
I → E
I
Read → DRAM → response
t₂ Core 0 write (atomic CAS)
E → M
I
local — no message (already Exclusive)
t₃ Core 1 attempts read
M → S
I → S
Read → Core 0 supplies dirty data, DRAM updated
t₄ Core 1 write (atomic CAS) — bouncing begins
S → I
S → M
RFO (Read-for-Ownership) → Core 0's copy invalidated
t₅ Core 0 writes again
I → M
M → I
RFO → pulls dirty data from Core 1 and invalidates Core 1
t₆ Core 1 writes again
M → I
I → M
RFO → ping-pong continues
The M-I toggle of the same line is cache line bouncing. Intra-socket it's 30–50ns per round trip, cross-socket 150–300ns. The LOCK CMPXCHG instruction itself is ~10ns when the cache line is already in M state, but going through RFO makes it 30–100× more expensive. This is the real cost of lock contention.

What an atomic instruction actually does

When LOCK CMPXCHG executes on x86:

  1. CPU fetches the target memory address’s cache line into Modified state (RFO message invalidates other cores’ copies)
  2. While the line is in M state — even if other cores send RFO simultaneously, the cache coherence mechanism serializes them — compare and swap proceeds
  3. The line stays in M or gets evicted at some later point

The point: atomic’s “atomicity” comes from the hardware fact that cache coherence serializes RFO requests. There isn’t a separate lock — the cache protocol’s invariant that a single line can only be in M state on one core at a time is the lock.

ARM is slightly different. ARM uses LL/SC (Load-Linked / Store-Conditional) pairs.

loop:
    LDXR  w0, [x1]        ; load-exclusive, tracks address x1
    CMP   w0, w_expected
    B.NE  fail
    STXR  w2, w_desired, [x1]  ; store-exclusive, fails if x1 was modified meanwhile
    CBNZ  w2, loop        ; retry on failure

LL/SC’s advantage: immunity to part of the CAS ABA problem. Its drawback: failure is possible so a retry loop is needed.

cache line bouncing — the real cost of locks

What happens when two cores take turns grabbing the same lock?

Core A takes the lock → cache line in M state on A Core B tries CAS → cache line moves from A to B; A becomes I, B becomes M Core A takes the lock again → moves from B to A; B becomes I, A becomes M

This ping-pong is cache line bouncing. Cost per bounce:

ScenarioCost
Cores sharing same L3 (intra-socket)~30–50ns
Different sockets (NUMA, cross-socket)~150–300ns
Different NUMA nodes~hundreds to 1000ns

The LOCK CMPXCHG instruction itself costs ~10ns when the line is already in M state. With bouncing, it becomes 30–100× more expensive. That’s the real cost of lock contention, and why the cache effect matters more than the lock itself.

Hold on, let’s settle this. If cache coherence guarantees consistency automatically, why do we still need locks?

They guarantee different things.

  • Cache coherence promises “all copies of one cache line will eventually be consistent.” Per-memory-location.
  • Lock promises “make a critical section spanning multiple memory locations behave as a single transaction.” Per-semantic-unit.

For example account_a -= x; account_b += x; spans two cache lines. Coherence guarantees consistency of each line, but a lock guarantees that both changes appear together.

False sharing — unintentional cache line bouncing

Consider this struct.

1
2
3
4
struct Counters {
    std::atomic<int> threadA_count;  /* offset 0  */
    std::atomic<int> threadB_count;  /* offset 4  */
};

Thread A only touches its counter, thread B only touches its own. Logically there’s no shared data. But if the two atomics sit in the same 64-byte cache line, A’s write invalidates B’s line and vice versa. They ping-pong meaninglessly. This is false sharing.

The fix is cache line alignment.

1
2
3
4
struct Counters {
    alignas(64) std::atomic<int> threadA_count;  /* line 0 */
    alignas(64) std::atomic<int> threadB_count;  /* line 1 */
};

C++17 provides the std::hardware_destructive_interference_size constant to know the cache line size at compile time.

In game engine code, false sharing typically shows up in:

  • per-thread stat arrays — int hits[NUM_THREADS] with adjacent slots
  • producer/consumer ring buffer head/tail pointers
  • arrays of small Job structures packed tightly

Measurement is hard. The CPU counter mem_load_uops_l3_hit_retired.xsnp_hitm (Intel) is one indicator that catches false sharing. Linux perf c2c automates this.

Memory barrier — the hardware side of ordering

Cache coherence makes values consistent, but ordering is a separate problem. CPUs reorder instructions, and store buffers briefly hold writes inside the issuing core.

1
2
3
4
5
6
7
/* Thread A */
data = 42;
ready = true;       /* is the order of these two stores guaranteed to other cores? */

/* Thread B */
if (ready)
    use(data);      /* is data really 42? */

x86 has TSO (Total Store Order) so store-store reordering doesn’t occur, but ARM is weakly ordered and the above code can break. ARM needs a memory barrier (DMB ST) between the two stores.

One thing a lock promises us is this ordering. mutex.unlock() internally includes a release-semantics barrier, and mutex.lock() includes an acquire-semantics barrier, so values written inside the lock appear consistently outside.

Barrierx86ARMMeaning
store-store(automatic)DMB STafter prior stores complete, then later stores
load-load(automatic)DMB LDafter prior loads complete, then later loads
store-loadMFENCEDMB SYprior store becomes globally visible, then later loads
fullMFENCEDMB SYserialize all memory operations

Part 12 handles this in depth. The point here is just that a lock is a bundle of atomic instructions + barriers.

Summary: what one lock causes

A mutex.lock() / mutex.unlock() pair traverses these steps, from user space to hardware cache:

mutex.lock() → critical section → mutex.unlock() — all layers
mutex.lock()
1
HW
request cache line ownership
send RFO message to other cores
~30–300ns
2
HW
invalidate other cores' copies
bring that line into I state and fetch data (becomes M)
RFO response
3
CPU
execute atomic instruction
LOCK CMPXCHG (x86) · LDXR/STXR (ARM): try 0 → 1
~10ns
✓ success — fast path done, enter critical section
✗ failure — proceed to slow path
4
OS
enter kernel wait queue
futex_wait / NtWaitForAlertByThreadId / __ulock_wait — register on wait queue and sleep
~hundreds ns + sleep
5
CPU
acquire ordering constraint
prevent memory ops after the atomic from being reordered before it — ARM uses DMB ISH or LDA*, x86 plain load suffices
~few ns
— critical section — no one else can enter on the same lock during this time
mutex.unlock()
6
CPU
release ordering constraint
prevent values written inside the lock from being reordered after the unlock store — ARM uses DMB ISH or STL*, x86 plain store suffices
~few ns
7
CPU
atomic store
set lock variable to 0 (including waiters-bit check)
~5ns
8
OS
wake waiters (only if present)
futex_wake / NtAlertThreadByThreadId / __ulock_wake
~hundreds ns
HWcache coherence / MESI CPUatomic instruction + barrier OSkernel wait queue (slow path only)

That’s the bottom of locks. Now we see how game engines solve synchronization on top of all this.


Part 6: Unity Synchronization — Main Thread, Job System, DOTS

Game engines don’t use locks as is. One frame at 60fps is 16.67ms, and a single lock contention can cost 100–300ns to a few μs. A thousand of them eat 1ms before anything else. So engines adopt lock-avoiding structures. Two ideas are central:

  1. Thread affinity: certain data is touched by only one thread (then no lock is needed at all)
  2. Dependency-based parallelism: instead of locks, build a dependency graph that prevents read/write conflicts at scheduling time

Unity does exactly these two.

Main Thread model

All Unity MonoBehaviour callbacks — Update, LateUpdate, FixedUpdate, OnGUI, OnTriggerEnter, etc. — execute on a single thread, the Main Thread. And almost every Unity API (Transform.position, GameObject.Find, Component.GetComponent, etc.) can only be called from the main thread. Calling from another thread throws UnityException: ... can only be called from the main thread.

This is an enormous simplification. There isn’t a single lock in the entire scene graph — because all access is on one thread.

Hold on, let’s settle this. Is Unity’s main thread OS thread #1?

“#1” is meaningless from the OS perspective. The main thread is whichever thread is created first when the Unity process starts; from the OS’s view it’s a normal pthread/Win32 thread. The Unity runtime simply records this thread’s ID and uses it for IsMainThread() checks. On macOS, the main thread is tied to NSRunLoop and runs alongside UI events.

The pitfall of the main thread model is one thing. Heavy work on the main thread is frame drops, period. That’s why Unity spins up worker threads separately and provides the abstraction called Job System on top.

Unity Job System

The Job System does two things at once.

  1. Parallel processing on native memory: doesn’t touch the managed heap, so GC-irrelevant
  2. Race detection at schedule time: dependency graph and NativeContainer safety check read/write conflicts. The attributes ([ReadOnly]/[WriteOnly]/[NativeDisableContainerSafetyRestriction]) are read by the compiler, but the actual conflict check is a runtime check at the time of Schedule() and is enabled only with the ENABLE_UNITY_COLLECTIONS_CHECKS macro (Editor/Development builds)

Basic usage:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
public struct ApplyVelocityJob : IJobParallelFor {
    [ReadOnly]  public NativeArray<float3> velocities;
    [WriteOnly] public NativeArray<float3> positions;
    public float deltaTime;

    public void Execute(int index) {
        positions[index] += velocities[index] * deltaTime;
    }
}

void Update() {
    var job = new ApplyVelocityJob {
        velocities = velocityBuffer,
        positions = positionBuffer,
        deltaTime = Time.deltaTime,
    };
    JobHandle handle = job.Schedule(positionBuffer.Length, 64);  /* batch=64 */
    handle.Complete();
}

Schedule doesn’t execute the job immediately — it registers it in the dependency graph. JobHandle tracks the job’s completion.

Internal flow — where does one call go?

Tracing what Schedule() triggers, step by step:

  1. Compile time: IL2CPP/Burst reads the [ReadOnly], [WriteOnly] attributes and marks velocities as read-only, positions as write-only
  2. Schedule call (main thread):
    • Creates a JobHandle and registers dependencies
    • Copies the Job struct into unmanaged memory (so workers don’t touch the main heap)
    • Inspects the NativeContainers’ AtomicSafetyHandle — if another in-flight job writes the same container, throws a runtime exception (not at compile time)
  3. Worker thread (Unity Job Worker N):
    • Pulls a job from its own wait list (work-stealing deque)
    • For IJobParallelFor, slices the index range into batches (64 each) and distributes
    • Calls Execute(index) for each batch
  4. Complete call (main thread):
    • handle.Complete() checks that all transitively dependent jobs are done in the dependency graph
    • If not done, the main thread also acts as a worker and steals jobs (work-stealing — main thread becomes a worker too)

How does data flow between cores?

Trace how positionBuffer is touched by a worker thread, in cache units.

  1. At Schedule time: the first 64 slots of positionBuffer (16 bytes × 64 = 1024 bytes = 16 cache lines) are in the main thread CPU’s L1/L2 (touched in the previous frame).
  2. Worker thread starts: worker N receives Execute(0) ~ Execute(63). If the worker is on a different core, RFO messages are issued for the 64 slots’ cache lines. The lines move from the main thread’s L1 to the worker core’s L1 — at 30–50ns each intra-socket, 16 lines means ~600ns warmup for the first batch.
  3. Batch execution: as the worker processes 64 sequentially, the cache lines stay in the worker core’s L1. With 16-byte slots, 4 per line, the line is touched 4 times in M state — cache hit.
  4. Next batch (64–127): another 16 lines move to the worker’s L1. The previous 16 lines get evicted or remain in worker’s L2/L3.
  5. At Complete time: when the main thread reads the results, all lines RFO back to the main CPU.

Two costs are visible:

  • First batch warmup: cache lines moving from main → worker
  • Result reclamation: lines moving back from worker → main

So if a Job’s input/output is small, the scheduling overhead itself can exceed the work cost. The batchCount parameter (64 above) of IJobParallelFor tunes this tradeoff. Too small means cache miss + dispatch overhead at every batch boundary; too large means load balancing breaks. Unity docs guide “16–64 is usually better than 1” for this reason.

The figure below shows the Job dependency DAG and worker mapping. The top is the dependency graph (logical order), the bottom is the actual worker thread mapping (time axis).

Unity Job System — dependency DAG and worker mapping
dependency DAG (logical order)
VelocityJob
read input
CollisionJob
read AABB
ApplyVelocityJob
write Position
RenderBoundsJob
read Position
CullingJob
read Position
read-after-write — dependency added automatically
worker thread mapping (actual execution, time →)
Main Thread
Update / Schedule
work-stealing
Complete
Worker 1
idle
Vel[0..15]
ApplyVel[0..15]
RenderB[0..15]
Worker 2
idle
Vel[16..31]
ApplyVel[16..31]
Culling[0..31]
Worker 3
idle
Collision[0..63]
wait dep
RenderB[16..31]
scheduleparallel readswritesparallel readssync
data flow — while Worker 1 runs ApplyVel[0..15]
at Schedule
positionBuffer[0..15] cache lines sit in main CPU L1
worker starts
RFO messages move the lines to the worker CPU's L1 (~30–50ns × 4 lines)
batch execution
16 slots = 4 lines, all stay in worker L1 (cache hit, M state)
Complete
result lines RFO back to main CPU (on read)
Not a single lock. Dependencies between Jobs are expressed as a graph, read-after-write conflicts are serialized at Schedule time, and jobs sharing the same read permission flow freely in parallel. AtomicSafetyHandle's runtime check at Schedule() time (Editor/Development builds only) catches container-permission conflicts and throws exceptions.

NativeContainer and AtomicSafetyHandle

Unity catches races even on non-GC native memory. How?

NativeContainers like NativeArray<T>, NativeList<T>, NativeQueue<T> carry an internal AtomicSafetyHandle. It’s debug metadata enabled only in Editor and Development builds. Roughly:

1
2
3
4
struct AtomicSafetyHandle {
    int version;          /* has the container been disposed? */
    AtomicSafetyNodePtr nodePtr;  /* manages read/write reader lists */
}

The handle tracks:

  • Jobs currently writing this container: 0 or 1 (if 1, no one else can even read)
  • Jobs currently reading this container: N (writers blocked)
  • Whether the container is alive (DisposeSentinel): throws immediately on access to a disposed container

At Schedule time, Unity compares the job’s [ReadOnly]/[WriteOnly] markings with the AtomicSafetyHandle state and throws an exception when it finds a possible conflict. It blocks the Schedule call itself.

1
2
3
4
5
6
var a = new NativeArray<int>(100, Allocator.TempJob);
var jobA = new WriteJob { data = a }.Schedule();          /* writes a */
var jobB = new ReadJob  { data = a }.Schedule();          /* reads a — conflict! */
/* InvalidOperationException: The previously scheduled job WriteJob writes
   to the NativeArray a. You must call JobHandle.Complete() on the job
   before you can read from the NativeArray safely. */

The fix is to express the dependency:

1
2
3
var handleA = jobA.Schedule();
var handleB = jobB.Schedule(handleA);   /* B after A */
handleB.Complete();

To bypass this, you add [NativeDisableContainerSafetyRestriction], but that’s a declaration of “I take responsibility for the race.” Use only when truly safe — e.g., two jobs with non-overlapping index ranges.

Hold on, let’s settle this. Does AtomicSafetyHandle work in production builds?

No. It’s enabled only in Editor and Development builds, gated by the ENABLE_UNITY_COLLECTIONS_CHECKS macro. Release builds strip all safety check code. This is not a static guarantee — a clean run in the Editor means for the code paths exercised at that time the safety system didn’t see a conflict, but it isn’t a proof that no race will occur under different inputs and different timing. So before shipping to production, you need tests that exercise as many code paths as possible, and [NativeDisableContainerSafetyRestriction] paths bypass the check entirely, so they require extra care.

Burst — how is atomic guaranteed?

In short — Burst compiles IL to native code via LLVM, but it does not turn ordinary array accesses into atomics. If you need atomic, you must call Interlocked.* explicitly, and only those get emitted as hardware atomic instructions. That’s the key point; the 4–5 steps below are compiler internals and are collapsed.

▸ Burst compile pipeline in detail — IL → native, atomic emit, NoAlias, SIMD mapping

A job tagged [BurstCompile] is compiled to native code instead of IL (LLVM backend). What Burst does when compiling NativeArray access:

  1. Ordinary array[i] reads/writes are plain load/store — not atomic. Races are possible, and conflict prevention depends on schedule-time dependency + safety system
  2. Only explicit atomic calls like Interlocked.* are emitted directly as hardware atomic instructions like x86 LOCK XADD / LOCK CMPXCHG or ARM LDADD / LDXR-STXR
  3. Keeps bounds checks SIMD-friendly, or enables them only in the Editor
  4. Uses [NoAlias] hints to assume ptr aliasing — enables more aggressive compiler optimization
  5. Maps SIMD intrinsics (Unity.Mathematics.float4) to SSE/AVX/NEON
1
2
3
4
5
6
7
8
9
10
11
12
[BurstCompile]
public struct CountJob : IJobParallelFor {
    [NativeDisableContainerSafetyRestriction]
    public NativeArray<int> counter;   /* length 1, all jobs share the same slot */

    public void Execute(int i) {
        unsafe {
            Interlocked.Increment(ref UnsafeUtility.As<int, int>(
                ref counter.GetUnsafePtrReadOnly()[0]));
        }
    }
}

The moment Burst sees Interlocked.Increment, on x86 it emits directly as LOCK INC or LOCK XADD. That is, the hardware atomic instructions from Part 5 happen exactly as is. Cache line bouncing costs apply just the same.

Hold on, let’s settle this. What happens when jobs write to slots in the same cache line simultaneously?

That’s exactly false sharing. In the example above, with a length-1 counter, every job does atomic increment, so that 1 byte (actually 4 bytes, 1 line) ping-pongs between cores endlessly. A million calls vanish 50–100ms in cache bouncing alone. The fix is per-thread local counters on different cache lines, summed at the end — [ThreadStatic] or NativeQueue<int>.Concurrent’s enqueue pattern.

DOTS / ECS — system-level dependencies

SystemBase or ISystem-based ECS takes the above mechanism one step further.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public partial struct MoveSystem : ISystem {
    public void OnUpdate(ref SystemState state) {
        new MoveJob { dt = SystemAPI.Time.DeltaTime }
            .ScheduleParallel();
    }
}

[BurstCompile]
public partial struct MoveJob : IJobEntity {
    public float dt;
    void Execute(ref LocalTransform t, in Velocity v) {
        t.Position += v.Value * dt;
    }
}

IJobEntity automatically extracts read/write component access from the signature (ref = write, in = read). The source generator creates metadata for this info at compile time, and at runtime the World scheduler reads it and handles the following automatically:

  • Two systems writing the same Component get automatic ordering
  • Systems touching different Components automatically run in parallel

This is the most extreme example of resolving synchronization not with locks but with a dependency graph. The programmer never locks, but the scheduler checks read/write permissions at Schedule time and serializes conflicts (Editor/Development builds get additional safety system checks).

Internally, ECS maintains dependency info per Component with ReaderWriterLock semantics. Combined with Burst, it looks like the diagram below.

DOTS ECS scheduler — automatic parallelization via Component permission analysis
Component permission analysis (at compile time)
SpawnSystem
write add/remove Entity
MoveJob
write LocalTransform.Position
read Velocity
RotateJob
write LocalTransform.Rotation
RenderBoundsJob
read LocalTransform.Position
write RenderBounds
scheduler reads read/write patterns and adds dependency edges automatically
execution timeline (Frame N)
Main
SpawnSystem
schedule jobs
— work-stealing —
SyncPoint
RenderSystem
W1
idle
MoveJob [Position write]
RenderBoundsJob [Position read]
idle
W2
idle
RotateJob [Rotation write]
idle (no Position dep)
idle
↑ same time — different Components, so parallel without locks ↑ Position read waits for MoveJob (automatic dependency)
The programmer specifies neither locks nor dependencies. The ECS scheduler reads each job/system's Component permissions (ref=write, in=read) and serializes only read-after-write conflicts. Result: work touching different Components runs in parallel freely; touching the same Component gets automatic ordering.

The dependency where RenderBoundsJob waits for MoveJob to complete is automatically inferred by ECS without the developer specifying it. Both jobs touch LocalTransform.Position, MoveJob writes and RenderBoundsJob reads, so the ECS scheduler automatically adds the dependency edge.

One frame’s memory flow — summary

Synchronization-related events Unity triggers within a single frame (16.67ms), in time order:

TimeLocationWhat happens
0msMainInput/MonoBehaviour Update — main thread only, no locks
2msMainSchedule jobs, build dependency graph
2~10msWorker 1~NWorkers execute jobs in dependency order; NativeArray cache lines RFO-move to worker cores
10msMainJobHandle.Complete() or sync point — main also acts as a worker via work-stealing
12msRender threadsubmit command buffer to GPU (next section)
16msGPUpresent, next frame begins

That covers Unity. Unreal has a similar philosophy but with stronger thread separation.


Part 7: Unreal Engine Synchronization

Unreal has a more explicit multi-thread model than Unity. The engine itself is split into multiple named threads, and inter-thread communication is built almost entirely on lock-free queues.

Four Named Threads

A typical Unreal game has these threads always running:

ThreadResponsibilityOS thread
Game ThreadTick, gameplay logic, Blueprint, AImain thread (Unity’s main equivalent)
Render Threadhigh-level render command generation (FRDG, RHI command authoring)separate OS thread
RHI ThreadGPU API calls (D3D12/Vulkan/Metal/Mantle)separate OS thread
Audio Threadsound mixing, voice managementseparate OS thread
Worker ThreadsTaskGraph job executionone per core

The core idea is each thread holds data only it touches, and inter-thread communication goes through explicit command queues. Instead of locking, data ownership is bound firmly to a thread.

Game Thread → Render Thread — one frame’s flow

Unreal’s Render Thread usually runs one frame behind the Game Thread (Epic’s docs describe it as 0 or 1 frame behind — if Game finishes fast Render can catch up; if Render is heavy, Game blocks on sync next frame). This article describes the “one frame behind” case under normal load. When Game runs frame N’s logic, Render builds frame N-1’s render commands, and RHI submits frame N-2 to the GPU.

Unreal 4-thread pipeline — same moment, different frames
Thread ↓ / Frame →
Frame N
Frame N+1
Frame N+2
Game Thread
N logic
Tick, AI, Physics, Blueprint
N+1 logic
N+2 logic
Render Thread
N-1 render
RDG build, FMeshBatch generation
N render
N+1 render
RHI Thread
N-2 submit
D3D12/Vulkan/Metal API
N-1 submit
N submit
GPU
N-3 draw + present
actual pixels displayed
N-2 draw + present
N-1 draw + present
→ ENQUEUE_RENDER_COMMAND
Game pushes data to Render (lock-free MPSC queue)
→ RHI command list
Render dispatches GPU commands to RHI
→ GPU submit
RHI submits to GPU, tracked by fence
← FRenderCommandFence
Only when Game needs to wait for Render (rare)
Only one thread touches the same data at the same moment. While Game makes N, Render handles N-1 and RHI handles N-2, so almost no lock is needed. The tradeoff is +1 frame of input lag — the exact realization of one-frame pipelining from Part 9.

Thanks to this pipelining, almost no lock is needed. Only one thread touches the same data at any moment.

TaskGraph — Unreal’s Job System

FTaskGraphInterface is the Unreal equivalent of Unity Job System.

1
2
3
4
5
6
7
8
9
10
11
12
FGraphEventRef MyTask = FFunctionGraphTask::CreateAndDispatchWhenReady(
    [Data]() {
        // executes on a worker thread
        ProcessData(Data);
    },
    TStatId(),
    nullptr,                      /* dependency prerequisite */
    ENamedThreads::AnyThread      /* which thread to run on */
);

/* wait for the result later */
FTaskGraphInterface::Get().WaitUntilTaskCompletes(MyTask);

Features:

  • Named thread selection: dispatch to GameThread, RenderThread, RHIThread, or AnyThread
  • Dependency edge expression: pass a prerequisite array to start after they complete
  • Automatic work-stealing: AnyThread is picked up by the least busy worker
  • Hierarchical tasks: spawn child tasks from within a task

ENQUEUE_RENDER_COMMAND — a command queue with no visible lock

The most common macro for passing data from Game Thread to Render Thread.

1
2
3
4
5
6
7
8
FVector NewPos = Actor->GetActorLocation();
FRHICommandListImmediate& RHICmdList = ...;

ENQUEUE_RENDER_COMMAND(UpdateActorPos)(
    [NewPos](FRHICommandListImmediate& RHICmdList) {
        /* this lambda executes on the Render Thread */
        UpdateConstantBuffer(RHICmdList, NewPos);
    });

What this macro means: wrap the lambda into a command object, enqueue it on the Render Thread’s command queue, and the Render Thread pulls in FIFO order to execute. Epic’s docs don’t explicitly guarantee the internal queue is lock-free MPSC (a version-dependent implementation detail), and the macro’s contract is just “ordered + execute on Render Thread”. Conceptually it’s a multi-producer (workers can enqueue too) single-consumer (Render Thread only pops) pattern, which is generally a natural fit for lock-free MPSC.

Hold on, let’s settle this. How does a lock-free MPSC queue work without locks?

The key is the asymmetry: producers append to the queue tail via atomic CAS or atomic exchange, and the consumer is single-threaded so no synchronization is needed. The most famous design is the Vyukov MPSC queue — one atomic exchange grabs the prev tail, then updates prev tail’s next pointer to the new node. No retry loops, almost zero contention.

The Vyukov MPSC core looks like this.

1
2
3
4
5
6
7
8
9
struct Node { std::atomic<Node*> next; T payload; };
std::atomic<Node*> tail;

void push(Node* node) {
    node->next.store(nullptr, std::memory_order_relaxed);
    Node* prev = tail.exchange(node, std::memory_order_acq_rel);
    prev->next.store(node, std::memory_order_release);
}
/* pop is single-consumer, almost no atomics needed */

This pattern underlies nearly every inter-thread communication in Unreal. That’s why ENQUEUE_RENDER_COMMAND, called hundreds of times per frame, has no lock contention.

FRenderCommandFence — when Game must wait for Render

Sometimes Game Thread needs to know “the commands sent to Render Thread really finished.” For example, to destroy a GPU resource safely, you must ensure the Render Thread no longer touches it.

1
2
3
FRenderCommandFence Fence;
Fence.BeginFence();      /* mark all render commands up to this point */
Fence.Wait();            /* block until marked commands all finish */

BeginFence enqueues a fence marker into the Render Thread’s queue. Wait makes Game Thread sleep on an FEvent until the fence is processed. This is essentially the only explicit synchronization point between Game ↔ Render.

FCriticalSection / FRWLock — explicit locks

Of course explicit locks are sometimes needed. Unreal provides:

  • FCriticalSection: abstraction over Windows CRITICAL_SECTION (pthread_mutex on others). Standard mutex.
  • FRWLock: Reader-Writer lock. On macOS mapped to pthread_rwlock instead of os_unfair_lock.
  • FScopeLock: RAII helper (equivalent to std::lock_guard).
  • TQueue<T, EQueueMode::Spsc>: lock-free single-producer single-consumer queue.
  • TQueue<T, EQueueMode::Mpsc>: lock-free multi-producer single-consumer.

Engine code itself prefers lock-free queues and TaskGraph dependencies; FCriticalSection appears occasionally in gameplay code (on top of the gameplay framework).

Unity vs Unreal

ItemUnityUnreal
Main threadone, all APIs go through itGame Thread, gameplay only
Render separationyes (Render Thread, fewer explicit APIs)strong (Render Thread + RHI Thread + RDG)
Job abstractionJob System + DOTSTaskGraph + Async Tasks
Compile-time race detectionNativeContainer + AtomicSafetyHandlenone (runtime asserts)
Lock-avoidance philosophydependency graph + main thread affinitynamed threads + lock-free queues
ECSDOTS / Entities (official)Mass Entity (5.x), various unofficial

Unity goes for making dependencies expressible safely and validating at schedule time (in Editor/Development) (safety on by default). Unreal goes for binding data to threads and communicating through command queues (performance by convention). Neither locks much, but the reasons and verification timing differ.


Part 8: Game Engine Patterns That Avoid Locks

Since locks are barely used inside engines, patterns that bypass them have evolved. Patterns you can use directly in game code:

Double Buffering

The simplest and most common. Alternate between two buffers. But to use it safely you need a guarantee that read and write don’t overlap in time — usually a frame boundary or fence.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
/* Assumption: GameTick and PhysicsTick are called sequentially at frame boundary.
   Within one frame, the worker starts the next write only after read finishes. */

struct PhysicsState {
    std::vector<Transform> transforms;
};

PhysicsState buffers[2];
std::atomic<int> readIdx{0};   /* main reads */

/* Frame N — Worker picks the next buffer to write (read-finished guarantee required) */
void PhysicsTick() {
    int w = 1 - readIdx.load(std::memory_order_acquire);
    UpdatePhysics(buffers[w]);
    readIdx.store(w, std::memory_order_release);  /* publish */
}

/* Frame N — Main reads the latest published buffer */
void GameTick() {
    int r = readIdx.load(std::memory_order_acquire);
    Render(buffers[r]);
}

No lock anywhere. A single atomic swaps the “currently readable buffer index.” Memory cost is 2× the buffer, but contention is zero. However, this code isn’t safe if writer and reader run independent asynchronous loops. Right after publish, while reader still reads that buffer, if worker tries to write a different buffer in the next tick, the index could collide. Game engines typically design sync at frame boundaries so read and write phases are time-separated — only with that assumption is double buffer safe. Otherwise you need triple buffer or explicit sequence handoff.

Where this pattern appears in game engines:

  • Physics → Render: swap buffer at frame boundary, render thread draws N while physics prepares N+1
  • AI tick: compute next-frame behavior, swap at frame boundary
  • Network input: collect packets during a frame, swap at frame boundary

Triple Buffering

When writer and reader run independent loops and read/write can overlap in time, double buffer isn’t enough. Three buffers handle it — “currently drawing,” “just made,” “next to write.” Used in OS graphics stack’s swap chain, V-sync queue, game engine ring buffers, etc. With three buffers, the writer always finds a slot the reader isn’t using.

Lock-free SPSC Ring Buffer

A single-producer single-consumer queue is implementable without locks using just two atomics.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
template<typename T, size_t N>
struct SpscRing {
    T buf[N];
    alignas(64) std::atomic<size_t> head{0};  /* producer writes */
    alignas(64) std::atomic<size_t> tail{0};  /* consumer writes */

    bool push(const T& v) {
        size_t h = head.load(std::memory_order_relaxed);
        size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire))
            return false;  /* full */
        buf[h] = v;
        head.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire))
            return false;  /* empty */
        out = buf[t];
        tail.store((t + 1) % N, std::memory_order_release);
        return true;
    }
};

alignas(64) keeps head and tail on different cache lines — preventing false sharing. Even on different cores, producer and consumer have no contention.

This structure underlies game ↔ render command queues, audio sample buffers, log queues, and almost any 1:1 communication.

Per-thread accumulation

When multiple threads must increment a counter, false sharing is a common trap. The fix is each thread holds its own slot, summed at the end.

1
2
3
4
5
6
7
8
9
struct alignas(64) Slot { int64_t v = 0; };
Slot per_thread[NUM_THREADS];

/* each thread */
per_thread[tid].v++;        /* plain store, not atomic */

/* summation (single thread) */
int64_t total = 0;
for (auto& s : per_thread) total += s.v;

Thanks to alignas(64), each slot owns its own cache line. A million increments are 50–100× faster than a false-sharing lock-free counter.

Frame-locked sync — explicit sync points

Instead of taking locks constantly per job, sync only at frame boundaries. Within a frame, one thread touches its own data, and main integrates all thread results at frame end.

Naughty Dog’s fiber system (see Part 9) is the extreme of this idea — split a frame into thousands of fibers, but advance to the next frame only at a sync point where all fibers complete.

Summary: the lock-avoiding mindset

Common traits of patterns used instead of locks inside engines:

  1. Bind data to a thread — when one thread owns its data, no lock needed
  2. Explicit communication channels — lock-free queues for inter-thread data transfer
  3. Time separation — double/triple buffer to separate read and write in time
  4. Spatial separation — per-thread slots to avoid false sharing
  5. Sync rarely — concentrate cost at natural sync points like frame boundaries

Locks themselves aren’t bad — lock contention and cache bouncing are expensive, and the patterns above naturally avoid both.


Part 9: The Cost of Locks — Measuring

Measure, don’t guess. Tools for confirming whether a lock is really expensive and where the cost comes from.

Linux — perf and perf c2c

1
2
3
4
5
6
7
8
9
# syscall (futex_wait/wake) frequency
$ perf stat -e syscalls:sys_enter_futex ./game

# cache miss, hitm (modified line hit on another core)
$ perf stat -e mem_load_uops_l3_hit_retired.xsnp_hitm ./game

# false sharing detection (Cache-to-Cache analysis)
$ perf c2c record ./game
$ perf c2c report

perf c2c is the standard tool for false sharing diagnosis. Frequent HITM (Hit Modified) events on the same cache line make that line a prime suspect.

Windows — Concurrency Visualizer, ETW

Visual Studio’s Concurrency Visualizer shows per-thread CPU usage, lock contention blocks, and I/O waits. WPA’s “Wait Analysis” page gives the same info in more detail.

1
2
3
4
# Track lock contention
wpr -start LockHeldTimes -filemode
# (run game)
wpr -stop trace.etl

macOS — Instruments System Trace

The “Thread State” track in the System Trace template visualizes thread blocking. “Pthread mutex contention” markers show separately, telling you which mutex is contended.

1
2
# Or measure on the fly via dtrace
$ sudo dtrace -n 'pid$target:libsystem_pthread:_pthread_mutex_lock:entry {@[ustack()]=count();}' -p <pid>

Cross-platform — Tracy Profiler

Tracy provides macros that directly track mutex usage.

1
2
3
4
5
6
7
8
9
10
#include "Tracy.hpp"
#include "TracyLock.hpp"

TracyLockable(std::mutex, m);

void DoWork() {
    std::lock_guard<LockableBase(std::mutex)> lk(m);
    ZoneScoped;
    /* ... */
}

Lock/unlock times and contention durations of mutexes wrapped with LockableBase are visualized on Tracy’s timeline. You can immediately see which mutex is hot.

Built-in engine profilers

  • Unity Profiler: Job System tab shows worker thread utilization and dependency wait time
  • Unreal Insights: visualizes TaskGraph, fence wait time, ENQUEUE_RENDER_COMMAND call frequency
  • PIX (Xbox/PC): shows D3D12 fence wait and RHI thread blocking

One-line diagnosis

“My game is slow — is the lock the cause?” The one-line answer is “if spinning or waiting shows up in the thread state visualization, that’s the cost.” Without even looking at code, looking only at the thread state graph tells you immediately whether a lock is really the problem.


Conclusion

Pulling together everything from this part:

The nature of race condition:

  • counter++ decomposes into load/modify/store, breaking atomicity
  • data race (undefined behavior in C++/Rust) and race condition (execution-order dependent) are different concepts
  • Locks guarantee atomicity, visibility, and ordering together

The lock family:

  • Mutex, Spinlock, Semaphore, RWLock, CondVar, Monitor, Barrier, Latch
  • Just different policies over the same atomic foundation

Lock construction:

  • Peterson works in theory, real CPUs need memory barriers
  • Hardware atomics: x86 LOCK CMPXCHG, ARM LDXR/STXR
  • Spin vs Sleep compares critical section length against context-switch cost

OS primitives:

  • Linux futex (2002), Windows SRWLock (Vista, 2007), macOS os_unfair_lock (2016)
  • All share the same idea: fast path in user-space atomic, slow path only enters kernel
  • macOS os_unfair_lock encodes owner thread ID for QoS donation

Hardware mechanism:

  • CPU cache: L1 (4 cycle) ~ L3 (50 cycle) ~ DRAM (300 cycle)
  • MESI: a cache line can be in Modified state on only one core
  • Atomicity comes from cache coherence serializing RFO
  • Cache line bouncing: intra-socket 30–50ns, cross-socket 150–300ns
  • False sharing: different variables on the same line cause unintended contention

Unity synchronization:

  • Main thread model: all Unity APIs main only
  • Job System: JobHandle dependency graph, batch 64, work-stealing
  • NativeContainer + AtomicSafetyHandle: schedule-time runtime race check (Editor/Development only)
  • Burst: ordinary reads/writes are plain load/store; only Interlocked.* calls emit hardware atomics
  • DOTS: automatically analyzes Component read/write, scheduler adds dependency edges at schedule time

Unreal synchronization:

  • Game / Render / RHI / Audio Thread separation
  • TaskGraph + Named Thread + dependency
  • ENQUEUE_RENDER_COMMAND works as a lock-free MPSC queue (conceptually)
  • FRenderCommandFence is the only explicit Game ↔ Render sync point

Lock-avoiding patterns:

  • Double/Triple buffer, lock-free SPSC ring, per-thread slot
  • Frame-locked sync concentrates cost at one point

In the next part, Part 11 — Deadlock and Starvation, we cover cases where locks work fine but the program still hangs — circular waits between two locks, priority inversion, livelock. After that, Part 12 covers memory model and atomic ordering, and Part 13 goes deeper into lock-free data structures and Unity Job System internals.

Stage 2’s original question can now be answered.

Why does a program with two threads writing the same variable die only sometimes?

  • “Write” decomposes into 3 steps (load/modify/store), so another thread can interleave anywhere in the middle
  • “Interleaving” is decided by the scheduler at arbitrary moments, so “sometimes”
  • “To prevent it,” we need the lock abstraction built on cache coherence and atomic instructions

References

Textbooks

  • Herlihy, M., Shavit, N. — The Art of Multiprocessor Programming, 2nd ed., Morgan Kaufmann, 2020 — Ch.2~7 (Mutex algorithms, lock-free, hardware foundations) — the canonical text on multithread synchronization
  • Silberschatz, A., Galvin, P. B., Gagne, G. — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.6 (Synchronization Tools), Ch.7 (Synchronization Examples)
  • Tanenbaum, A. S., Bos, H. — Modern Operating Systems, 4th ed., Pearson, 2014 — Ch.2.3 (Interprocess Communication)
  • Russinovich, M., Solomon, D., Ionescu, A. — Windows Internals, 7th ed., Microsoft Press, 2017 — Ch.8 (System Mechanisms, SRWLock / pushlock internals)
  • Singh, A. — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — Ch.10 (Mach IPC, locks)
  • Drepper, U. — What Every Programmer Should Know About Memory, Red Hat, 2007 — the definitive primer on cache coherence — people.freebsd.org/~lstewart/articles/cpumemory.pdf
  • Gregory, J. — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8.6~8.7 (Multithreading, Job systems)
  • McKenney, P. E. — Is Parallel Programming Hard, And, If So, What Can You Do About It?, 2024 ed. — kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html — free book by RCU’s author

Papers

  • Peterson, G. L. — “Myths About the Mutual Exclusion Problem”, Information Processing Letters, 1981 — Peterson’s original paper
  • Dijkstra, E. W. — “Cooperating Sequential Processes”, Programming Languages, 1968 — introduction of the semaphore
  • Lamport, L. — “A New Solution of Dijkstra’s Concurrent Programming Problem”, CACM, 1974 — bakery algorithm
  • Lamport, L. — “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, IEEE TC, 1979 — definition of sequential consistency
  • Franke, H., Russell, R., Kirkwood, M. — “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux”, OLS 2002 — futex introduction — kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf
  • Drepper, U. — “Futexes Are Tricky”, Red Hat, 2011 — pitfalls of futex implementation — akkadia.org/drepper/futex.pdf
  • Sweeney, T. et al. — “Concurrent Programming in Unreal Engine” (GDC, EpicGames Dev) — TaskGraph design
  • Boehm, H.-J. — “Threads Cannot Be Implemented as a Library”, PLDI 2005 — motivation for C++ memory model
  • Adve, S. V., Gharachorloo, K. — “Shared Memory Consistency Models: A Tutorial”, IEEE Computer, 1996 — memory model comparison

Official documentation

Unity official

Unreal official

Game development / GDC

  • Gyrling, C. — Parallelizing the Naughty Dog Engine Using Fibers, GDC 2015 — fiber-based sync — gdcvault.com/play/1022186
  • Schreiber, B. — Multithreading the Entire Destiny Engine, GDC 2015 — Bungie’s lock-free design
  • Boulton, M. — Threading the Frostbite Engine, GDC 2009 — DICE’s Job system
  • Reinders, J., Roberts, B. — Multithreading for Visual Effects, A K Peters, 2014 — VFX engine lock-free patterns
  • Vyukov, D. — Lock-Free / 1024cores — Vyukov MPSC, scalability resources — 1024cores.net

Blogs / articles

  • Preshing, J. — Preshing on Programming — atomic, memory ordering series — preshing.com
  • Howells, D. et al. — Linux Kernel Memory Barriers (memory-barriers.txt) — official kernel memory model guide
  • Chen, R. — The Old New Thing — Windows critical section/SRWLock retrospectives
  • Giesen, F. — Reading List on Multithreadingfgiesen.wordpress.com
  • Oakley, H. — The Eclectic Light Company — macOS os_unfair_lock, QoS observations
  • Bonzini, P. — “QEMU and lock-free RCU” — practical RCU application

Tools

  • Linux: perf c2c, perf lock, bpftrace, lockstat
  • Windows: Concurrency Visualizer (VS), WPA Wait Analysis, PIX (for games)
  • macOS: Instruments System Trace, dtrace, sample
  • Cross-platform: Tracy Profiler (LockableBase), Intel VTune, AMD μProf
  • ThreadSanitizer (TSan): GCC/Clang’s static & dynamic data race detector
Reading Guide

This is a long article. You don't have to read it in one sitting.

  • First time through — Part 1 (race condition) → Part 2 (lock family) → Part 8 (patterns that avoid locks) gives you the whole picture
  • For Stage 2's central answer — "why only sometimes" — go all the way to Part 5 (hardware and MESI)
  • If you want the engine internals — Part 6 (Unity) and Part 7 (Unreal) are the main course
  • OS implementation when you need it — Part 3 (how locks are built), Part 4 (futex/SRWLock/os_unfair_lock) work well as references

The two deepest sections — MESI state transitions and Burst's compile pipeline — are collapsed. Expand them only when you need them.


This post is licensed under CC BY 4.0 by the author.