CS Roadmap Part 10 — Synchronization Primitives: How Does a Mutex Let Only One Thread In?
- CS Roadmap (0) — Why CS Knowledge Matters More Than Ever in the AI Era
- CS Roadmap (1) — Arrays and Linked Lists: Reading the Terrain of Memory
- CS Roadmap (2) — Stack, Queue, Deque: Powerful Abstractions Born from Restriction
- CS Roadmap (3) — Hash Tables: Conditions and Limits of O(1) Lookup
- CS Roadmap (4) — Trees: Order, Balance, and Guaranteeing O(log n)
- CS Roadmap (5) — Graphs: The Network of Relationships, the Science of Paths
- CS Roadmap (6) — Memory Management: Stack & Heap, GC, and the Things That Eat Your Frames
- CS Roadmap (Bonus) — Heaps and Priority Queues: The Economics of Partial Order
- CS Roadmap Part 7 — OS Architecture: The Forking Paths of Unix, NT, and XNU
- CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
- CS Roadmap Part 9 — Scheduling: Whom Does the OS Give the CPU To?
- CS Roadmap Part 10 — Synchronization Primitives: How Does a Mutex Let Only One Thread In?
- The reason two threads touching the same variable die only sometimes is that read-modify-write is not atomic. A lock is the abstraction "only one thread in the critical section at a time," and at the bottom of that abstraction there is always a hardware atomic instruction (x86 LOCK CMPXCHG, ARM LDXR/STXR)
- Mutex/Semaphore/RWLock/Spinlock/CondVar are just different policies on top of the same atomic. Linux puts the fast path in user space via futex and only enters the kernel on the slow path; Windows SRWLock and macOS os_unfair_lock follow the same idea
- The real cost of a lock is not the instruction itself but cache line bouncing. Under MESI, when one core grabs the lock, the corresponding line in other cores is Invalidated, and cross-socket the cost is 10× higher. False sharing is the trap that causes this unintentionally
- Unity's Job System builds a JobHandle dependency DAG that prevents read/write conflicts, and AtomicSafetyHandle catches races at Schedule() time (Editor/Development builds). Burst leaves normal NativeArray reads/writes as plain load/store and only emits hardware atomics for explicit Interlocked.* calls. DOTS analyzes ComponentSystem read/write to parallelize without locks
- Unreal separates Game/Render/RHI/Audio Threads and queues commands via TaskGraph + FRenderCommandFence. ENQUEUE_RENDER_COMMAND has no visible lock — it enqueues commands that the Render Thread executes in FIFO order
Introduction: What “Dies Only Sometimes” Really Means
When Stage 2 began, we asked one question.
Why does a program with two threads writing to the same variable die only sometimes?
After Part 7 — OS Architecture, Part 8 — Process and Thread, and Part 9 — Scheduling, we’ve reached half the answer. The OS swaps threads in invisible places, and those swap moments are unpredictable — that justifies the word “sometimes.”
This part covers the other half. A single line of code that reads, modifies, and writes is in fact not atomic, and we’ll see how the OS and CPU cooperate to manufacture the abstraction “only one at a time.”
What we’ll cover:
- The nature of a race condition: why
counter++loses data - The lock family: Mutex / Semaphore / RWLock / Spinlock / CondVar / Monitor / Barrier
- How locks are built: Peterson → Test-and-Set → CAS → hardware atomic
- OS-specific primitives: Linux futex, Windows SRWLock, macOS os_unfair_lock
- Hardware mechanism: CPU cache hierarchy, MESI, how atomic grabs cache line ownership, false sharing, memory barrier
- Unity synchronization (deep dive): Main Thread model, Job System, NativeContainer, AtomicSafetyHandle, Burst, DOTS — how data flows between cores
- Unreal synchronization: Game/Render/RHI Thread separation, TaskGraph, ENQUEUE_RENDER_COMMAND internals
- Game engine patterns: lockless ring buffer, double buffer to avoid locks, frame-locked sync
It sounds long, but compressed to one sentence: a lock is an abstraction, that abstraction sits on atomic instructions, and atomic instructions sit on cache line ownership. We’ll work our way down from the top.
Part 1: The Nature of a Race Condition
A one-line mystery
Look at this code.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
static int counter = 0;
void Worker() {
for (int i = 0; i < 1'000'000; ++i) {
counter++;
}
}
int main() {
std::thread t1(Worker);
std::thread t2(Worker);
t1.join(); t2.join();
std::cout << counter << "\n"; /* expected: 2,000,000 */
}
Two threads each added 1 a million times, so the result should be two million. But run it and you get a different number every time, almost always less than two million. On a desktop you’ll commonly see somewhere between 1,200,000 and 1,800,000.
The reason values vanish is that counter++ is one line but three steps in machine code. The figure below traces, in time order, what happens when two threads touch the same counter at the same time. The moment one increment is lost becomes obvious.
eax = 41
eax = 42
eax = 41
counter = 42
eax = 42
counter = 42
Cause: when B read 41 at t₁, A's result (42) wasn't in memory yet — read-modify-write is not atomic
race condition vs data race — not the same thing
Hold on, let’s settle this. Are race condition and data race the same thing?
They’re often used interchangeably but they differ academically.
- Data race: two threads access the same memory without synchronization, and at least one is a write. This is explicitly defined in multiple language memory models. In C++/Rust it is undefined behavior, while Java/Go define limited behavior within their memory models — not UB, but the result of a program that is not “correctly synchronized” can violate intuition.
- Race condition: a broader concept where the result depends on thread execution order. Even if two threads communicate safely through atomic variables, if business-logic outcomes depend on who arrives first, that’s a race condition.
So every data race causes a race condition, but not every race condition is a data race. Locks remove data races; race conditions are a design problem at a higher layer.
Atomicity, visibility, ordering — three guarantees
Synchronization primitives promise three things.
- Atomicity: an operation happens as a single indivisible unit, with no observable intermediate state
- Visibility: a value written by one thread is guaranteed to become visible to others
- Ordering: memory operations are seen by other threads in the order the program intended
counter++ broke because of atomicity. But visibility and ordering are separate problems too — CPUs reorder instructions, and caches don’t synchronize instantly. All three are tackled in depth in Part 12 (memory model and atomic operations), but they sit in the background throughout this part.
Hold on, let’s settle this. Can a plain read or plain write also be non-atomic?
Yes. On x86/ARM, naturally aligned 4-byte/8-byte reads and writes are generally atomic (separate from C++’s
std::atomicguarantees). But misaligned access, 16-byte SIMD, or 64-bit values on a 32-bit CPU can split a single store into two and let another core observe a half-written value. That’s why in C++ we usestd::atomic<T>instead of a plain variable — to tell the compiler “this really must be atomic.”
What a lock promises
The simplest way to keep counter++ from breaking:
1
2
3
4
5
6
std::mutex m;
/* ... */
{
std::lock_guard<std::mutex> lk(m);
counter++;
}
While lock_guard holds m, no other thread can enter on the same m. The result is always two million.
Two questions follow naturally.
- How does the OS manufacture the “one at a time” promise? — Part 3
- What does that promise cost? — Part 5
First let’s sort out the kinds of locks and how they differ.
Part 2: The Lock Family
Comparison at a glance
| Name | Essence | Who releases? | Wait style | Typical use |
|---|---|---|---|---|
| Mutex | 1-slot lock | locking thread | sleep | protecting a critical section |
| Recursive Mutex | re-entrant Mutex | locking thread | sleep | nested calls from same thread |
| Spinlock | 1-slot lock | locking thread | busy-wait | very short critical sections, kernel |
| Semaphore | N-slot counter | any thread | sleep | resource pools, producer-consumer |
| RWLock | many readers / one writer | locking thread | sleep | read-heavy data |
| Condition Variable | wait + signal | waking thread | sleep | condition-based synchronization |
| Monitor | Mutex + CondVar bundle | (per object) | sleep | Java synchronized |
| Barrier | wait for N threads to arrive | once all arrive | sleep | parallel-stage sync |
| Latch | one-shot countdown | when count hits 0 | sleep | init-complete signal |
Quick notes on each.
Mutex (Mutual Exclusion)
The most basic. Holds state 0 or 1, and only the thread that locked it can unlock it. If someone else has it, the incoming thread joins the wait queue and sleeps.
1
2
3
4
std::mutex m;
m.lock();
// critical section
m.unlock();
In C++ you almost always use a RAII wrapper like std::lock_guard or std::unique_lock. They guarantee unlock even on exception.
Hold on, let’s settle this. reentrant, recursive, thread-safe — three terms that get confused.
- Thread-safe: a function/object behaves correctly when called concurrently from multiple threads. A property observed from outside.
- Reentrant: a thread can re-enter a function it’s already executing (via interrupt or signal) safely. Conditions include no global variable use, no static buffers, etc.
- Recursive (lock): the same thread can acquire a lock it already holds. The counter goes up, and it must be released the same number of times.
Recursive mutex is convenient but usually a sign of poor lock design. If a deep function doesn’t know it already holds the lock and tries to acquire it again, that means the interface design left lock points unspecified.
Spinlock
Same state as Mutex but the failing thread does not sleep — it keeps retrying.
1
2
3
while (lock.test_and_set(std::memory_order_acquire)) {
/* busy-wait */
}
When is Spinlock better than Mutex? When the critical section is very short and context-switch cost outweighs spin cost. A classic case: kernel interrupt handlers can’t sleep at all, so only spinlock will do.
Conversely, if the critical section is long or you have more threads than cores, spinlock is a disaster. You’d burn CPU cycles on time that should be slept away.
Hold on, let’s settle this. What’s the difference between a spinlock and plain busy-wait?
A spinlock is a lock built on atomic primitives, while a plain busy-wait is just polling a regular variable. Polling a plain variable lets the compiler hoist the loop (
while(flag);becomes an infinite loop) or never see a change made by another core. A spinlock prevents both via atomic + memory ordering.
Semaphore
The oldest synchronization primitive, proposed by Dijkstra in 1965. A non-negative integer counter with two operations.
- P (wait, acquire): sleep if counter is 0, otherwise decrement by 1
- V (signal, release): increment by 1, wake one waiter if any
A Mutex is really “a binary semaphore initialized to 1.” But the semantics differ — a Mutex must be released by the locking thread, while a Semaphore can be released by anyone. That makes Semaphore suitable for resource pools (free slots in a connection pool) or expressing “remaining capacity / items” in producer-consumer queues.
Reader-Writer Lock
Many readers, one writer. Throughput improves when reads vastly outnumber writes.
1
2
3
4
5
6
7
8
9
std::shared_mutex m;
{
std::shared_lock<std::shared_mutex> r(m); /* multiple readers OK */
// read
}
{
std::unique_lock<std::shared_mutex> w(m); /* exclusive writer */
// write
}
Two pitfalls. First, the RWLock itself costs more than a plain Mutex. For very short critical sections, plain Mutex can be faster. Second, writer starvation — if readers keep arriving, a writer can wait forever. Most implementations provide a writer-preferring mode.
Condition Variable
Expresses “wait until a condition is satisfied.” Always used paired with a Mutex.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
std::mutex m;
std::condition_variable cv;
bool ready = false;
/* Waiter */
{
std::unique_lock<std::mutex> lk(m);
cv.wait(lk, []{ return ready; });
// woke up, ready == true
}
/* Notifier */
{
std::lock_guard<std::mutex> lk(m);
ready = true;
}
cv.notify_one();
The key is that cv.wait atomically does “release lock, sleep, on wake reacquire lock.” Without that, a notify arriving just before wait causes the lost wakeup problem.
Also, you pass the wait predicate as a lambda because of spurious wakeup. The standard permits CV to wake without the condition being satisfied, so you must re-check after waking.
Monitor
An abstraction that bundles Mutex + Condition Variable per object. Java’s synchronized keyword and wait/notify attached to every object are exactly this pattern. C#’s lock block is the same.
1
2
3
4
5
6
lock (gameState) {
while (!gameState.Ready)
Monitor.Wait(gameState);
/* work */
Monitor.PulseAll(gameState);
}
The monitor is embedded in the gameState object header so you don’t have to declare a separate mutex. Convenient, but the lock is tightly bound to one object, making lock granularity hard to adjust.
Barrier / Latch
Barrier: waits until N threads have all arrived at one point. Frame-locked parallel processing in games — “N physics update jobs must all finish before moving on” — is the canonical example. Reusable (cyclic barrier).
Latch: everyone waits until a counter reaches 0. Single-use. Useful for signaling once to all starting threads that initialization is done.
Summary: which lock when
The matrix below places locks on two axes — critical section length (horizontal) and read/write ratio (vertical) — and marks the suitable primitive. At a glance you can see how tools with the same atomic foundation but different policies find their place.
That covers the interface layer that users see. Now we go one layer down and see how the OS and compiler manufacture these promises.
Part 3: How Locks Are Built
Attempt 1 — Can software alone do it?
Start with the naive attempt. Can we build a lock for two threads using just one variable?
1
2
3
4
5
6
int locked = 0;
void lock() {
while (locked) ; /* spin */
locked = 1; /* got it! */
}
Obviously broken. Two threads that pass while(locked) simultaneously execute locked = 1 at the same time and both enter the critical section. There’s an empty gap between read and write where a race happens.
Peterson’s algorithm (1981)
You can build a lock in software alone. For two threads, the most elegant solution is Gary Peterson’s algorithm from 1981.
flag[2] = {0, 0} turn = 0 flag[i] = "I want to enter", turn = "next to yield"flag[0] = 1declare intent to enterturn = 1yield to Bwhile (flag[1] && turn == 1) ;wait if B also wants in and it's B's turnflag[0] = 0unlockflag[1] = 1declare intent to enterturn = 0yield to Awhile (flag[0] && turn == 0) ;spinning ...turn can hold only a single value (the last write wins). If that value is 0, A has yielded so B passes; if 1, vice versa — either way, exactly one thread escapes the while at step 3.The principle is combining “I yield” intent with “whose turn is it” agreement. Even when both threads enter at once, turn can hold only a single value, so only one passes.
Does this actually work? In theory yes, on real CPUs no.
Two reasons.
- CPU instruction reordering: the order in which
flag[self] = 1; turn = other;reaches memory can differ from code order. From another core, you might seeturnchange first whileflag[self]still looks like 0. - Store buffer: each core has a write buffer, so values you wrote aren’t immediately visible to other cores.
Solving both requires a memory barrier, which is essentially a hardware instruction. So software alone won’t do.
Attempt 2 — Hardware helps
What we fundamentally need is a single instruction that does “read-compare-write” atomically. CPU vendors provide special instructions for this. The three most common families, summarized in one figure:
old = *X
*X = true
XCHG · LOCK BTSARM
LDXR / STXR paircur = *X
cur == exp ?
*X = desired
LOCK CMPXCHGARM
LDXR / STXR · CAS (v8.1+)old = *X
*X = old + δ
LOCK XADDARM
LDADD (v8.1+)CAS(lock, 0, 1)enter critical section
retry
memory barrier included
store(lock, 0)CAS is the most general and powerful — it’s the basic unit of lock-free data structures (covered in Part 13). TAS suffices for simple mutual exclusion like a spinlock, and FAA is the core operation for counter increments and ticket locks.
Hold on, let’s settle this. Are “CAS” and “atomic” the same?
No. Atomic operation is the general concept of “any operation that happens as a single indivisible step.” CAS is one kind of atomic operation, with siblings TAS, FAA, LL/SC, and atomic load/store. C++’s
std::atomic<T>is an interface that abstracts these instructions, wherecompare_exchange_weakmaps to CAS andfetch_addto FAA.
CAS-based spinlock implementation
Building a spinlock with CAS:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
struct Spinlock {
std::atomic<bool> locked{false};
void acquire() {
bool expected = false;
while (!locked.compare_exchange_weak(
expected, true,
std::memory_order_acquire)) {
expected = false; /* gets clobbered on CAS failure */
while (locked.load(std::memory_order_relaxed))
__builtin_ia32_pause(); /* x86 PAUSE hint */
}
}
void release() {
locked.store(false, std::memory_order_release);
}
};
Two things worth noting.
- acquire / release memory order: ensures values written after acquiring the lock are visible to threads that later acquire the same lock. Details in Part 12.
PAUSEhint: on x86, inserting PAUSE in spin loops reduces power consumption and avoids memory-order violation penalties. ARM usesYIELDfor a similar purpose.
Spin or sleep — the decision rule
The same atomic underneath, just different policy, gives you spinlock or mutex. Which to use comes down to comparing critical section length against context-switch cost.
| Condition | Spin wins | Sleep wins |
|---|---|---|
| Critical section length | < 1μs | > 10μs |
| Threads vs cores | ≤ | > |
| Environment | interrupt handler, RT kernel | normal user-space code |
Modern mutex implementations are therefore adaptive. They spin briefly first, then fall back to sleep if they still can’t acquire. Linux glibc’s NPTL does this; Windows SRWLock does too.
Part 4: OS-specific Primitives
Linux futex — fast userspace mutex (2002)
Does pthread_mutex_lock need a syscall every time? If the lock is uncontended, there’s no reason to call into the kernel. That insight gave rise to futex.
Hubertus Franke, Rusty Russell, and Matthew Kirkwood introduced it into Linux in 2002. The core idea:
The pseudo-code commonly cited is a 3-state mutex implementation example (0 unlocked, 1 locked-no-waiters, 2 locked-with-waiters). But that state encoding is a user-space library’s policy choice, not part of futex itself — futex is the kernel-provided general wait/wake building block, and abstractions like mutex/semaphore/condvar are built on top with their own state encodings. The implementation above needs two bits because unlock must know whether waiters exist to decide whether to call the wake syscall.
A single uncontended futex operation costs roughly 10–20ns, less than one tenth of a syscall (~hundreds of ns). That’s why most user-space blocking lock implementations on Linux (pthread_mutex, sem_t, glibc’s std::mutex, etc.) are built on top of futex.
CAS(m, 0, 1)atomic_xchg(m, 2) · mark "waiters present"futex_wait(m, 2) syscall · enter kernel wait queue and sleepfetch_sub(m, 1) → check previous valuestore(m, 0) · unlockfutex_wake(m, 1) syscall · wake one waiterHold on, let’s settle this. How are “fast path” and “slow path” generally split?
In lock implementations, the fast path is the route that finishes as fast as possible when there’s no contention, and the slow path is the route that needs extra work when there is contention. Nearly every modern synchronization primitive — futex, SRWLock, os_unfair_lock, parking_lot — follows this pattern. Fast path: 1–2 CASes. Slow path: kernel entry or separate wait-queue management. The design relies on the statistical fact that locks are uncontended over 90% of the time.
Windows SRWLock and CRITICAL_SECTION
Windows has multiple generations of synchronization primitives, with different tradeoffs at each era.
Mutex (kernel object): the oldest. Handle-based, can be shared across processes. But every lock/unlock is a syscall, making it slow — ~hundreds of ns.
CRITICAL_SECTION: introduced in NT 4 (1996). Handles the fast path with user-space atomics and only sleeps via the WaitForSingleObject kernel event on contention. Same idea as futex, five years earlier. Drawbacks: a heavy struct (~40 bytes including internal counter and handle) and only works within a single process.
SRWLock (Slim Reader/Writer Lock): introduced in Windows Vista (2007). 8-byte pointer size, fast/slow path design almost identical to futex, with RWLock semantics. No init function needed (= SRWLOCK_INIT). For new code, SRWLock is almost always the answer.
1
2
3
4
5
6
7
8
9
SRWLOCK lock = SRWLOCK_INIT;
AcquireSRWLockExclusive(&lock); /* writer lock */
/* ... */
ReleaseSRWLockExclusive(&lock);
AcquireSRWLockShared(&lock); /* reader lock */
/* ... */
ReleaseSRWLockShared(&lock);
Internally, SRWLock packs the following into a single 8-byte atomic: locked bit, waiting bit, waking bit, reader count, and high bits of a wait-queue head pointer. Bit-packing is what makes a single CAS suffice for the fast path.
Condition Variable: CONDITION_VARIABLE, paired with SRWLock, was added in the same era.
Hold on, let’s settle this. Which to use:
CRITICAL_SECTIONorMutex?For new code within a single process, SRWLock is the answer. Lighter and supports RWLock.
CRITICAL_SECTIONis for compatibility with older code;Mutex(kernel object) is only when you need inter-process synchronization or a named mutex.
macOS os_unfair_lock and family
macOS is based on BSD pthread, but adds Mach port–based synchronization and Apple-specific locks.
pthread_mutex_t: the standard. Internally uses the Mach __ulock_wait / __ulock_wake syscalls (macOS’s futex equivalent). Same fast/slow path structure.
os_unfair_lock: introduced in macOS 10.12 / iOS 10 (2016). Created to replace OSSpinLock. OSSpinLock was vulnerable to priority inversion — if a low-priority thread held the lock and was demoted to E cores, a high-priority thread could spin forever.
1
2
3
4
5
os_unfair_lock lock = OS_UNFAIR_LOCK_INIT;
os_unfair_lock_lock(&lock);
/* ... */
os_unfair_lock_unlock(&lock);
The “unfair” in the name is intentional. It abandons FIFO fairness and instead records information about the holding thread in the lock. When priority inversion is detected, it temporarily boosts the holder’s priority (priority donation). This is enabled by encoding the owner’s thread ID in the 4-byte atomic.
Within os_unfair_lock’s 4 bytes is the owner thread’s mach thread port id. This lets the kernel know who holds the lock on the contended path and automatically perform QoS inheritance (boost). It connects directly to the QoS coverage in Part 9.
OSSpinLock: deprecated. Never use in new code.
NSLock / @synchronized: Objective-C object-level monitor. Every NSObject can potentially have a lock. Same idea as Java’s synchronized but costly.
Dispatch semaphore (dispatch_semaphore_t): GCD’s counting semaphore. P/V semantics.
The same idea across three OSes
| OS | uncontended (fast) | contended (slow) | recommended new-code lock |
|---|---|---|---|
| Linux | atomic CAS | futex(WAIT/WAKE) syscall | pthread_mutex, std::mutex |
| Windows | atomic CAS | NtWaitForAlertByThreadId (Win8+) | SRWLock |
| macOS | atomic CAS | __ulock_wait/wake syscall | os_unfair_lock |
Three OSes provide different names and APIs, but share these traits:
- Fast path: 1–2 user-space atomics
- Slow path: kernel wait queue only
- For RWLock, add reader count via bit packing
However, priority inheritance/donation handling differs by OS. macOS os_unfair_lock encodes the owner thread ID in 4 bytes so the kernel can immediately know the boost target. Linux records the owner only in the separate PI_futex mode (enabled via PTHREAD_PRIO_INHERIT) to support priority inheritance for RT tasks. Windows SRWLock and CRITICAL_SECTION don’t guarantee priority inheritance by default — we revisit this in Part 11 (deadlock and priority inversion).
The C++ standard library, Rust parking_lot, and Java j.u.c.locks are all built on top of these OS APIs.
Part 5: How Hardware Implements Locks
So far we’ve said “CAS happens atomically.” But how is that actually possible? A mechanism inside the CPU must make only one core pass through when multiple cores touch the same address at the same time. That mechanism is cache coherence, and atomic instructions ride on top of it.
CPU cache hierarchy — why multiple levels?
Modern CPUs don’t access memory directly. Multiple cache levels sit between cores and DRAM.
| Level | Size (per core / shared) | Access time | Owner |
|---|---|---|---|
| Register | ~32 | 0 cycle | core only |
| L1 D-cache | 32–48KB | 4–5 cycle | core only |
| L1 I-cache | 32–48KB | 4–5 cycle | core only |
| L2 | 256KB–1MB | 12–15 cycle | core only (usually) |
| L3 (LLC) | 4–64MB | 30–50 cycle | shared among same-socket cores |
| DRAM | GB | 100–300 cycle (200–400ns) | everyone |
| DRAM on other socket | GB | 200–600 cycle | NUMA |
The cache unit is the cache line, and on almost all modern x86/ARM systems it’s 64 bytes. When a CPU reads one byte from memory, it fetches all 64.
Hold on, let’s settle this. If each core has its own L1, can the same variable have different values across cores?
Solving exactly that problem is the cache coherence protocol. Multiple cores can hold copies of the same cache line, but the moment one core writes to that line, the others’ copies are invalidated or updated to prevent inconsistency. The most widely used protocols are MESI (Intel) and its MOESI variant (AMD).
MESI protocol
Each cache line is in one of four states.
| State | Meaning | Other cores |
|---|---|---|
| M (Modified) | only this core has it, differs from DRAM (dirty) | no copies |
| E (Exclusive) | only this core has it, matches DRAM (clean) | no copies |
| S (Shared) | multiple cores have the same value (clean) | same value present |
| I (Invalid) | this core’s copy is invalid | (it’s elsewhere) |
There’s just one core rule.
If a cache line is in M state, only that core has it.
When a write happens, that line in all other cores drops to I. To make this possible, cores exchange coherence messages.
| Message | Meaning |
|---|---|
| Read | “give me this line (for reading)” |
| Read-for-Ownership (RFO) | “give me this line (for writing)” — request to invalidate other cores’ copies |
| Invalidate | “discard your copy of this line” |
| Read-Response | “here’s that line” |
One-line summary — a 4-state machine (M/E/S/I) that tracks who owns a cache line and whose copy is still valid. When one core writes to a line, the others drop to I, and this is the mechanism atomic instructions use to manufacture “one at a time.”
The next visualization traces a single cache line moving between two cores across 7 steps. It’s a deep appendix, so it’s collapsed — expand when you need it.
▸ MESI state transitions in detail — every step of a cache line moving between cores
Read → DRAM → responseRead → Core 0 supplies dirty data, DRAM updatedRFO (Read-for-Ownership) → Core 0's copy invalidatedRFO → pulls dirty data from Core 1 and invalidates Core 1RFO → ping-pong continuesLOCK CMPXCHG instruction itself is ~10ns when the cache line is already in M state, but going through RFO makes it 30–100× more expensive. This is the real cost of lock contention.What an atomic instruction actually does
When LOCK CMPXCHG executes on x86:
- CPU fetches the target memory address’s cache line into Modified state (RFO message invalidates other cores’ copies)
- While the line is in M state — even if other cores send RFO simultaneously, the cache coherence mechanism serializes them — compare and swap proceeds
- The line stays in M or gets evicted at some later point
The point: atomic’s “atomicity” comes from the hardware fact that cache coherence serializes RFO requests. There isn’t a separate lock — the cache protocol’s invariant that a single line can only be in M state on one core at a time is the lock.
ARM is slightly different. ARM uses LL/SC (Load-Linked / Store-Conditional) pairs.
loop:
LDXR w0, [x1] ; load-exclusive, tracks address x1
CMP w0, w_expected
B.NE fail
STXR w2, w_desired, [x1] ; store-exclusive, fails if x1 was modified meanwhile
CBNZ w2, loop ; retry on failure
LL/SC’s advantage: immunity to part of the CAS ABA problem. Its drawback: failure is possible so a retry loop is needed.
cache line bouncing — the real cost of locks
What happens when two cores take turns grabbing the same lock?
Core A takes the lock → cache line in M state on A Core B tries CAS → cache line moves from A to B; A becomes I, B becomes M Core A takes the lock again → moves from B to A; B becomes I, A becomes M
This ping-pong is cache line bouncing. Cost per bounce:
| Scenario | Cost |
|---|---|
| Cores sharing same L3 (intra-socket) | ~30–50ns |
| Different sockets (NUMA, cross-socket) | ~150–300ns |
| Different NUMA nodes | ~hundreds to 1000ns |
The LOCK CMPXCHG instruction itself costs ~10ns when the line is already in M state. With bouncing, it becomes 30–100× more expensive. That’s the real cost of lock contention, and why the cache effect matters more than the lock itself.
Hold on, let’s settle this. If cache coherence guarantees consistency automatically, why do we still need locks?
They guarantee different things.
- Cache coherence promises “all copies of one cache line will eventually be consistent.” Per-memory-location.
- Lock promises “make a critical section spanning multiple memory locations behave as a single transaction.” Per-semantic-unit.
For example
account_a -= x; account_b += x;spans two cache lines. Coherence guarantees consistency of each line, but a lock guarantees that both changes appear together.
False sharing — unintentional cache line bouncing
Consider this struct.
1
2
3
4
struct Counters {
std::atomic<int> threadA_count; /* offset 0 */
std::atomic<int> threadB_count; /* offset 4 */
};
Thread A only touches its counter, thread B only touches its own. Logically there’s no shared data. But if the two atomics sit in the same 64-byte cache line, A’s write invalidates B’s line and vice versa. They ping-pong meaninglessly. This is false sharing.
The fix is cache line alignment.
1
2
3
4
struct Counters {
alignas(64) std::atomic<int> threadA_count; /* line 0 */
alignas(64) std::atomic<int> threadB_count; /* line 1 */
};
C++17 provides the std::hardware_destructive_interference_size constant to know the cache line size at compile time.
In game engine code, false sharing typically shows up in:
- per-thread stat arrays —
int hits[NUM_THREADS]with adjacent slots - producer/consumer ring buffer head/tail pointers
- arrays of small Job structures packed tightly
Measurement is hard. The CPU counter mem_load_uops_l3_hit_retired.xsnp_hitm (Intel) is one indicator that catches false sharing. Linux perf c2c automates this.
Memory barrier — the hardware side of ordering
Cache coherence makes values consistent, but ordering is a separate problem. CPUs reorder instructions, and store buffers briefly hold writes inside the issuing core.
1
2
3
4
5
6
7
/* Thread A */
data = 42;
ready = true; /* is the order of these two stores guaranteed to other cores? */
/* Thread B */
if (ready)
use(data); /* is data really 42? */
x86 has TSO (Total Store Order) so store-store reordering doesn’t occur, but ARM is weakly ordered and the above code can break. ARM needs a memory barrier (DMB ST) between the two stores.
One thing a lock promises us is this ordering. mutex.unlock() internally includes a release-semantics barrier, and mutex.lock() includes an acquire-semantics barrier, so values written inside the lock appear consistently outside.
| Barrier | x86 | ARM | Meaning |
|---|---|---|---|
| store-store | (automatic) | DMB ST | after prior stores complete, then later stores |
| load-load | (automatic) | DMB LD | after prior loads complete, then later loads |
| store-load | MFENCE | DMB SY | prior store becomes globally visible, then later loads |
| full | MFENCE | DMB SY | serialize all memory operations |
Part 12 handles this in depth. The point here is just that a lock is a bundle of atomic instructions + barriers.
Summary: what one lock causes
A mutex.lock() / mutex.unlock() pair traverses these steps, from user space to hardware cache:
send RFO message to other cores
bring that line into I state and fetch data (becomes M)
LOCK CMPXCHG (x86) · LDXR/STXR (ARM): try 0 → 1futex_wait / NtWaitForAlertByThreadId / __ulock_wait — register on wait queue and sleepprevent memory ops after the atomic from being reordered before it — ARM uses
DMB ISH or LDA*, x86 plain load sufficesprevent values written inside the lock from being reordered after the unlock store — ARM uses
DMB ISH or STL*, x86 plain store sufficesset lock variable to 0 (including waiters-bit check)
futex_wake / NtAlertThreadByThreadId / __ulock_wakeThat’s the bottom of locks. Now we see how game engines solve synchronization on top of all this.
Part 6: Unity Synchronization — Main Thread, Job System, DOTS
Game engines don’t use locks as is. One frame at 60fps is 16.67ms, and a single lock contention can cost 100–300ns to a few μs. A thousand of them eat 1ms before anything else. So engines adopt lock-avoiding structures. Two ideas are central:
- Thread affinity: certain data is touched by only one thread (then no lock is needed at all)
- Dependency-based parallelism: instead of locks, build a dependency graph that prevents read/write conflicts at scheduling time
Unity does exactly these two.
Main Thread model
All Unity MonoBehaviour callbacks — Update, LateUpdate, FixedUpdate, OnGUI, OnTriggerEnter, etc. — execute on a single thread, the Main Thread. And almost every Unity API (Transform.position, GameObject.Find, Component.GetComponent, etc.) can only be called from the main thread. Calling from another thread throws UnityException: ... can only be called from the main thread.
This is an enormous simplification. There isn’t a single lock in the entire scene graph — because all access is on one thread.
Hold on, let’s settle this. Is Unity’s main thread OS thread #1?
“#1” is meaningless from the OS perspective. The main thread is whichever thread is created first when the Unity process starts; from the OS’s view it’s a normal pthread/Win32 thread. The Unity runtime simply records this thread’s ID and uses it for
IsMainThread()checks. On macOS, the main thread is tied to NSRunLoop and runs alongside UI events.
The pitfall of the main thread model is one thing. Heavy work on the main thread is frame drops, period. That’s why Unity spins up worker threads separately and provides the abstraction called Job System on top.
Unity Job System
The Job System does two things at once.
- Parallel processing on native memory: doesn’t touch the managed heap, so GC-irrelevant
- Race detection at schedule time: dependency graph and NativeContainer safety check read/write conflicts. The attributes (
[ReadOnly]/[WriteOnly]/[NativeDisableContainerSafetyRestriction]) are read by the compiler, but the actual conflict check is a runtime check at the time ofSchedule()and is enabled only with theENABLE_UNITY_COLLECTIONS_CHECKSmacro (Editor/Development builds)
Basic usage:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
public struct ApplyVelocityJob : IJobParallelFor {
[ReadOnly] public NativeArray<float3> velocities;
[WriteOnly] public NativeArray<float3> positions;
public float deltaTime;
public void Execute(int index) {
positions[index] += velocities[index] * deltaTime;
}
}
void Update() {
var job = new ApplyVelocityJob {
velocities = velocityBuffer,
positions = positionBuffer,
deltaTime = Time.deltaTime,
};
JobHandle handle = job.Schedule(positionBuffer.Length, 64); /* batch=64 */
handle.Complete();
}
Schedule doesn’t execute the job immediately — it registers it in the dependency graph. JobHandle tracks the job’s completion.
Internal flow — where does one call go?
Tracing what Schedule() triggers, step by step:
- Compile time: IL2CPP/Burst reads the
[ReadOnly],[WriteOnly]attributes and marksvelocitiesas read-only,positionsas write-only - Schedule call (main thread):
- Creates a JobHandle and registers dependencies
- Copies the Job struct into unmanaged memory (so workers don’t touch the main heap)
- Inspects the NativeContainers’ AtomicSafetyHandle — if another in-flight job writes the same container, throws a runtime exception (not at compile time)
- Worker thread (
Unity Job Worker N):- Pulls a job from its own wait list (work-stealing deque)
- For
IJobParallelFor, slices the index range into batches (64 each) and distributes - Calls
Execute(index)for each batch
- Complete call (main thread):
handle.Complete()checks that all transitively dependent jobs are done in the dependency graph- If not done, the main thread also acts as a worker and steals jobs (work-stealing — main thread becomes a worker too)
How does data flow between cores?
Trace how positionBuffer is touched by a worker thread, in cache units.
- At Schedule time: the first 64 slots of
positionBuffer(16 bytes × 64 = 1024 bytes = 16 cache lines) are in the main thread CPU’s L1/L2 (touched in the previous frame). - Worker thread starts: worker N receives
Execute(0)~Execute(63). If the worker is on a different core, RFO messages are issued for the 64 slots’ cache lines. The lines move from the main thread’s L1 to the worker core’s L1 — at 30–50ns each intra-socket, 16 lines means ~600ns warmup for the first batch. - Batch execution: as the worker processes 64 sequentially, the cache lines stay in the worker core’s L1. With 16-byte slots, 4 per line, the line is touched 4 times in M state — cache hit.
- Next batch (64–127): another 16 lines move to the worker’s L1. The previous 16 lines get evicted or remain in worker’s L2/L3.
- At Complete time: when the main thread reads the results, all lines RFO back to the main CPU.
Two costs are visible:
- First batch warmup: cache lines moving from main → worker
- Result reclamation: lines moving back from worker → main
So if a Job’s input/output is small, the scheduling overhead itself can exceed the work cost. The batchCount parameter (64 above) of IJobParallelFor tunes this tradeoff. Too small means cache miss + dispatch overhead at every batch boundary; too large means load balancing breaks. Unity docs guide “16–64 is usually better than 1” for this reason.
The figure below shows the Job dependency DAG and worker mapping. The top is the dependency graph (logical order), the bottom is the actual worker thread mapping (time axis).
read input
read AABB
write Position
read Position
read Position
Schedule() time (Editor/Development builds only) catches container-permission conflicts and throws exceptions.NativeContainer and AtomicSafetyHandle
Unity catches races even on non-GC native memory. How?
NativeContainers like NativeArray<T>, NativeList<T>, NativeQueue<T> carry an internal AtomicSafetyHandle. It’s debug metadata enabled only in Editor and Development builds. Roughly:
1
2
3
4
struct AtomicSafetyHandle {
int version; /* has the container been disposed? */
AtomicSafetyNodePtr nodePtr; /* manages read/write reader lists */
}
The handle tracks:
- Jobs currently writing this container: 0 or 1 (if 1, no one else can even read)
- Jobs currently reading this container: N (writers blocked)
- Whether the container is alive (DisposeSentinel): throws immediately on access to a disposed container
At Schedule time, Unity compares the job’s [ReadOnly]/[WriteOnly] markings with the AtomicSafetyHandle state and throws an exception when it finds a possible conflict. It blocks the Schedule call itself.
1
2
3
4
5
6
var a = new NativeArray<int>(100, Allocator.TempJob);
var jobA = new WriteJob { data = a }.Schedule(); /* writes a */
var jobB = new ReadJob { data = a }.Schedule(); /* reads a — conflict! */
/* InvalidOperationException: The previously scheduled job WriteJob writes
to the NativeArray a. You must call JobHandle.Complete() on the job
before you can read from the NativeArray safely. */
The fix is to express the dependency:
1
2
3
var handleA = jobA.Schedule();
var handleB = jobB.Schedule(handleA); /* B after A */
handleB.Complete();
To bypass this, you add [NativeDisableContainerSafetyRestriction], but that’s a declaration of “I take responsibility for the race.” Use only when truly safe — e.g., two jobs with non-overlapping index ranges.
Hold on, let’s settle this. Does AtomicSafetyHandle work in production builds?
No. It’s enabled only in Editor and Development builds, gated by the
ENABLE_UNITY_COLLECTIONS_CHECKSmacro. Release builds strip all safety check code. This is not a static guarantee — a clean run in the Editor means for the code paths exercised at that time the safety system didn’t see a conflict, but it isn’t a proof that no race will occur under different inputs and different timing. So before shipping to production, you need tests that exercise as many code paths as possible, and[NativeDisableContainerSafetyRestriction]paths bypass the check entirely, so they require extra care.
Burst — how is atomic guaranteed?
In short — Burst compiles IL to native code via LLVM, but it does not turn ordinary array accesses into atomics. If you need atomic, you must call Interlocked.* explicitly, and only those get emitted as hardware atomic instructions. That’s the key point; the 4–5 steps below are compiler internals and are collapsed.
▸ Burst compile pipeline in detail — IL → native, atomic emit, NoAlias, SIMD mapping
A job tagged [BurstCompile] is compiled to native code instead of IL (LLVM backend). What Burst does when compiling NativeArray access:
- Ordinary
array[i]reads/writes are plain load/store — not atomic. Races are possible, and conflict prevention depends on schedule-time dependency + safety system - Only explicit atomic calls like
Interlocked.*are emitted directly as hardware atomic instructions like x86LOCK XADD/LOCK CMPXCHGor ARMLDADD/LDXR-STXR - Keeps bounds checks SIMD-friendly, or enables them only in the Editor
- Uses
[NoAlias]hints to assume ptr aliasing — enables more aggressive compiler optimization - Maps SIMD intrinsics (
Unity.Mathematics.float4) to SSE/AVX/NEON
1
2
3
4
5
6
7
8
9
10
11
12
[BurstCompile]
public struct CountJob : IJobParallelFor {
[NativeDisableContainerSafetyRestriction]
public NativeArray<int> counter; /* length 1, all jobs share the same slot */
public void Execute(int i) {
unsafe {
Interlocked.Increment(ref UnsafeUtility.As<int, int>(
ref counter.GetUnsafePtrReadOnly()[0]));
}
}
}
The moment Burst sees Interlocked.Increment, on x86 it emits directly as LOCK INC or LOCK XADD. That is, the hardware atomic instructions from Part 5 happen exactly as is. Cache line bouncing costs apply just the same.
Hold on, let’s settle this. What happens when jobs write to slots in the same cache line simultaneously?
That’s exactly false sharing. In the example above, with a length-1 counter, every job does atomic increment, so that 1 byte (actually 4 bytes, 1 line) ping-pongs between cores endlessly. A million calls vanish 50–100ms in cache bouncing alone. The fix is per-thread local counters on different cache lines, summed at the end —
[ThreadStatic]orNativeQueue<int>.Concurrent’s enqueue pattern.
DOTS / ECS — system-level dependencies
SystemBase or ISystem-based ECS takes the above mechanism one step further.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
public partial struct MoveSystem : ISystem {
public void OnUpdate(ref SystemState state) {
new MoveJob { dt = SystemAPI.Time.DeltaTime }
.ScheduleParallel();
}
}
[BurstCompile]
public partial struct MoveJob : IJobEntity {
public float dt;
void Execute(ref LocalTransform t, in Velocity v) {
t.Position += v.Value * dt;
}
}
IJobEntity automatically extracts read/write component access from the signature (ref = write, in = read). The source generator creates metadata for this info at compile time, and at runtime the World scheduler reads it and handles the following automatically:
- Two systems writing the same Component get automatic ordering
- Systems touching different Components automatically run in parallel
This is the most extreme example of resolving synchronization not with locks but with a dependency graph. The programmer never locks, but the scheduler checks read/write permissions at Schedule time and serializes conflicts (Editor/Development builds get additional safety system checks).
Internally, ECS maintains dependency info per Component with ReaderWriterLock semantics. Combined with Burst, it looks like the diagram below.
ref=write, in=read) and serializes only read-after-write conflicts. Result: work touching different Components runs in parallel freely; touching the same Component gets automatic ordering.The dependency where RenderBoundsJob waits for MoveJob to complete is automatically inferred by ECS without the developer specifying it. Both jobs touch LocalTransform.Position, MoveJob writes and RenderBoundsJob reads, so the ECS scheduler automatically adds the dependency edge.
One frame’s memory flow — summary
Synchronization-related events Unity triggers within a single frame (16.67ms), in time order:
| Time | Location | What happens |
|---|---|---|
| 0ms | Main | Input/MonoBehaviour Update — main thread only, no locks |
| 2ms | Main | Schedule jobs, build dependency graph |
| 2~10ms | Worker 1~N | Workers execute jobs in dependency order; NativeArray cache lines RFO-move to worker cores |
| 10ms | Main | JobHandle.Complete() or sync point — main also acts as a worker via work-stealing |
| 12ms | Render thread | submit command buffer to GPU (next section) |
| 16ms | GPU | present, next frame begins |
That covers Unity. Unreal has a similar philosophy but with stronger thread separation.
Part 7: Unreal Engine Synchronization
Unreal has a more explicit multi-thread model than Unity. The engine itself is split into multiple named threads, and inter-thread communication is built almost entirely on lock-free queues.
Four Named Threads
A typical Unreal game has these threads always running:
| Thread | Responsibility | OS thread |
|---|---|---|
| Game Thread | Tick, gameplay logic, Blueprint, AI | main thread (Unity’s main equivalent) |
| Render Thread | high-level render command generation (FRDG, RHI command authoring) | separate OS thread |
| RHI Thread | GPU API calls (D3D12/Vulkan/Metal/Mantle) | separate OS thread |
| Audio Thread | sound mixing, voice management | separate OS thread |
| Worker Threads | TaskGraph job execution | one per core |
The core idea is each thread holds data only it touches, and inter-thread communication goes through explicit command queues. Instead of locking, data ownership is bound firmly to a thread.
Game Thread → Render Thread — one frame’s flow
Unreal’s Render Thread usually runs one frame behind the Game Thread (Epic’s docs describe it as 0 or 1 frame behind — if Game finishes fast Render can catch up; if Render is heavy, Game blocks on sync next frame). This article describes the “one frame behind” case under normal load. When Game runs frame N’s logic, Render builds frame N-1’s render commands, and RHI submits frame N-2 to the GPU.
Tick, AI, Physics, Blueprint
RDG build, FMeshBatch generation
D3D12/Vulkan/Metal API
actual pixels displayed
Thanks to this pipelining, almost no lock is needed. Only one thread touches the same data at any moment.
TaskGraph — Unreal’s Job System
FTaskGraphInterface is the Unreal equivalent of Unity Job System.
1
2
3
4
5
6
7
8
9
10
11
12
FGraphEventRef MyTask = FFunctionGraphTask::CreateAndDispatchWhenReady(
[Data]() {
// executes on a worker thread
ProcessData(Data);
},
TStatId(),
nullptr, /* dependency prerequisite */
ENamedThreads::AnyThread /* which thread to run on */
);
/* wait for the result later */
FTaskGraphInterface::Get().WaitUntilTaskCompletes(MyTask);
Features:
- Named thread selection: dispatch to
GameThread,RenderThread,RHIThread, orAnyThread - Dependency edge expression: pass a prerequisite array to start after they complete
- Automatic work-stealing: AnyThread is picked up by the least busy worker
- Hierarchical tasks: spawn child tasks from within a task
ENQUEUE_RENDER_COMMAND — a command queue with no visible lock
The most common macro for passing data from Game Thread to Render Thread.
1
2
3
4
5
6
7
8
FVector NewPos = Actor->GetActorLocation();
FRHICommandListImmediate& RHICmdList = ...;
ENQUEUE_RENDER_COMMAND(UpdateActorPos)(
[NewPos](FRHICommandListImmediate& RHICmdList) {
/* this lambda executes on the Render Thread */
UpdateConstantBuffer(RHICmdList, NewPos);
});
What this macro means: wrap the lambda into a command object, enqueue it on the Render Thread’s command queue, and the Render Thread pulls in FIFO order to execute. Epic’s docs don’t explicitly guarantee the internal queue is lock-free MPSC (a version-dependent implementation detail), and the macro’s contract is just “ordered + execute on Render Thread”. Conceptually it’s a multi-producer (workers can enqueue too) single-consumer (Render Thread only pops) pattern, which is generally a natural fit for lock-free MPSC.
Hold on, let’s settle this. How does a lock-free MPSC queue work without locks?
The key is the asymmetry: producers append to the queue tail via atomic CAS or atomic exchange, and the consumer is single-threaded so no synchronization is needed. The most famous design is the Vyukov MPSC queue — one atomic exchange grabs the prev tail, then updates prev tail’s next pointer to the new node. No retry loops, almost zero contention.
The Vyukov MPSC core looks like this.
1
2
3
4
5
6
7
8
9
struct Node { std::atomic<Node*> next; T payload; };
std::atomic<Node*> tail;
void push(Node* node) {
node->next.store(nullptr, std::memory_order_relaxed);
Node* prev = tail.exchange(node, std::memory_order_acq_rel);
prev->next.store(node, std::memory_order_release);
}
/* pop is single-consumer, almost no atomics needed */
This pattern underlies nearly every inter-thread communication in Unreal. That’s why ENQUEUE_RENDER_COMMAND, called hundreds of times per frame, has no lock contention.
FRenderCommandFence — when Game must wait for Render
Sometimes Game Thread needs to know “the commands sent to Render Thread really finished.” For example, to destroy a GPU resource safely, you must ensure the Render Thread no longer touches it.
1
2
3
FRenderCommandFence Fence;
Fence.BeginFence(); /* mark all render commands up to this point */
Fence.Wait(); /* block until marked commands all finish */
BeginFence enqueues a fence marker into the Render Thread’s queue. Wait makes Game Thread sleep on an FEvent until the fence is processed. This is essentially the only explicit synchronization point between Game ↔ Render.
FCriticalSection / FRWLock — explicit locks
Of course explicit locks are sometimes needed. Unreal provides:
FCriticalSection: abstraction over WindowsCRITICAL_SECTION(pthread_mutex on others). Standard mutex.FRWLock: Reader-Writer lock. On macOS mapped to pthread_rwlock instead ofos_unfair_lock.FScopeLock: RAII helper (equivalent tostd::lock_guard).TQueue<T, EQueueMode::Spsc>: lock-free single-producer single-consumer queue.TQueue<T, EQueueMode::Mpsc>: lock-free multi-producer single-consumer.
Engine code itself prefers lock-free queues and TaskGraph dependencies; FCriticalSection appears occasionally in gameplay code (on top of the gameplay framework).
Unity vs Unreal
| Item | Unity | Unreal |
|---|---|---|
| Main thread | one, all APIs go through it | Game Thread, gameplay only |
| Render separation | yes (Render Thread, fewer explicit APIs) | strong (Render Thread + RHI Thread + RDG) |
| Job abstraction | Job System + DOTS | TaskGraph + Async Tasks |
| Compile-time race detection | NativeContainer + AtomicSafetyHandle | none (runtime asserts) |
| Lock-avoidance philosophy | dependency graph + main thread affinity | named threads + lock-free queues |
| ECS | DOTS / Entities (official) | Mass Entity (5.x), various unofficial |
Unity goes for making dependencies expressible safely and validating at schedule time (in Editor/Development) (safety on by default). Unreal goes for binding data to threads and communicating through command queues (performance by convention). Neither locks much, but the reasons and verification timing differ.
Part 8: Game Engine Patterns That Avoid Locks
Since locks are barely used inside engines, patterns that bypass them have evolved. Patterns you can use directly in game code:
Double Buffering
The simplest and most common. Alternate between two buffers. But to use it safely you need a guarantee that read and write don’t overlap in time — usually a frame boundary or fence.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
/* Assumption: GameTick and PhysicsTick are called sequentially at frame boundary.
Within one frame, the worker starts the next write only after read finishes. */
struct PhysicsState {
std::vector<Transform> transforms;
};
PhysicsState buffers[2];
std::atomic<int> readIdx{0}; /* main reads */
/* Frame N — Worker picks the next buffer to write (read-finished guarantee required) */
void PhysicsTick() {
int w = 1 - readIdx.load(std::memory_order_acquire);
UpdatePhysics(buffers[w]);
readIdx.store(w, std::memory_order_release); /* publish */
}
/* Frame N — Main reads the latest published buffer */
void GameTick() {
int r = readIdx.load(std::memory_order_acquire);
Render(buffers[r]);
}
No lock anywhere. A single atomic swaps the “currently readable buffer index.” Memory cost is 2× the buffer, but contention is zero. However, this code isn’t safe if writer and reader run independent asynchronous loops. Right after publish, while reader still reads that buffer, if worker tries to write a different buffer in the next tick, the index could collide. Game engines typically design sync at frame boundaries so read and write phases are time-separated — only with that assumption is double buffer safe. Otherwise you need triple buffer or explicit sequence handoff.
Where this pattern appears in game engines:
- Physics → Render: swap buffer at frame boundary, render thread draws N while physics prepares N+1
- AI tick: compute next-frame behavior, swap at frame boundary
- Network input: collect packets during a frame, swap at frame boundary
Triple Buffering
When writer and reader run independent loops and read/write can overlap in time, double buffer isn’t enough. Three buffers handle it — “currently drawing,” “just made,” “next to write.” Used in OS graphics stack’s swap chain, V-sync queue, game engine ring buffers, etc. With three buffers, the writer always finds a slot the reader isn’t using.
Lock-free SPSC Ring Buffer
A single-producer single-consumer queue is implementable without locks using just two atomics.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
template<typename T, size_t N>
struct SpscRing {
T buf[N];
alignas(64) std::atomic<size_t> head{0}; /* producer writes */
alignas(64) std::atomic<size_t> tail{0}; /* consumer writes */
bool push(const T& v) {
size_t h = head.load(std::memory_order_relaxed);
size_t next = (h + 1) % N;
if (next == tail.load(std::memory_order_acquire))
return false; /* full */
buf[h] = v;
head.store(next, std::memory_order_release);
return true;
}
bool pop(T& out) {
size_t t = tail.load(std::memory_order_relaxed);
if (t == head.load(std::memory_order_acquire))
return false; /* empty */
out = buf[t];
tail.store((t + 1) % N, std::memory_order_release);
return true;
}
};
alignas(64) keeps head and tail on different cache lines — preventing false sharing. Even on different cores, producer and consumer have no contention.
This structure underlies game ↔ render command queues, audio sample buffers, log queues, and almost any 1:1 communication.
Per-thread accumulation
When multiple threads must increment a counter, false sharing is a common trap. The fix is each thread holds its own slot, summed at the end.
1
2
3
4
5
6
7
8
9
struct alignas(64) Slot { int64_t v = 0; };
Slot per_thread[NUM_THREADS];
/* each thread */
per_thread[tid].v++; /* plain store, not atomic */
/* summation (single thread) */
int64_t total = 0;
for (auto& s : per_thread) total += s.v;
Thanks to alignas(64), each slot owns its own cache line. A million increments are 50–100× faster than a false-sharing lock-free counter.
Frame-locked sync — explicit sync points
Instead of taking locks constantly per job, sync only at frame boundaries. Within a frame, one thread touches its own data, and main integrates all thread results at frame end.
Naughty Dog’s fiber system (see Part 9) is the extreme of this idea — split a frame into thousands of fibers, but advance to the next frame only at a sync point where all fibers complete.
Summary: the lock-avoiding mindset
Common traits of patterns used instead of locks inside engines:
- Bind data to a thread — when one thread owns its data, no lock needed
- Explicit communication channels — lock-free queues for inter-thread data transfer
- Time separation — double/triple buffer to separate read and write in time
- Spatial separation — per-thread slots to avoid false sharing
- Sync rarely — concentrate cost at natural sync points like frame boundaries
Locks themselves aren’t bad — lock contention and cache bouncing are expensive, and the patterns above naturally avoid both.
Part 9: The Cost of Locks — Measuring
Measure, don’t guess. Tools for confirming whether a lock is really expensive and where the cost comes from.
Linux — perf and perf c2c
1
2
3
4
5
6
7
8
9
# syscall (futex_wait/wake) frequency
$ perf stat -e syscalls:sys_enter_futex ./game
# cache miss, hitm (modified line hit on another core)
$ perf stat -e mem_load_uops_l3_hit_retired.xsnp_hitm ./game
# false sharing detection (Cache-to-Cache analysis)
$ perf c2c record ./game
$ perf c2c report
perf c2c is the standard tool for false sharing diagnosis. Frequent HITM (Hit Modified) events on the same cache line make that line a prime suspect.
Windows — Concurrency Visualizer, ETW
Visual Studio’s Concurrency Visualizer shows per-thread CPU usage, lock contention blocks, and I/O waits. WPA’s “Wait Analysis” page gives the same info in more detail.
1
2
3
4
# Track lock contention
wpr -start LockHeldTimes -filemode
# (run game)
wpr -stop trace.etl
macOS — Instruments System Trace
The “Thread State” track in the System Trace template visualizes thread blocking. “Pthread mutex contention” markers show separately, telling you which mutex is contended.
1
2
# Or measure on the fly via dtrace
$ sudo dtrace -n 'pid$target:libsystem_pthread:_pthread_mutex_lock:entry {@[ustack()]=count();}' -p <pid>
Cross-platform — Tracy Profiler
Tracy provides macros that directly track mutex usage.
1
2
3
4
5
6
7
8
9
10
#include "Tracy.hpp"
#include "TracyLock.hpp"
TracyLockable(std::mutex, m);
void DoWork() {
std::lock_guard<LockableBase(std::mutex)> lk(m);
ZoneScoped;
/* ... */
}
Lock/unlock times and contention durations of mutexes wrapped with LockableBase are visualized on Tracy’s timeline. You can immediately see which mutex is hot.
Built-in engine profilers
- Unity Profiler: Job System tab shows worker thread utilization and dependency wait time
- Unreal Insights: visualizes TaskGraph, fence wait time, ENQUEUE_RENDER_COMMAND call frequency
- PIX (Xbox/PC): shows D3D12 fence wait and RHI thread blocking
One-line diagnosis
“My game is slow — is the lock the cause?” The one-line answer is “if spinning or waiting shows up in the thread state visualization, that’s the cost.” Without even looking at code, looking only at the thread state graph tells you immediately whether a lock is really the problem.
Conclusion
Pulling together everything from this part:
The nature of race condition:
counter++decomposes into load/modify/store, breaking atomicity- data race (undefined behavior in C++/Rust) and race condition (execution-order dependent) are different concepts
- Locks guarantee atomicity, visibility, and ordering together
The lock family:
- Mutex, Spinlock, Semaphore, RWLock, CondVar, Monitor, Barrier, Latch
- Just different policies over the same atomic foundation
Lock construction:
- Peterson works in theory, real CPUs need memory barriers
- Hardware atomics: x86 LOCK CMPXCHG, ARM LDXR/STXR
- Spin vs Sleep compares critical section length against context-switch cost
OS primitives:
- Linux futex (2002), Windows SRWLock (Vista, 2007), macOS os_unfair_lock (2016)
- All share the same idea: fast path in user-space atomic, slow path only enters kernel
- macOS os_unfair_lock encodes owner thread ID for QoS donation
Hardware mechanism:
- CPU cache: L1 (4 cycle) ~ L3 (50 cycle) ~ DRAM (300 cycle)
- MESI: a cache line can be in Modified state on only one core
- Atomicity comes from cache coherence serializing RFO
- Cache line bouncing: intra-socket 30–50ns, cross-socket 150–300ns
- False sharing: different variables on the same line cause unintended contention
Unity synchronization:
- Main thread model: all Unity APIs main only
- Job System: JobHandle dependency graph, batch 64, work-stealing
- NativeContainer + AtomicSafetyHandle: schedule-time runtime race check (Editor/Development only)
- Burst: ordinary reads/writes are plain load/store; only
Interlocked.*calls emit hardware atomics - DOTS: automatically analyzes Component read/write, scheduler adds dependency edges at schedule time
Unreal synchronization:
- Game / Render / RHI / Audio Thread separation
- TaskGraph + Named Thread + dependency
- ENQUEUE_RENDER_COMMAND works as a lock-free MPSC queue (conceptually)
- FRenderCommandFence is the only explicit Game ↔ Render sync point
Lock-avoiding patterns:
- Double/Triple buffer, lock-free SPSC ring, per-thread slot
- Frame-locked sync concentrates cost at one point
In the next part, Part 11 — Deadlock and Starvation, we cover cases where locks work fine but the program still hangs — circular waits between two locks, priority inversion, livelock. After that, Part 12 covers memory model and atomic ordering, and Part 13 goes deeper into lock-free data structures and Unity Job System internals.
Stage 2’s original question can now be answered.
Why does a program with two threads writing the same variable die only sometimes?
- “Write” decomposes into 3 steps (load/modify/store), so another thread can interleave anywhere in the middle
- “Interleaving” is decided by the scheduler at arbitrary moments, so “sometimes”
- “To prevent it,” we need the lock abstraction built on cache coherence and atomic instructions
References
Textbooks
- Herlihy, M., Shavit, N. — The Art of Multiprocessor Programming, 2nd ed., Morgan Kaufmann, 2020 — Ch.2~7 (Mutex algorithms, lock-free, hardware foundations) — the canonical text on multithread synchronization
- Silberschatz, A., Galvin, P. B., Gagne, G. — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.6 (Synchronization Tools), Ch.7 (Synchronization Examples)
- Tanenbaum, A. S., Bos, H. — Modern Operating Systems, 4th ed., Pearson, 2014 — Ch.2.3 (Interprocess Communication)
- Russinovich, M., Solomon, D., Ionescu, A. — Windows Internals, 7th ed., Microsoft Press, 2017 — Ch.8 (System Mechanisms, SRWLock / pushlock internals)
- Singh, A. — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — Ch.10 (Mach IPC, locks)
- Drepper, U. — What Every Programmer Should Know About Memory, Red Hat, 2007 — the definitive primer on cache coherence — people.freebsd.org/~lstewart/articles/cpumemory.pdf
- Gregory, J. — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8.6~8.7 (Multithreading, Job systems)
- McKenney, P. E. — Is Parallel Programming Hard, And, If So, What Can You Do About It?, 2024 ed. — kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html — free book by RCU’s author
Papers
- Peterson, G. L. — “Myths About the Mutual Exclusion Problem”, Information Processing Letters, 1981 — Peterson’s original paper
- Dijkstra, E. W. — “Cooperating Sequential Processes”, Programming Languages, 1968 — introduction of the semaphore
- Lamport, L. — “A New Solution of Dijkstra’s Concurrent Programming Problem”, CACM, 1974 — bakery algorithm
- Lamport, L. — “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, IEEE TC, 1979 — definition of sequential consistency
- Franke, H., Russell, R., Kirkwood, M. — “Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux”, OLS 2002 — futex introduction — kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf
- Drepper, U. — “Futexes Are Tricky”, Red Hat, 2011 — pitfalls of futex implementation — akkadia.org/drepper/futex.pdf
- Sweeney, T. et al. — “Concurrent Programming in Unreal Engine” (GDC, EpicGames Dev) — TaskGraph design
- Boehm, H.-J. — “Threads Cannot Be Implemented as a Library”, PLDI 2005 — motivation for C++ memory model
- Adve, S. V., Gharachorloo, K. — “Shared Memory Consistency Models: A Tutorial”, IEEE Computer, 1996 — memory model comparison
Official documentation
- Linux man pages —
futex(2),futex(7),pthread_mutex_lock(3),pthread_rwlock_rdlock(3)— man7.org/linux/man-pages/man2/futex.2.html - Linux Kernel Documentation —
Documentation/locking/futex2.rst,mutex-design.rst,lockdep-design.rst - Microsoft Docs — Slim Reader/Writer (SRW) Locks, Critical Section Objects — learn.microsoft.com/en-us/windows/win32/sync/slim-reader-writer–srw–locks
- Apple Developer — Threading Programming Guide,
os_unfair_lock(3)— developer.apple.com/documentation/os/os_unfair_lock - Intel — Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3A — Ch.8 (Multiple-Processor Management), LOCK prefix
- ARM — ARM Architecture Reference Manual ARMv8-A, B2 (Memory Model), C6 (Load-Acquire / Store-Release)
Unity official
- Unity Manual — C# Job System — docs.unity3d.com/Manual/JobSystem.html
- Unity Manual — Native Containers — docs.unity3d.com/Manual/JobSystemNativeContainer.html
- Unity Manual — Burst Compiler — docs.unity3d.com/Packages/com.unity.burst@latest
- Unity Manual — Entities (DOTS) — docs.unity3d.com/Packages/com.unity.entities@latest
- Joachim Ante — C# Job System and ECS — Unite LA 2018 — Job System design talk
- Lucas Meijer — On DOTS: Entity Component System — Unity Blog, 2019
Unreal official
- Epic Games — Threading in Unreal Engine — dev.epicgames.com/documentation/en-us/unreal-engine/threading-in-unreal-engine
- Epic Games — Task Graph System — dev.epicgames.com/documentation/en-us/unreal-engine/the-task-graph
- Epic Games — Rendering and the Game Thread — explanation of RDG and ENQUEUE_RENDER_COMMAND
- Tim Sweeney — The Next Mainstream Programming Language, POPL 2006 — Unreal’s multithreading vision
Game development / GDC
- Gyrling, C. — Parallelizing the Naughty Dog Engine Using Fibers, GDC 2015 — fiber-based sync — gdcvault.com/play/1022186
- Schreiber, B. — Multithreading the Entire Destiny Engine, GDC 2015 — Bungie’s lock-free design
- Boulton, M. — Threading the Frostbite Engine, GDC 2009 — DICE’s Job system
- Reinders, J., Roberts, B. — Multithreading for Visual Effects, A K Peters, 2014 — VFX engine lock-free patterns
- Vyukov, D. — Lock-Free / 1024cores — Vyukov MPSC, scalability resources — 1024cores.net
Blogs / articles
- Preshing, J. — Preshing on Programming — atomic, memory ordering series — preshing.com
- Howells, D. et al. — Linux Kernel Memory Barriers (
memory-barriers.txt) — official kernel memory model guide - Chen, R. — The Old New Thing — Windows critical section/SRWLock retrospectives
- Giesen, F. — Reading List on Multithreading — fgiesen.wordpress.com
- Oakley, H. — The Eclectic Light Company — macOS os_unfair_lock, QoS observations
- Bonzini, P. — “QEMU and lock-free RCU” — practical RCU application
Tools
- Linux:
perf c2c,perf lock,bpftrace,lockstat - Windows: Concurrency Visualizer (VS), WPA Wait Analysis, PIX (for games)
- macOS: Instruments System Trace,
dtrace,sample - Cross-platform: Tracy Profiler (LockableBase), Intel VTune, AMD μProf
- ThreadSanitizer (TSan): GCC/Clang’s static & dynamic data race detector
This is a long article. You don't have to read it in one sitting.
- First time through — Part 1 (race condition) → Part 2 (lock family) → Part 8 (patterns that avoid locks) gives you the whole picture
- For Stage 2's central answer — "why only sometimes" — go all the way to Part 5 (hardware and MESI)
- If you want the engine internals — Part 6 (Unity) and Part 7 (Unreal) are the main course
- OS implementation when you need it — Part 3 (how locks are built), Part 4 (futex/SRWLock/os_unfair_lock) work well as references
The two deepest sections — MESI state transitions and Burst's compile pipeline — are collapsed. Expand them only when you need them.
