CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
- CS Roadmap (0) — Why CS Knowledge Matters More Than Ever in the AI Era
- CS Roadmap (1) — Arrays and Linked Lists: Reading the Terrain of Memory
- CS Roadmap (2) — Stack, Queue, Deque: Powerful Abstractions Born from Restriction
- CS Roadmap (3) — Hash Tables: Conditions and Limits of O(1) Lookup
- CS Roadmap (4) — Trees: Order, Balance, and Guaranteeing O(log n)
- CS Roadmap (5) — Graphs: The Network of Relationships, the Science of Paths
- CS Roadmap (6) — Memory Management: Stack & Heap, GC, and the Things That Eat Your Frames
- CS Roadmap (Bonus) — Heaps and Priority Queues: The Economics of Partial Order
- CS Roadmap Part 7 — OS Architecture: The Forking Paths of Unix, NT, and XNU
- CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
- CS Roadmap Part 9 — Scheduling: Whom Does the OS Give the CPU To?
- A process is "an isolated address space plus a bundle of resources," and a thread is "a flow of execution inside a process." Threads share code, heap, and globals but keep the stack and registers private
- Unix creates a process in two steps: fork() clones the parent, then exec() overwrites the clone with a new program. Windows' CreateProcess() builds a new process in one shot. The clone looks expensive, but Copy-on-Write makes it actually fast
- Thread models split into 1:1 (Linux NPTL, Windows), N:1 (green threads), M:N (Go goroutines, Erlang) — each trades performance against implementation complexity differently
- A context switch doesn't just save/restore registers; it also causes TLB flushes and cache pollution. Modern game engines therefore move toward "break work into Jobs/TaskGraph/Fibers and distribute over cores" rather than "spawn more threads"
Introduction: From the Map to the Body
The previous post surveyed the lineage and skeleton of the three operating systems: Linux's monolithic kernel, Windows NT's hybrid design, and macOS's XNU with its Mach + BSD dual structure. If that was the map, this post is the body of the journey.
Let’s bring Stage 2’s key question back.
“When two threads use the same variable, why does the program crash only sometimes?”
To answer this, we first need to know “what a thread is” precisely. And to understand threads, we need to first understand their parent concept, the process. The distinction between processes and threads, how the two share and separate memory, and how the OS abstracts all of this — these are the starting points of every concurrency problem.
What we cover in this post:
- Processes: PCBs and address-space layouts. Linux's `task_struct`, Windows' `EPROCESS`, macOS's `proc`/`task`
- Process creation: Unix's two-step `fork()`+`exec()` model, Windows' single-call `CreateProcess()`, and Copy-on-Write
- Threads: why processes alone aren't enough, TCBs, shared vs. private regions, TLS
- Thread mapping models: 1:1, N:1, M:N — why Go’s goroutines are so cheap
- Context switching: the real cost of registers, TLBs, and caches
- Game engine execution models: Unity’s main thread, Unreal’s TaskGraph, Naughty Dog’s fibers
We keep the game-dev lens throughout, but this post carries more foundational theory than previous ones. The next post (scheduling) and the one after (synchronization) are built on top of it.
Part 1: Processes — The Execution Unit the OS Sees
What Is a Process?
Start with the textbook definition. A process is a program in execution. The .exe file on disk or the Mach-O binary is a program; the instance of it loaded into memory and running on the CPU is a process.
A process owns:
- A unique address space — memory isolated from other processes
- Execution state — CPU register values, program counter
- An open-files table — the list of file descriptors currently in use
- Ownership info — UID, GID, permissions
- Parent-child relationships — who created whom (the process tree)
The OS manages all of this via a single struct. That’s the PCB (Process Control Block), also called the process descriptor.
The PCB in Practice — Per-OS Structs
Linux — task_struct
In the Linux kernel, processes (and threads) are represented by struct task_struct. It’s defined in include/linux/sched.h and is a huge struct with hundreds of fields.
```c
/* Linux kernel task_struct (kernel 6.x, heavily simplified) */
struct task_struct {
    /* State */
    unsigned int __state;           /* TASK_RUNNING etc. */

    /* Identifiers */
    pid_t pid;                      /* process id */
    pid_t tgid;                     /* thread group id */
    struct task_struct *parent;     /* parent process */
    struct list_head children;      /* child list */

    /* Memory */
    struct mm_struct *mm;           /* address space */

    /* Files */
    struct files_struct *files;     /* open files table */

    /* Scheduling */
    int prio;
    struct sched_entity se;         /* CFS scheduling entity */

    /* signals, resource limits, and hundreds more... */
};
```
The real struct is over 700 lines. Crucially, in Linux a process and a thread share the same struct. We come back to this peculiarity later.
Windows — EPROCESS, KPROCESS
Windows NT splits across two layers:
- `KPROCESS` (Kernel Process Block) — minimal scheduling-related info
- `EPROCESS` (Executive Process Block) — wraps `KPROCESS` and adds more
```c
/* Conceptual pseudocode — see WinDbg or leaked NT sources for the real thing */
typedef struct _EPROCESS {
    KPROCESS   Pcb;                  /* kernel process block (inherited) */
    HANDLE     UniqueProcessId;      /* PID */
    LIST_ENTRY ActiveProcessLinks;   /* global process list */
    PVOID      SectionBaseAddress;   /* image load address */
    PVOID      Token;                /* security token */
    /* ... */
} EPROCESS;
```
macOS — proc + task
macOS’s dual structure shows up here too. The BSD layer holds the classic Unix struct proc; the Mach layer holds struct task.
```c
/* BSD side — bsd/sys/proc_internal.h */
struct proc {
    pid_t        p_pid;      /* POSIX process ID */
    struct proc *p_pptr;     /* parent */
    struct task *task;       /* link to Mach task */
    /* ... */
};

/* Mach side — osfmk/kern/task.h */
struct task {
    queue_head_t threads;    /* threads belonging to this task */
    vm_map_t     map;        /* address space */
    ipc_space_t  itk_space;  /* Mach port space */
    /* ... */
};
```
So when fork() creates a process on macOS, a BSD proc and a Mach task are created as a pair. Unix tools (ps, top) look at the proc; Mach-based tools (lldb, Instruments) look at the task.
Process Address-Space Layout
How is a process’s memory laid out? Here’s the classic Unix/Linux 32-bit layout.
A tour of each region (from low address upward):
- Text (`.text`): executable machine code. Read + execute only; writes cause a segfault
- Read-only data (`.rodata`): string literals (`"Hello"`), constant arrays. Read-only
- Data (`.data`): initialized globals and statics (`int x = 42;`). The initial values sit in the file
- BSS (Block Started by Symbol): zero-initialized globals (`int x;`, `static char buf[1024];`). The file only records the size; the OS zeroes out the memory at execution — a trick to shrink the binary on disk
- Heap: dynamic allocations (`malloc`, `new`). Grows upward via the `brk()` syscall
- Shared library region (mmap): `libc.so`, `libstdc++.so`, etc. are mapped here via `mmap()`
- Stack: call frames, locals, return addresses. Grows downward
- Kernel space: kernel code and data. User processes have no direct access. On 32-bit Linux it's the top 1 GB; on x86-64 it's the top half
Windows uses different section names in PE but the structure is nearly the same (.text, .data, .rdata, .bss).
Process States
A process moves through multiple states. The standard model from Silberschatz:
- New: process just created
- Ready: runnable but waiting for a CPU
- Running: actually executing on a CPU
- Waiting (or Blocked): waiting for I/O or an event
- Terminated: exited
Real OSes have many more states. Linux’s task_struct has TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_TRACED, TASK_DEAD, TASK_WAKEKILL, TASK_WAKING, TASK_PARKED, and more. The letters S, R, D, Z you see in ps are these states.
```shell
$ ps aux
USER   PID  %CPU %MEM  COMMAND
root     1   0.0  0.1  /sbin/init      <- S (sleeping)
www   1234   2.1  1.5  nginx: worker   <- R (running)
root  5678   0.0  0.0  [kworker/u8:2]  <- D (uninterruptible sleep)
```
D state (uninterruptible sleep) matters for game developers too — it means waiting on disk I/O or a driver request, and even kill -9 doesn’t work in this state. A lot of “unresponsive processes” are stuck in D.
Part 2: Process Creation — fork, exec, CreateProcess
Now for how processes are created. This is where the philosophical differences among the three OSes become sharpest.
Unix: fork() + exec() — The Two-Step Model
Unix’s idea is “duplicate the parent, then overwrite.”
```c
#include <stdio.h>      /* perror */
#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();   /* step 1: clone yourself */
    if (pid == 0) {
        /* child */
        execl("/bin/ls", "ls", "-l", NULL);  /* step 2: overwrite with a new program */
        /* not reached if execl succeeds */
    } else if (pid > 0) {
        /* parent */
        int status;
        waitpid(pid, &status, 0);   /* wait for the child */
    } else {
        perror("fork failed");
    }
    return 0;
}
```
A single call to fork() returns twice. It returns the child’s PID to the parent and 0 to the child. An odd API.
What fork() does (naive implementation):
- Create a new PCB (`task_struct`)
- Copy the parent's entire address space (text, data, heap, stack — all of it)
- Copy the open file descriptors too
- Assign a new PID to the child
- Put the child on the ready queue
Step 2 is the problem. When a process’s address space is hundreds of MB, copying it every time is hugely expensive. And if exec() is called right after fork(), the address space is overwritten anyway — you copied only to throw away.
Copy-on-Write — “Actually Copy Only On Write”
The answer is Copy-on-Write (COW). At fork() time, only the page tables are copied, and the actual memory pages are shared between parent and child — but marked read-only.
When either side tries to write to a page, the hardware raises a page fault, and only then does the OS copy that one page.
COW requires hardware support — the CPU’s MMU (Memory Management Unit) must enforce per-page protection and raise page faults, otherwise the OS has no hook to intervene. Page-level MMU is the foundation for nearly every modern-OS trick (COW, swap, mmap, shared memory).
Windows: CreateProcess() — A Single Call
Windows took a different path. There is no parent-cloning concept; it builds a new process from scratch.
```c
#include <windows.h>

int main() {
    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi;
    BOOL ok = CreateProcess(
        "C:\\Windows\\System32\\notepad.exe",  /* executable */
        NULL,                                  /* command line */
        NULL, NULL,                            /* process/thread security */
        FALSE,                                 /* inherit handles? */
        0,                                     /* creation flags */
        NULL, NULL,                            /* environment, working dir */
        &si, &pi);
    if (ok) {
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
    }
    return 0;
}
```
Unix’s fork() takes no parameters; CreateProcess() takes ten. That’s the Windows philosophy of “stuff every configurable option for process creation into one function.”
Trade-offs:
| Aspect | Unix fork()+exec() | Windows CreateProcess() |
|---|---|---|
| API complexity | Two steps, each simple | One step, many parameters |
| Process creation cost | Very cheap via COW | Relatively expensive |
| Shell implementation | Natural (fork → set up redirections → exec) | Needs a separate API like ShellExecute |
| Security | Parent handles inherit automatically (error-prone) | Inheritance is explicit |
| Flexibility | Arbitrary code between fork and exec | Only at creation time |
macOS — Unix Inheritance Plus a Few Twists
macOS comes from BSD, so naturally it supports fork() and exec(). But XNU’s internal implementation is slightly distinctive.
When BSD’s fork() is mapped down to Mach, what actually happens is:
- Clone the current `proc` struct
- Clone the current `task` at the Mach level (`task_create()`)
- Create an initial thread (`thread_create()`)
- Clone the address space too (Mach's `vm_map`, COW)
That is, a single BSD fork() call decomposes into several Mach-level operations. This is the practical face of the XNU dual structure.
Also interesting is macOS’s posix_spawn(). A POSIX standard that Apple actively promotes, it performs fork+exec in one call.
```c
posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
```
Why prefer it? Because of iOS. On iOS, fork() is forbidden for security reasons, and only posix_spawn() is allowed. The internal implementation can also be more efficient (it may even skip the COW page-table clone).
Hold on, let’s clarify this
“Why is fork() banned on iOS?”
Three reasons overlap.
- Sandbox-escape risk: a forked child inherits its parent’s privileges, and in iOS’s strict app sandbox model this boundary becomes a potential avenue for vulnerabilities
- Objective-C runtime state duplication: iOS apps are usually written in Objective-C or Swift, whose runtimes initialize lots of state at startup (threads, GCD queues, IOKit connections, etc.). Post-fork, this state easily falls out of consistency
- Memory efficiency: iOS is memory-constrained, and even COW still needs page-table cloning. `posix_spawn()` can skip even that

On macOS, `fork()` is still allowed, but Apple recommends `posix_spawn()` where possible.
Part 3: Threads — Why Processes Alone Aren’t Enough
Limits of Process-Based Concurrency
In 1970s–80s Unix, one process meant one execution flow. To do multiple things at once, you fork()ed multiple processes. A web server would create one process per connection (classic Apache prefork).
Problems with this model:
- Process creation cost: cheaper thanks to COW, but still microseconds to milliseconds for page-table cloning, PCB allocation, etc.
- Context switch cost: switching between processes also changes the address space, so TLB flushes are required (details below)
- IPC cost: since processes have separate address spaces, exchanging data requires heavy machinery like pipes, sockets, or shared memory
- Expressing shared state is hard: when multiple flows need to share the same data structure, it gets complicated
By the 1990s a solution was needed, and that was the thread.
Definition of a Thread
A thread is an independent flow of execution within a process. When multiple threads exist in one process, they all share the same address space but can execute simultaneously on CPUs.
What threads share:
- Text (code): naturally, they execute the same code
- Heap: memory allocated with `malloc`
- Data / BSS: globals and statics
- Open file descriptors
- Signal handlers
What threads keep private:
- Stack: each thread has its own
- CPU register state: PC, SP, general-purpose regs
- TLS (Thread-Local Storage): per-thread globals
- Error state: `errno` (in POSIX it's per-thread)
Key takeaways from this sharing model:

- Threads share heap and globals by default — "shared memory" exists naturally
- So two threads both doing `counter++` on the same `int counter` creates a race condition
- Two processes, by contrast, are naturally isolated because their address spaces are separate
The answer to Stage 2's key question — "why does the program crash only sometimes when two threads use the same variable?" — is hidden in this sharing model. Threads intentionally share memory, so concurrency issues arise, and they need synchronization techniques to manage them. (We cover that thoroughly in Part 10: Synchronization Primitives.)
TCB — The Thread Control Block
Just as processes have PCBs, threads have TCBs (Thread Control Blocks). A TCB holds:
- Thread ID
- CPU register state (saved context)
- Thread state (Running, Ready, Waiting)
- Stack pointer, stack base
- Scheduling info (priority etc.)
- Pointer to the owning process
Per-OS implementation:
- Linux: `task_struct` — processes and threads use the same struct, distinguished by which fields they share
- Windows: `KTHREAD` + `ETHREAD`
- macOS: Mach's `struct thread`
Linux’s Peculiar Philosophy — “Processes and Threads Are the Same”
Linus Torvalds made a bold decision in the 1990s: don't make processes and threads separate concepts; unify them as a single "execution unit."
In Linux, instead of fork(), there’s the more general clone() syscall. clone() specifies “what to share with the parent” as a bit flag.
```c
/* Linux clone() — concept */
clone(fn, stack, flags, arg);

/* Example flags: */
CLONE_VM       /* share address space (true → thread, false → process) */
CLONE_FS       /* share filesystem state */
CLONE_FILES    /* share file descriptors */
CLONE_SIGHAND  /* share signal handlers */
CLONE_THREAD   /* same thread group */
/* ... */
```
- `fork()` = `clone()` with all share flags OFF
- `pthread_create()` = `clone()` with all share flags ON
- Any combination in between is possible
That’s Linux’s “process and thread are on a continuum” worldview. Android for example uses “partially sharing” process clones in practice (the Zygote process).
TLS — Thread-Local Storage
Sometimes you need variables that look global but are actually independent per thread. That’s TLS.
The canonical example: errno. In POSIX, errno is “the error code of the last syscall,” but it must be per-thread (thread A’s failed read() must not be overwritten by thread B). So errno is implemented as TLS.
TLS declarations by language:
```c
/* C11 */
_Thread_local int counter = 0;

/* GCC/Clang extension */
__thread int counter = 0;
```
```cpp
// C++11
thread_local int counter = 0;
```
```csharp
// C#
[ThreadStatic]
static int counter;

// Or the more flexible ThreadLocal<T>
static ThreadLocal<int> counter = new ThreadLocal<int>(() => 0);
```
Practical uses in game development:
- Logging systems use TLS to store each thread’s name for inclusion in log lines
- Rendering assigns per-thread command buffers that are merged later
- Profilers track the current scope stack per thread
Part 4: Thread Models — 1:1, N:1, M:N
A deeper question: when you call pthread_create() or new Thread(), how does the kernel manage that thread?
Why the Question Matters
The unit that actually runs on a CPU is a kernel-level thread (KLT). Only the kernel schedules the CPU.
In contrast, the “thread” your program creates can be just a user-space abstraction. That’s called a user-level thread (ULT).
The mapping between user threads and kernel threads falls into three categories.
1:1 Model — The Current Linux/Windows Choice
In 1:1, each user-created thread maps to exactly one kernel thread. pthread_create() internally calls the clone() syscall and directly creates a kernel-managed task.
Linux NPTL (Native POSIX Thread Library): Since Linux 2.6, the glibc pthread implementation uses NPTL, a 1:1 model. Before that there was LinuxThreads, a nonstandard 1:1 implementation, which NPTL replaced on POSIX compliance and performance grounds.
Windows: CreateThread() creates a KTHREAD directly in the kernel. 1:1 again.
Pros: if one thread blocks, others keep running. Natural distribution across cores.
Cons: thread creation is relatively expensive; tens of thousands or more strain kernel memory.
N:1 Model — A Legacy
In N:1, multiple user threads map to one kernel thread. The kernel doesn’t know the process has multiple threads — it sees just one process.
This model was used in early Java “green threads,” in GNU Pth, and others. It was standard in the early 1990s but fatal drawbacks nearly drove it extinct:
- A blocking syscall stops everyone: one user thread blocking in
read()freezes all others sharing the kernel thread - No multicore use: one kernel thread lives on one core
M:N Model — Go’s Choice
M:N combines the two. M user threads map dynamically to a pool of N kernel threads (typically N = number of CPU cores).
Representative implementations:
- Go goroutines: the Go runtime has an M:N scheduler, running millions of goroutines over a handful of OS threads
- Erlang/Elixir: the BEAM VM implements its own scheduler
- Old Solaris (Solaris 2–8): implemented POSIX pthreads M:N, but Solaris 9 switched to 1:1 for complexity reasons
Theoretical grounding — Anderson et al.’s SOSP 1991 Scheduler Activations paper tackles “what kernel support is needed so a user-level thread library can efficiently implement M:N.” The key is that on blocking syscalls the kernel should wake the user scheduler so it can assign another user thread to another kernel thread.
The Go runtime implements a similar idea. When a goroutine tries a blocking syscall, the runtime detects it and either migrates that goroutine to another kernel thread or spawns a new one. So one net.Listen blocking doesn’t affect other goroutines.
From a Game Development Perspective
The threads used in Unity and Unreal are 1:1 at the C++/C# layer. new Thread() or std::thread creates kernel threads directly.
However, engines’ internal Job systems or Task graphs are effectively M:N schedulers. The programmer can queue thousands of “Jobs,” yet they run on the engine’s handful of worker threads. This ties directly into the Unity Job System design we address in detail in Part 13 (Lock-free and Structural Solutions).
Part 5: Three-OS Thread APIs Compared
Linux — pthreads
```c
#include <pthread.h>
#include <stdio.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```
POSIX-standard API. Internally it calls the clone() syscall. Officially it’s “pthread,” but Linux man pages are really documenting NPTL (the glibc implementation).
Windows — CreateThread / _beginthreadex
```c
#include <windows.h>
#include <process.h>
#include <stdio.h>

unsigned __stdcall worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return 0;
}

int main() {
    HANDLE t1, t2;
    int id1 = 1, id2 = 2;
    t1 = (HANDLE)_beginthreadex(NULL, 0, worker, &id1, 0, NULL);
    t2 = (HANDLE)_beginthreadex(NULL, 0, worker, &id2, 0, NULL);
    WaitForSingleObject(t1, INFINITE);
    WaitForSingleObject(t2, INFINITE);
    CloseHandle(t1);
    CloseHandle(t2);
    return 0;
}
```
Why not CreateThread? CreateThread skips CRT (C Runtime Library) initialization — so thread-local state like errno and strtok isn’t set up, causing subtle bugs. _beginthreadex initializes the CRT, so for C/C++ code you should use it.
macOS — pthreads + libdispatch
```c
/* POSIX style — same as Linux */
#include <pthread.h>
/* ... */

/* libdispatch (GCD) style — Apple's preferred */
#include <dispatch/dispatch.h>
#include <stdio.h>

int main() {
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
        printf("Running in background\n");
        dispatch_async(dispatch_get_main_queue(), ^{
            printf("Back to main thread\n");
        });
    });
    dispatch_main();
    return 0;
}
```
macOS supports pthreads too, but Apple recommends GCD (Grand Central Dispatch). We covered the rationale in Part 7 — no manual thread lifetime management, QoS-based routing to P/E cores, a predictable queue abstraction.
C# — Language-Level Abstraction
C# works on all three OSes. The .NET runtime (CLR or CoreCLR) hides OS differences.
```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// 1) Most primitive — rarely used today
Thread t = new Thread(() => Console.WriteLine("Hello"));
t.Start();
t.Join();

// 2) ThreadPool — thread reuse
ThreadPool.QueueUserWorkItem(_ => Console.WriteLine("Hello"));

// 3) Task / async-await — modern default
await Task.Run(() => HeavyComputation());

// 4) Parallel — data parallelism
Parallel.For(0, 100, i => ProcessItem(i));
```
Underneath:
- Linux: libcoreclr uses
pthread_create() - Windows: uses
CreateThread() - macOS: uses
pthread_create()(doesn’t use GCD directly)
Unity's quirk: Unity discourages raw Thread usage, nudging you toward the Job System, UniTask, and coroutines instead, because most Unity Engine APIs crash if called outside the main thread. (See Part 13 for details.)
Part 6: Context Switching — Why It’s Expensive
What a Context Switch Is
To run multiple threads alternately on one CPU core, the OS saves the current thread’s state and restores the next thread’s state. That’s context switching.
What must be saved:
- CPU registers: RAX, RBX, …, RIP (program counter), RSP (stack pointer), flags
- Floating-point registers: XMM, YMM, ZMM (tens of KB in the AVX era)
- MMU state: on a process switch, the page table pointer (CR3 on x86) changes too
The “Hidden Cost” of Context Switching
Saving and restoring registers is just the tip of the iceberg. The real cost is indirect effects.
TLBs and Process-to-Process Switches
The TLB (Translation Lookaside Buffer) is a small CPU cache that stores “virtual address → physical address” lookups. Typical L1 TLBs have 64–128 entries.
When a process switch occurs, the CR3 register (the page-table base) changes, and the TLB is fully flushed (absent PCID/ASID optimizations). Every subsequent memory access then has to walk the page tables again.
Thread-to-thread switches are cheaper — threads share an address space, so CR3 doesn’t change and the TLB isn’t flushed. That’s one concrete reason “threads are lighter than processes.”
Measuring It
On Linux you can measure with perf stat:
```shell
$ perf stat -e context-switches,cpu-migrations,cache-misses -p <PID> sleep 10

 Performance counter stats for process id '1234':

        12,345      context-switches
           567      cpu-migrations
    10,234,567      cache-misses
```
On macOS, Instruments’s System Trace template lets you observe thread scheduling and context switches at microsecond resolution.
On Windows, Xperf and Windows Performance Analyzer fill the same role.
A LaMarca & Ladner Observation
As LaMarca & Ladner 1996, “The Influence of Caches on the Performance of Heaps” argues, theoretical asymptotic complexity alone cannot predict real performance. By the same token, the naive expectation that “more threads = faster” breaks down because of cache/TLB costs.
The rule “optimal thread count = core count” comes from this observation. Beyond that, context switching eats the gains.
Part 7: Game Engine Execution Models
Now we link theory to game engines.
Unity — The Hard Main-Thread Constraint
If you’ve used Unity, you’ve likely seen the warning “this API can only be called on the main thread.” Most Unity Engine APIs — Transform.position, GameObject.Instantiate(), Renderer.sharedMaterial, etc. — are main-thread only.
Why?
The Unity Engine is written in C++ and its internal data structures have no locks. Unity’s team assumed “all engine calls come from the main thread” by design, eliminating lock-acquisition overhead.
This is a deliberate trade-off:
- ✅ Engine calls are very fast (no locks)
- ❌ Multithreaded use is awkward
Unity’s answer: Job System + Burst + Native Containers. Leave the main thread alone and provide a separate layer that parallelizes only the data processing. (Details in Part 13.)
Unreal Engine — The Task Graph
Unreal Engine uses a Task Graph system. “Tasks” submitted by game code form a dependency DAG that the engine spreads across a worker thread pool.
Unreal’s worker threads:
- Game Thread: game logic (Unity’s main thread equivalent)
- Render Thread: build rendering commands
- RHI Thread: GPU driver calls
- Worker Threads: general-purpose work
Tasks specify their target via ENamedThreads. Examples: ENamedThreads::GameThread, ENamedThreads::AnyBackgroundHiPriTask.
Fiber — Naughty Dog’s Approach
Christian Gyrling’s GDC 2015 talk “Parallelizing the Naughty Dog Engine Using Fibers” is famous for its fiber-based engine design.
A fiber is a cooperative user-level thread. The OS isn’t involved; the application switches them itself. If kernel threads are workers, fibers are the tasks the workers are carrying at the moment.
- Fiber creation cost: extremely cheap (nanoseconds)
- Fiber switch: save/restore registers only, no kernel involvement
- Can dispatch thousands
Naughty Dog’s The Last of Us Part II used this system to reliably exploit the PS4’s 7 cores. Fibers can be viewed as one form of the M:N model (fibers = user threads, kernel threads = workers).
Windows fiber API: CreateFiber, SwitchToFiber. On macOS/Linux you’d use ucontext.h’s makecontext/swapcontext (legacy, discouraged) or libraries like Boost.Context and libco.
Engine Execution Models Compared

| Engine | Main execution model | Parallelism layer |
|---|---|---|
| Unity | Single main thread, lock-free engine internals | C# Job System + Burst |
| Unreal | Named threads (Game / Render / RHI) | Task Graph over worker threads |
| Naughty Dog | Fiber-based cooperative scheduling | Thousands of fibers on per-core workers |
Part 8: Hands-On — How Are My Threads Actually Running?
Once you know the theory, it’s time to actually look. All three OSes ship rich tools for observing processes and threads.
Linux — /proc, ps, top
On Linux everything is exposed in the /proc virtual filesystem.
```shell
# List threads of a specific process
$ ls /proc/<PID>/task/
1234  1235  1236 ...

# State of each thread
$ cat /proc/1234/task/1234/status
Name:    myapp
State:   R (running)
Tgid:    1234
Pid:     1234
Threads: 8

# Address-space mappings
$ cat /proc/1234/maps
00400000-00452000 r-xp 00000000 08:01 12345  /usr/bin/myapp
00651000-00652000 r--p 00051000 08:01 12345  /usr/bin/myapp
7f1234000000-7f1234021000 r-xp 00000000 08:01 54321  /lib/x86_64-linux-gnu/libc.so.6
...
```
top -H shows per-thread CPU usage.
macOS — Activity Monitor, ps, Instruments
Activity Monitor is the GUI tool, but more precise data lives in CLI tools.
1
2
3
4
5
# Show thread count for a process
$ ps -M <PID>
# Detailed info
$ sample <PID> 5 -mayDie
The most powerful option is Instruments’ System Trace template. It shows a per-P/E-core execution timeline, context-switch events, and blocking causes. It’s especially useful on Apple Silicon — visualizing which threads ran on P-cores and which were pushed to E-cores.
Windows — Process Explorer, WPA
Process Explorer (Sysinternals) is a beefed-up Task Manager:
- Visualized process tree
- Thread list per process with stack traces
- Handles, DLLs, memory details
Windows Performance Analyzer (WPA) is the Instruments equivalent, analyzing ETW events collected via Xperf.
Threads in C# — Code Example
```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ThreadInspector {
    static void Main() {
        Console.WriteLine($"Current process ID: {Process.GetCurrentProcess().Id}");
        Console.WriteLine($"Managed thread ID: {Thread.CurrentThread.ManagedThreadId}");
        Console.WriteLine($"CPU core count: {Environment.ProcessorCount}");

        // Measure thread-creation cost
        var sw = Stopwatch.StartNew();
        var threads = new Thread[100];
        for (int i = 0; i < 100; i++) {
            threads[i] = new Thread(() => Thread.Sleep(1));
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        sw.Stop();
        Console.WriteLine($"100 thread create+join: {sw.ElapsedMilliseconds}ms");

        // ThreadPool.QueueUserWorkItem is much faster
        sw.Restart();
        var countdown = new CountdownEvent(100);
        for (int i = 0; i < 100; i++) {
            ThreadPool.QueueUserWorkItem(_ => {
                Thread.Sleep(1);
                countdown.Signal();
            });
        }
        countdown.Wait();
        sw.Stop();
        Console.WriteLine($"100 ThreadPool items: {sw.ElapsedMilliseconds}ms");
    }
}
```
Run output (approximate on my machine):
```text
Current process ID: 12345
Managed thread ID: 1
CPU core count: 8
100 thread create+join: 85ms
100 ThreadPool items: 8ms
```
10× difference. That’s the practical case for thread pools. .NET’s ThreadPool, Java’s ExecutorService, C++’s std::async all share the idea — amortize creation cost by reusing threads.
Wrap-up
What this post covered:
Processes:
- PCB (
task_struct,EPROCESS,proc+task) — how the OS tracks a process - Address space layout: Text, Data, BSS, Heap, Stack, Kernel
- State transitions: New, Ready, Running, Waiting, Terminated
Process creation:
- Unix
fork() + exec()— two steps, Copy-on-Write keeps it fast - Windows
CreateProcess()— one step, many parameters - macOS
posix_spawn()— iOS-compatible, more efficient - COW in fork() relies on hardware MMU support
Threads:
- Process vs thread: whether the address space is shared is the key
- Shared: text, data, heap, file descriptors
- Private: stack, registers, TLS
- Linux’s peculiar philosophy: same struct for process and thread (
clone())
Thread mapping models:
- 1:1 (Linux NPTL, Windows): standard, true parallelism
- N:1 (old green threads): nearly obsolete
- M:N (Go goroutines, Erlang): millions of concurrent threads, complex runtime
Context switching:
- Direct cost: register save/restore ~1–10μs
- Hidden cost: TLB flush, cache pollution, branch predictor pollution
- Process switches are more expensive than thread switches (CR3 change)
- “Thread count = core count” rule
Game engine execution models:
- Unity: main thread constraint + Job System (data parallelism)
- Unreal: multiple named threads + Task Graph
- Naughty Dog engine: fiber-based cooperative scheduling
The next post is Part 9 Scheduling — when several threads are all Ready, who does the OS hand the CPU to? We look at Linux’s CFS → EEVDF, Windows’ priority boost, and macOS’s QoS-based scheduling. We also cover the 16.67 ms game frame budget and the priority-inversion problem.
References
Textbooks
- Silberschatz, Galvin, Gagne — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.3 (Processes), Ch.4 (Threads)
- Bovet, Cesati — Understanding the Linux Kernel, 3rd ed., O'Reilly, 2005 — `task_struct` and process management, Ch.3
- Mauerer — Professional Linux Kernel Architecture, Wrox, 2008 — modern Linux kernel internals
- Russinovich, Solomon, Ionescu — Windows Internals, 7th ed., Microsoft Press, 2017 — EPROCESS/ETHREAD details
- Singh — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — XNU’s task/proc dual structure
- Butenhof — Programming with POSIX Threads, Addison-Wesley, 1997 — the classic pthreads reference
- Stevens, Rago — Advanced Programming in the UNIX Environment, 3rd ed., Addison-Wesley, 2013 — fork/exec in practice
- Gregory — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8 multiprocessor engine design
Papers
- Anderson, Bershad, Lazowska, Levy — “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism”, SOSP 1991 — DOI — theoretical basis for the M:N model
- Mogul, Borg — “The Effect of Context Switches on Cache Performance”, ASPLOS 1991 — measuring the hidden cost of context switches
- Engelschall — “Portable Multithreading: The Signal Stack Trick for User-Space Thread Creation”, USENIX 2000 — implementing user-level threads
- Kleiman, Smaalders — “The LWP Framework: Building and Debugging Mach Tasks and Threads”, Mach Workshop 1990 — Mach’s thread model
Official Docs
- Linux man pages — `clone(2)`, `fork(2)`, `pthread_create(3)`, `proc(5)` — man7.org
- Apple Developer — Threading Programming Guide — developer.apple.com
- Apple Developer — Dispatch — developer.apple.com/documentation/dispatch
- Microsoft Docs — Processes and Threads — learn.microsoft.com
- Microsoft Docs — Fibers — learn.microsoft.com
- Go Runtime — The Go Scheduler (Dmitry Vyukov) — morsmachine.dk/go-scheduler
Game Development / GDC Resources
- Gyrling, C. — Parallelizing the Naughty Dog Engine Using Fibers, GDC 2015 — gdcvault.com
- Unity Technologies — C# Job System manual — docs.unity3d.com
- Unreal Engine Documentation — Task Graph System — dev.epicgames.com
- Fabian Giesen — Reading List on Multithreading and Synchronization — fgiesen.wordpress.com
Blogs / Articles
- Raymond Chen — The Old New Thing — internals of Win32 CreateProcess
- Linus Torvalds — early comp.os.minix thread-related discussions (1992)
- Dmitry Vyukov — 1024cores.net — lock-free concurrency reference (including Go scheduler internals)
- Howard Oakley — The Eclectic Light Company — macOS thread-observability techniques
Tools
- Linux: `ps`, `top`, `htop`, `strace`, `perf`, `ftrace`
- macOS: Activity Monitor, `ps`, `sample`, Instruments (System Trace, Time Profiler)
- Windows: Task Manager, Process Explorer, WPA, PerfView
- Cross-platform: Tracy Profiler — great for embedding in games
