CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
TL;DR — Key Takeaways
  • A process is "an isolated address space plus a bundle of resources," and a thread is "a flow of execution inside a process." Threads share code, heap, and globals but keep the stack and registers private
  • Unix creates processes in two steps — fork() clones the parent, then exec() overwrites the clone with a new program — while Windows' CreateProcess() builds a new process in one shot. The clone sounds expensive, but Copy-on-Write makes it fast in practice
  • Thread models split into 1:1 (Linux NPTL, Windows), N:1 (green threads), M:N (Go goroutines, Erlang) — each trades performance against implementation complexity differently
  • A context switch doesn't just save/restore registers; it also causes TLB flushes and cache pollution. Modern game engines therefore move toward "break work into Jobs/TaskGraph/Fibers and distribute over cores" rather than "spawn more threads"

Introduction: From the Map to the Body

The previous post surveyed the lineage and skeleton of the three operating systems: Linux’s monolithic kernel, Windows NT’s hybrid design, and macOS’s XNU with its Mach + BSD dual structure. If that was the map, this post is the body of the journey.

Let’s bring Stage 2’s key question back.

“When two threads use the same variable, why does the program crash only sometimes?”

To answer this, we first need to know “what a thread is” precisely. And to understand threads, we need to first understand their parent concept, the process. The distinction between processes and threads, how the two share and separate memory, and how the OS abstracts all of this — these are the starting points of every concurrency problem.

What we cover in this post:

  • Processes: PCBs and address-space layouts. Linux’s task_struct, Windows’ EPROCESS, macOS’s proc/task
  • Process creation: Unix’s two-step fork()+exec() model, Windows’ single-call CreateProcess(), and Copy-on-Write
  • Threads: why processes alone aren’t enough, TCBs, shared vs. private regions, TLS
  • Thread mapping models: 1:1, N:1, M:N — why Go’s goroutines are so cheap
  • Context switching: the real cost of registers, TLBs, and caches
  • Game engine execution models: Unity’s main thread, Unreal’s TaskGraph, Naughty Dog’s fibers

We keep the game-dev lens throughout, but this post carries more foundational theory than previous ones. The next post (scheduling) and the one after (synchronization) are built on top of it.


Part 1: Processes — The Execution Unit the OS Sees

What Is a Process?

Start with the textbook definition. A process is a program in execution. The .exe file on disk or the Mach-O binary is a program; the instance of it loaded into memory and running on the CPU is a process.

A process owns:

  1. A unique address space — memory isolated from other processes
  2. Execution state — CPU register values, program counter
  3. An open-files table — the list of file descriptors currently in use
  4. Ownership info — UID, GID, permissions
  5. Parent-child relationships — who created whom (the process tree)

The OS manages all of this via a single struct. That’s the PCB (Process Control Block), also called the process descriptor.

The PCB in Practice — Per-OS Structs

Linux — task_struct

In the Linux kernel, processes (and threads) are represented by struct task_struct. It’s defined in include/linux/sched.h and is a huge struct with hundreds of fields.

/* Linux kernel task_struct (kernel 6.x, heavily simplified) */
struct task_struct {
    /* State */
    unsigned int           __state;          /* TASK_RUNNING etc. */

    /* Identifiers */
    pid_t                  pid;              /* process id */
    pid_t                  tgid;             /* thread group id */
    struct task_struct    *parent;           /* parent process */
    struct list_head       children;         /* child list */

    /* Memory */
    struct mm_struct      *mm;               /* address space */

    /* Files */
    struct files_struct   *files;            /* open files table */

    /* Scheduling */
    int                    prio;
    struct sched_entity    se;               /* CFS scheduling entity */

    /* signals, resource limits, and hundreds more... */
};

The real struct is over 700 lines. Crucially, in Linux a process and a thread share the same struct. We come back to this peculiarity later.
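
You can see this unification from user space: getpid() returns the thread-group id (what we usually call the PID), while every thread also has its own task id. A minimal sketch of that, assuming glibc 2.30+ for gettid() (older systems need syscall(SYS_gettid)); build with -pthread:

/* Sketch: every Linux thread is a task with its own tid; all threads in a
 * process share one tgid, which is what getpid() reports. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker: pid (tgid) = %d, tid = %d\n", getpid(), gettid());
    return NULL;
}

int main(void) {
    printf("main  : pid (tgid) = %d, tid = %d\n", getpid(), gettid());
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}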

Windows — EPROCESS, KPROCESS

Windows NT splits across two layers:

  • KPROCESS (Kernel Process Block) — minimal scheduling-related info
  • EPROCESS (Executive Process Block) — wraps KPROCESS and adds more
/* Conceptual pseudocode — see WinDbg or leaked NT sources for the real thing */
typedef struct _EPROCESS {
    KPROCESS Pcb;                    /* kernel process block (inherited) */
    HANDLE UniqueProcessId;          /* PID */
    LIST_ENTRY ActiveProcessLinks;   /* global process list */
    PVOID SectionBaseAddress;        /* image load address */
    PVOID Token;                     /* security token */
    /* ... */
} EPROCESS;

macOS — proc + task

macOS’s dual structure shows up here too. The BSD layer holds the classic Unix struct proc; the Mach layer holds struct task.

/* BSD side — bsd/sys/proc_internal.h */
struct proc {
    pid_t                  p_pid;           /* POSIX process ID */
    struct proc           *p_pptr;          /* parent */
    struct task           *task;            /* link to Mach task */
    /* ... */
};

/* Mach side — osfmk/kern/task.h */
struct task {
    queue_head_t           threads;         /* threads belonging to this task */
    vm_map_t               map;             /* address space */
    ipc_space_t            itk_space;       /* Mach port space */
    /* ... */
};

So when fork() creates a process on macOS, a BSD proc and a Mach task are created as a pair. Unix tools (ps, top) look at the proc; Mach-based tools (lldb, Instruments) look at the task.

Process Address-Space Layout

How is a process’s memory laid out? Here’s the classic Unix/Linux 32-bit layout.

Diagram — process address space (conceptual), from the high address (0xFFFFFFFF) down to the low address (0x00400000): kernel space (no direct access from user processes); the stack (call frames, locals — grows downward); an unused gap the stack can grow into; the mmap region where shared libraries (libc, libdl, heap extensions, etc.) are mapped; the heap (malloc/new — grows upward via brk/sbrk); BSS (uninitialized data); initialized data; read-only data (.rodata); and the text segment holding executable machine code. Permissions: writable regions RW, .rodata R, text RX.

A tour of each region (from low address upward):

  • Text (.text): executable machine code. Allowed read + execute only; writes cause a segfault
  • Read-only data (.rodata): string literals ("Hello"), constant arrays. Read-only
  • Data (.data): initialized globals and statics (int x = 42;). The initial values sit in the file
  • BSS (Block Started by Symbol): zero-initialized globals (int x;, static char buf[1024];). The file only records the size; the OS zeroes out the memory at execution — a trick to shrink the binary on disk
  • Heap: dynamic allocations (malloc, new). Grows upward via the brk() syscall
  • Shared library region (mmap): libc.so, libstdc++.so etc. are mapped here via mmap()
  • Stack: call frames, locals, return addresses. Grows downward
  • Kernel space: kernel code and data. User processes have no direct access. On 32-bit Linux it’s the top 1 GB; on x86-64 it’s the top half

Windows uses different section names in PE but the structure is nearly the same (.text, .data, .rdata, .bss).
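
To make the tour above tangible, here’s a quick sketch (variable names are just for illustration) that prints one address per region on Linux or macOS. The exact values change every run because of ASLR; only the relative ordering matters:

#include <stdio.h>
#include <stdlib.h>

const char *ro_msg = "hello";            /* the literal itself lives in .rodata */
int initialized_global = 42;             /* .data                               */
int uninitialized_global;                /* .bss                                */

int main(void) {
    int local = 0;                       /* stack                               */
    int *heap_ptr = malloc(sizeof(int)); /* heap                                */

    printf("text  (main)          : %p\n", (void *)main);
    printf("rodata(string literal): %p\n", (void *)ro_msg);
    printf("data  (initialized)   : %p\n", (void *)&initialized_global);
    printf("bss   (uninitialized) : %p\n", (void *)&uninitialized_global);
    printf("heap  (malloc)        : %p\n", (void *)heap_ptr);
    printf("stack (local)         : %p\n", (void *)&local);

    free(heap_ptr);
    return 0;
}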

Process States

A process moves through multiple states. The standard model from Silberschatz:

Diagram — process state transitions (Silberschatz model): New → Ready (admitted); Ready → Running (scheduler dispatch); Running → Ready (interrupt); Running → Waiting (I/O or event wait); Waiting → Ready (I/O or event completion); Running → Terminated (exit).
  • New: process just created
  • Ready: runnable but waiting for a CPU
  • Running: actually executing on a CPU
  • Waiting (or Blocked): waiting for I/O or an event
  • Terminated: exited

Real OSes have many more states. Linux’s task_struct has TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_TRACED, TASK_DEAD, TASK_WAKEKILL, TASK_WAKING, TASK_PARKED, and more. The letters S, R, D, Z you see in ps are these states.

$ ps aux
USER  PID  %CPU %MEM  COMMAND
root   1   0.0  0.1   /sbin/init           <- S (sleeping)
www    1234 2.1  1.5   nginx: worker        <- R (running)
root   5678 0.0  0.0   [kworker/u8:2]       <- D (uninterruptible sleep)

D state (uninterruptible sleep) matters for game developers too — it means waiting on disk I/O or a driver request, and even kill -9 doesn’t work in this state. A lot of “unresponsive processes” are stuck in D.


Part 2: Process Creation — fork, exec, CreateProcess

Now for how processes are created. This is where the philosophical differences among the three OSes become sharpest.

Unix: fork() + exec() — The Two-Step Model

Unix’s idea is “duplicate the parent, then overwrite.”

#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();   /* step 1: clone yourself */

    if (pid == 0) {
        /* child */
        execl("/bin/ls", "ls", "-l", NULL);   /* step 2: overwrite with a new program */
        /* not reached if execl succeeds */
    } else if (pid > 0) {
        /* parent */
        int status;
        waitpid(pid, &status, 0);             /* wait for the child */
    } else {
        perror("fork failed");
    }
    return 0;
}

A single call to fork() returns twice. It returns the child’s PID to the parent and 0 to the child. An odd API.

What fork() does (naive implementation):

  1. Create a new PCB (task_struct)
  2. Copy the parent’s entire address space (text, data, heap, and stack)
  3. Copy open file descriptors too
  4. Assign a new PID to the child
  5. Put the child on the ready queue

Step 2 is the problem. When a process’s address space is hundreds of MB, copying it every time is hugely expensive. And if exec() is called right after fork(), the address space is overwritten anyway — you copied only to throw away.

Copy-on-Write — “Actually Copy Only On Write”

The answer is Copy-on-Write (COW). At fork() time, only the page tables are copied, and the actual memory pages are shared between parent and child — but marked read-only.

When either side tries to write to a page, the hardware raises a page fault, and only then does the OS copy that one page.

Diagram — fork() + Copy-on-Write in practice:

  1. Right after fork(): only the page table is copied (fast); the physical pages stay shared between parent and child, marked read-only
  2. The child tries to write: the CPU raises a page fault and hands control to the OS
  3. The OS copies just the written page, restores read-write access on it, and leaves everything else shared

The upshot: fork() is not “copy” but “share + lazy copy.” The fork() call itself does only page-table-sized work (microseconds); if the child exec()s without writing most pages, the copy cost is near zero; the granularity is one page (typically 4 KB or 16 KB), so a single written byte copies the whole page. Linux uses this path for all task creation, which is why spawning processes is so fast.

COW requires hardware support — the CPU’s MMU (Memory Management Unit) must enforce per-page protection and raise page faults, otherwise the OS has no hook to intervene. Page-level MMU is the foundation for nearly every modern-OS trick (COW, swap, mmap, shared memory).
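
A small sketch to see that copy semantics still hold despite the sharing: the parent fills a buffer, the child writes a single byte after fork(), and the parent’s copy stays untouched — only that one page was ever duplicated. (The 64 MB size is arbitrary; fork() stays fast regardless.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 64 * 1024 * 1024;
    char *buf = malloc(size);
    memset(buf, 'A', size);

    pid_t pid = fork();                 /* copies page tables, not the 64 MB     */
    if (pid == 0) {
        buf[0] = 'B';                   /* page fault → OS copies just this page */
        printf("child : buf[0] = %c\n", buf[0]);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: buf[0] = %c\n", buf[0]);   /* still 'A' — copy semantics hold */
    free(buf);
    return 0;
}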

Windows: CreateProcess() — A Single Call

Windows took a different path. There is no parent-cloning concept; it builds a new process from scratch.

#include <windows.h>

int main() {
    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi;

    BOOL ok = CreateProcess(
        "C:\\Windows\\System32\\notepad.exe",  /* executable */
        NULL,                                   /* command line */
        NULL, NULL,                             /* process/thread security */
        FALSE,                                  /* inherit handles? */
        0,                                      /* creation flags */
        NULL, NULL,                             /* environment, working dir */
        &si, &pi);

    if (ok) {
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
    }
    return 0;
}

Unix’s fork() takes no parameters; CreateProcess() takes ten. That’s the Windows philosophy of “stuff every configurable option for process creation into one function.”

Trade-offs:

| Aspect | Unix fork()+exec() | Windows CreateProcess() |
| --- | --- | --- |
| API complexity | Two steps, each simple | One step, many parameters |
| Process creation cost | Very cheap via COW | Relatively expensive |
| Shell implementation | Natural (fork → set up redirections → exec) | Needs a separate API like ShellExecute |
| Security | Parent handles inherit automatically (error-prone) | Inheritance is explicit |
| Flexibility | Arbitrary code between fork and exec | Only at creation time |

macOS — Unix Inheritance Plus a Few Twists

macOS comes from BSD, so naturally it supports fork() and exec(). But XNU’s internal implementation is slightly distinctive.

When BSD’s fork() is mapped down to Mach, what actually happens is:

  1. Clone the current proc struct
  2. Clone the current task at the Mach level (task_create())
  3. Create an initial thread (thread_create())
  4. Clone the address space too (Mach’s vm_map, COW)

That is, a single BSD fork() call decomposes into several Mach-level operations. This is the practical face of the XNU dual structure.

Also interesting is macOS’s posix_spawn(). A POSIX standard that Apple actively promotes, it performs fork+exec in one call.

posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
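
Fleshed out into a runnable sketch (error handling kept minimal):

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    pid_t pid;
    char *argv[] = { "ls", "-l", NULL };

    /* no file actions, default spawn attributes */
    int err = posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return 1;
    }
    waitpid(pid, NULL, 0);   /* the child is an ordinary child — wait as usual */
    return 0;
}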

Why prefer it? Because of iOS. On iOS, fork() is forbidden for security reasons, and only posix_spawn() is allowed. The internal implementation can also be more efficient (it may even skip the COW page-table clone).

Hold on, let’s clarify this

“Why is fork() banned on iOS?”

Three reasons overlap.

  1. Sandbox-escape risk: a forked child inherits its parent’s privileges, and in iOS’s strict app sandbox model this boundary becomes a potential avenue for vulnerabilities
  2. Objective-C runtime state duplication: iOS apps are usually written in Objective-C or Swift, whose runtimes initialize lots of state at startup (threads, GCD queues, IOKit connections, etc.). Post-fork, this state easily falls out of consistency
  3. Memory efficiency: iOS is memory-constrained, and even COW still needs page-table cloning. posix_spawn() can skip even that

On macOS, fork() is still allowed, but Apple recommends posix_spawn() where possible.


Part 3: Threads — Why Processes Alone Aren’t Enough

Limits of Process-Based Concurrency

In 1970s–80s Unix, one process meant one execution flow. To do multiple things at once, you fork()ed multiple processes. A web server would create one process per connection (classic Apache prefork).

Problems with this model:

  1. Process creation cost: cheaper thanks to COW, but still microseconds to milliseconds for page-table cloning, PCB allocation, etc.
  2. Context switch cost: switching between processes also changes the address space, so TLB flushes are required (details below)
  3. IPC cost: since processes have separate address spaces, exchanging data requires heavy machinery like pipes, sockets, or shared memory
  4. Expressing shared state is hard: when multiple flows need to share the same data structure, it gets complicated

By the 1990s a solution was needed, and that was the thread.

Definition of a Thread

A thread is an independent flow of execution within a process. When multiple threads exist in one process, they all share the same address space but can execute simultaneously on CPUs.

What threads share:

  • Text (code): naturally, they execute the same code
  • Heap: memory allocated with malloc
  • Data / BSS: globals and statics
  • Open file descriptors
  • Signal handlers

What threads keep private:

  • Stack: each thread has its own
  • CPU register state: PC, SP, general-purpose regs
  • TLS (Thread-Local Storage): per-thread globals
  • Error state: errno (in POSIX it’s per-thread)

Diagram — memory sharing, processes vs. threads: separate processes each carry their own text, data/BSS, heap, stack, registers, and file descriptors, so nothing is exchanged without IPC (pipes, sockets, shared memory); threads inside one process share text, data/BSS, heap, and file descriptors while keeping per-thread stacks, registers, and TLS — and that directly shared heap/data is the root of races.

Key takeaways from this diagram:

  1. Threads share heap and globals by default — “shared memory” exists naturally
  2. So two threads both doing counter++ on the same int counter creates a race condition
  3. Two processes, by contrast, are naturally isolated because their address spaces are separate

The answer to Stage 2’s key question — “why does the program crash only sometimes when two threads use the same variable?” — is hidden in this diagram. Threads intentionally share memory, so concurrency issues arise, and they need synchronization techniques to manage them. (We cover that thoroughly in Part 10: Synchronization Primitives.)
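
Here is a minimal sketch of that race with pthreads — two threads bump the same global a million times each, and the final count is almost always short of 2,000,000 and different on every run:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;             /* shared: lives in the data segment */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                   /* read-modify-write — not atomic    */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}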

TCB — The Thread Control Block

Just as processes have PCBs, threads have TCBs (Thread Control Blocks). A TCB holds:

  • Thread ID
  • CPU register state (saved context)
  • Thread state (Running, Ready, Waiting)
  • Stack pointer, stack base
  • Scheduling info (priority etc.)
  • Pointer to the owning process

Per-OS implementation:

  • Linux: task_struct — processes and threads use the same struct, distinguished by which fields they share
  • Windows: KTHREAD + ETHREAD
  • macOS: Mach’s struct thread

Linux’s Peculiar Philosophy — “Processes and Threads Are the Same”

Linus Torvalds made a bold decision in the 1990s: “Don’t make processes and threads separate concepts; unify them as a single ‘execution unit.’”

In Linux, instead of fork(), there’s the more general clone() syscall. clone() specifies “what to share with the parent” as a bit flag.

/* Linux clone() — concept */
clone(fn, stack, flags, arg);

/* Example flags: */
CLONE_VM       /* share address space (true → thread, false → process) */
CLONE_FS       /* share filesystem state */
CLONE_FILES    /* share file descriptors */
CLONE_SIGHAND  /* share signal handlers */
CLONE_THREAD   /* same thread group */
/* ... */
  • fork() = clone() with all share flags OFF
  • pthread_create() = clone() with all share flags ON
  • Any combination in between is possible

That’s Linux’s “process and thread are on a continuum” worldview. Android, for example, relies on partially shared process clones in practice (the Zygote process).
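
As a rough, Linux-only illustration of that continuum (simplified — a real pthread_create() passes several more flags), clone() with CLONE_VM creates a child that shares the parent’s address space, so the child’s write is visible to the parent:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared = 0;

static int child_fn(void *arg) {
    (void)arg;
    shared = 42;                     /* same address space as the parent */
    return 0;
}

int main(void) {
    size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);            /* clone() needs a child stack */

    /* the stack grows down on x86/ARM, so pass the top of the allocation */
    pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);

    printf("shared = %d\n", shared);             /* 42 — visible thanks to CLONE_VM */
    free(stack);
    return 0;
}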

TLS — Thread-Local Storage

Sometimes you need variables that look global but are actually independent per thread. That’s TLS.

The canonical example: errno. In POSIX, errno is “the error code of the last syscall,” but it must be per-thread (thread A’s failed read() must not be overwritten by thread B). So errno is implemented as TLS.

TLS declarations by language:

/* C11 */
_Thread_local int counter = 0;

/* GCC/Clang extension */
__thread int counter = 0;
// C++11
thread_local int counter = 0;
// C#
[ThreadStatic]
static int counter;

// Or the more flexible ThreadLocal<T>
static ThreadLocal<int> counter = new ThreadLocal<int>(() => 0);

Practical uses in game development:

  • Logging systems use TLS to store each thread’s name for inclusion in log lines (see the sketch after this list)
  • Rendering assigns per-thread command buffers that are merged later
  • Profilers track the current scope stack per thread
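
A minimal sketch of the logging case — each thread stamps its own name into log lines through a C11 _Thread_local variable (the thread names are made up for the example):

#include <pthread.h>
#include <stdio.h>

static _Thread_local const char *thread_name = "main";   /* one copy per thread */

static void log_msg(const char *msg) {
    printf("[%s] %s\n", thread_name, msg);   /* reads the calling thread's copy */
}

static void *worker(void *arg) {
    thread_name = (const char *)arg;         /* set once, visible only in this thread */
    log_msg("loading assets");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "render");
    pthread_create(&t2, NULL, worker, "audio");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    log_msg("done");                         /* still prints [main] */
    return 0;
}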

Part 4: Thread Models — 1:1, N:1, M:N

A deeper question: when you call pthread_create() or new Thread(), how does the kernel manage that thread?

Why the Question Matters

The unit that actually runs on a CPU is a kernel-level thread (KLT). Only the kernel schedules the CPU.

In contrast, the “thread” your program creates can be just a user-space abstraction. That’s called a user-level thread (ULT).

The mapping between user threads and kernel threads falls into three categories.

Diagram — user threads ↔ kernel threads mapping models:

| Model | Examples | Pros | Cons |
| --- | --- | --- | --- |
| 1:1 (one-to-one) | Linux NPTL, Windows | Simple to implement; true multicore parallelism; uses the kernel scheduler | Thread creation is costly; thousands exhaust kernel resources; context switches are heavy |
| N:1 (many-to-one) | Old green threads, GNU Pth | Extremely cheap creation; supports hundreds of thousands; user-level scheduler freedom | No parallelism (one core only); a blocking syscall stops everyone; rarely used today |
| M:N (many-to-many) | Go, Erlang, old Solaris | Cheap threads plus real parallelism; millions of goroutines | Complex runtime; scheduling fairness issues; harder to debug |

1:1 Model — The Current Linux/Windows Choice

In 1:1, each user-created thread maps to exactly one kernel thread. pthread_create() internally calls the clone() syscall and directly creates a kernel-managed task.

Linux NPTL (Native POSIX Thread Library): Since Linux 2.6, the glibc pthread implementation uses NPTL, a 1:1 model. Before that there was LinuxThreads, a nonstandard 1:1 implementation, which NPTL replaced on POSIX compliance and performance grounds.

Windows: CreateThread() creates a KTHREAD directly in the kernel. 1:1 again.

Pros: if one thread blocks, others keep running. Natural distribution across cores.

Cons: thread creation is relatively expensive; tens of thousands or more strain kernel memory.

N:1 Model — A Legacy

In N:1, multiple user threads map to one kernel thread. The kernel doesn’t know the process has multiple threads — it sees just one process.

This model was used in early Java “green threads,” in GNU Pth, and others. It was standard in the early 1990s but fatal drawbacks nearly drove it extinct:

  • A blocking syscall stops everyone: one user thread blocking in read() freezes all others sharing the kernel thread
  • No multicore use: one kernel thread lives on one core

M:N Model — Go’s Choice

M:N combines the two. M user threads map dynamically to a pool of N kernel threads (typically N = number of CPU cores).

Representative implementations:

  • Go goroutines: the Go runtime has an M:N scheduler, running millions of goroutines over a handful of OS threads
  • Erlang/Elixir: the BEAM VM implements its own scheduler
  • Old Solaris (Solaris 2–8): implemented POSIX pthreads M:N, but Solaris 9 switched to 1:1 for complexity reasons

Theoretical grounding — Anderson et al.’s SOSP 1991 Scheduler Activations paper tackles “what kernel support is needed so a user-level thread library can efficiently implement M:N.” The key is that on blocking syscalls the kernel should wake the user scheduler so it can assign another user thread to another kernel thread.

The Go runtime implements a similar idea. When a goroutine tries a blocking syscall, the runtime detects it and either migrates that goroutine to another kernel thread or spawns a new one. So one net.Listen blocking doesn’t affect other goroutines.

From a Game Development Perspective

The threads used in Unity and Unreal are 1:1 at the C++/C# layer. new Thread() or std::thread creates kernel threads directly.

However, engines’ internal Job systems or Task graphs are effectively M:N schedulers. The programmer can queue thousands of “Jobs,” yet they run on the engine’s handful of worker threads. This ties directly into the Unity Job System design we address in detail in Part 13 (Lock-free and Structural Solutions).


Part 5: Three-OS Thread APIs Compared

Linux — pthreads

#include <pthread.h>
#include <stdio.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;

    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

POSIX-standard API. Internally it calls the clone() syscall. Officially it’s “pthread,” but Linux man pages are really documenting NPTL (the glibc implementation).

Windows — CreateThread / _beginthreadex

#include <windows.h>
#include <process.h>

unsigned __stdcall worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return 0;
}

int main() {
    HANDLE t1, t2;
    int id1 = 1, id2 = 2;

    t1 = (HANDLE)_beginthreadex(NULL, 0, worker, &id1, 0, NULL);
    t2 = (HANDLE)_beginthreadex(NULL, 0, worker, &id2, 0, NULL);

    WaitForSingleObject(t1, INFINITE);
    WaitForSingleObject(t2, INFINITE);
    CloseHandle(t1);
    CloseHandle(t2);
    return 0;
}

Why not CreateThread? CreateThread skips CRT (C Runtime Library) initialization — so thread-local state like errno and strtok isn’t set up, causing subtle bugs. _beginthreadex initializes the CRT, so for C/C++ code you should use it.

macOS — pthreads + libdispatch

/* POSIX style — same as Linux */
#include <pthread.h>
/* ... */

/* libdispatch (GCD) style — Apple's preferred */
#include <dispatch/dispatch.h>

int main() {
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
        printf("Running in background\n");
        dispatch_async(dispatch_get_main_queue(), ^{
            printf("Back to main thread\n");
        });
    });

    dispatch_main();
    return 0;
}

macOS supports pthreads too, but Apple recommends GCD (Grand Central Dispatch). We covered the rationale in Part 7 — no manual thread lifetime management, QoS-based routing to P/E cores, a predictable queue abstraction.

C# — Language-Level Abstraction

C# works on all three OSes. The .NET runtime (CLR or CoreCLR) hides OS differences.

using System;
using System.Threading;
using System.Threading.Tasks;

// 1) Most primitive — rarely used today
Thread t = new Thread(() => Console.WriteLine("Hello"));
t.Start();
t.Join();

// 2) ThreadPool — thread reuse
ThreadPool.QueueUserWorkItem(_ => Console.WriteLine("Hello"));

// 3) Task / async-await — modern default
await Task.Run(() => HeavyComputation());

// 4) Parallel — data parallelism
Parallel.For(0, 100, i => ProcessItem(i));

Underneath:

  • Linux: libcoreclr uses pthread_create()
  • Windows: uses CreateThread()
  • macOS: uses pthread_create() (doesn’t use GCD directly)

Unity’s quirk: Unity discourages raw Thread usage and nudges you toward the Job System, UniTask, and coroutines instead, because most UnityEngine APIs crash or throw if called outside the main thread. (See Part 13 for details.)


Part 6: Context Switching — Why It’s Expensive

What a Context Switch Is

To run multiple threads alternately on one CPU core, the OS saves the current thread’s state and restores the next thread’s state. That’s context switching.

What must be saved:

  • CPU registers: RAX, RBX, …, RIP (program counter), RSP (stack pointer), flags
  • Floating-point/SIMD registers: XMM, YMM, ZMM (a few KB with AVX-512)
  • MMU state: on a process switch, the page table pointer (CR3 on x86) changes too

The “Hidden Cost” of Context Switching

Saving and restoring registers is just the tip of the iceberg. The real cost is indirect effects.

Diagram — context switch, direct vs. hidden costs:

Direct cost (visible): saving ~30 registers (hundreds of bytes) plus SIMD registers (several KB with AVX-512), entering the kernel, running the scheduler, returning, and swapping the MMU pointer on a process switch — typically 1–10 microseconds, hardware-dependent.

Hidden cost (invisible): the TLB flush invalidates the address-translation cache (hundreds to thousands of cycles to rebuild), thread A’s L1/L2 cache lines are evicted by thread B, branch-predictor history gets mingled, and prefetcher state resets — in total, tens of microseconds to milliseconds of post-switch slowdown. Spawn too many threads and the CPU spends its time context-switching instead of doing real work. Mitigations: (1) keep the thread count near the core count, (2) queue fine-grained Jobs/Tasks onto a fixed worker pool.

TLBs and Process-to-Process Switches

The TLB (Translation Lookaside Buffer) is a small CPU cache that stores “virtual address → physical address” lookups. Typical L1 TLBs have 64–128 entries.

When a process switch occurs, the CR3 register (the page-table base) changes, and the TLB is fully flushed (absent PCID/ASID optimizations). Every subsequent memory access then has to walk the page tables again.

Thread-to-thread switches are cheaper — threads share an address space, so CR3 doesn’t change and the TLB isn’t flushed. That’s one concrete reason “threads are lighter than processes.”

Measuring It

On Linux you can measure with perf stat:

$ perf stat -e context-switches,cpu-migrations,cache-misses -p <PID> sleep 10

Performance counter stats for process id '1234':

     12,345      context-switches
        567      cpu-migrations
 10,234,567      cache-misses

On macOS, Instruments’s System Trace template lets you observe thread scheduling and context switches at microsecond resolution.

On Windows, Xperf and Windows Performance Analyzer fill the same role.
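
If you want a rough number of your own, a classic trick is a pipe ping-pong between two processes: every round trip forces at least two switches. A minimal sketch (the result also includes pipe/syscall overhead, so read it as an upper bound on the per-switch cost):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int ping[2], pong[2];
    pipe(ping);
    pipe(pong);
    const int N = 100000;
    char byte = 'x';

    pid_t pid = fork();
    if (pid == 0) {                          /* child: echo every byte back */
        for (int i = 0; i < N; i++) {
            read(ping[0], &byte, 1);
            write(pong[1], &byte, 1);
        }
        _exit(0);
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < N; i++) {            /* parent: send, then wait for the echo */
        write(ping[1], &byte, 1);
        read(pong[0], &byte, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    waitpid(pid, NULL, 0);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("~%.0f ns per switch (upper bound)\n", ns / (2.0 * N));
    return 0;
}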

A LaMarca & Ladner Observation

As LaMarca & Ladner’s 1996 paper “The Influence of Caches on the Performance of Heaps” argues, theoretical asymptotic complexity alone cannot predict real performance. By the same token, the naive expectation that “more threads = faster” breaks down because of cache and TLB costs.

The rule “optimal thread count = core count” comes from this observation. Beyond that, context switching eats the gains.


Part 7: Game Engine Execution Models

Now we link theory to game engines.

Unity — The Hard Main-Thread Constraint

If you’ve used Unity, you’ve likely seen the warning “this API can only be called on the main thread.” Most Unity Engine APIs — Transform.position, GameObject.Instantiate(), Renderer.sharedMaterial, etc. — are main-thread only.

Why?

The Unity Engine is written in C++ and its internal data structures have no locks. Unity’s team assumed “all engine calls come from the main thread” by design, eliminating lock-acquisition overhead.

This is a deliberate trade-off:

  • ✅ Engine calls are very fast (no locks)
  • ❌ Multithreaded use is awkward

Unity’s answer: Job System + Burst + Native Containers. Leave the main thread alone and provide a separate layer that parallelizes only the data processing. (Details in Part 13.)

Unreal Engine — The Task Graph

Unreal Engine uses a Task Graph system. “Tasks” submitted by game code form a dependency DAG that the engine spreads across a worker thread pool.

Unreal’s worker threads:

  • Game Thread: game logic (Unity’s main thread equivalent)
  • Render Thread: build rendering commands
  • RHI Thread: GPU driver calls
  • Worker Threads: general-purpose work

Tasks specify their target via ENamedThreads. Examples: ENamedThreads::GameThread, ENamedThreads::AnyBackgroundHiPriTask.

Fiber — Naughty Dog’s Approach

Christian Gyrling’s GDC 2015 talk “Parallelizing the Naughty Dog Engine Using Fibers” is famous for its fiber-based engine design.

A fiber is a cooperative user-level thread. The OS isn’t involved; the application switches them itself. If kernel threads are workers, fibers are the tasks the workers are carrying at the moment.

  • Fiber creation cost: extremely cheap (nanoseconds)
  • Fiber switch: save/restore registers only, no kernel involvement
  • Can dispatch thousands

Naughty Dog’s The Last of Us Part II used this system to reliably exploit the PS4’s 7 cores. Fibers can be viewed as one form of the M:N model (fibers = user threads, kernel threads = workers).

Windows fiber API: CreateFiber, SwitchToFiber. On macOS/Linux you’d use ucontext.h’s makecontext/swapcontext (legacy, discouraged) or libraries like Boost.Context and libco.
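
A minimal sketch of cooperative switching with that legacy ucontext API (works on Linux; the calls are deprecated on macOS): two contexts hand control back and forth via swapcontext(), and the kernel never schedules anything.

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_fn(void) {
    printf("fiber: step 1\n");
    swapcontext(&fiber_ctx, &main_ctx);   /* yield back to main */
    printf("fiber: step 2\n");
    /* returning ends the fiber; uc_link routes control back to main_ctx */
}

int main(void) {
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp   = malloc(64 * 1024);   /* the fiber gets its own stack */
    fiber_ctx.uc_stack.ss_size = 64 * 1024;
    fiber_ctx.uc_link          = &main_ctx;
    makecontext(&fiber_ctx, fiber_fn, 0);

    printf("main : run fiber\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run until the fiber yields  */
    printf("main : fiber yielded, resume\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run until the fiber returns */
    printf("main : done\n");

    free(fiber_ctx.uc_stack.ss_sp);
    return 0;
}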

Engine Execution Models Compared

Diagram — execution models of major engines:

  • Unity: a fixed main thread owns most engine APIs (Transform, GameObject, etc.); the Job System is a separate layer of worker threads running IJob/IJobParallelFor with Burst and Native Containers — keep the engine single-threaded, parallelize the data
  • Unreal Engine: several named threads (Game, Render, RHI, Audio) plus a worker thread pool; the Task Graph forms a dependency DAG and ENamedThreads picks where each task runs
  • Fiber (Naughty Dog): worker threads (roughly one per core) pull from a pool of thousands of fibers; a job runs on a fiber, switches are cooperative with no kernel involvement, and a fiber that has to wait is simply swapped out

Part 8: Hands-On — How Are My Threads Actually Running?

Once you know the theory, it’s time to actually look. All three OSes ship rich tools for observing processes and threads.

Linux — /proc, ps, top

On Linux everything is exposed in the /proc virtual filesystem.

# List threads of a specific process
$ ls /proc/<PID>/task/
1234  1235  1236  ...

# State of each thread
$ cat /proc/1234/task/1234/status
Name:   myapp
State:  R (running)
Tgid:   1234
Pid:    1234
Threads: 8

# Address-space mappings
$ cat /proc/1234/maps
00400000-00452000 r-xp 00000000 08:01 12345 /usr/bin/myapp
00651000-00652000 r--p 00051000 08:01 12345 /usr/bin/myapp
7f1234000000-7f1234021000 r-xp 00000000 08:01 54321 /lib/x86_64-linux-gnu/libc.so.6
...

top -H shows per-thread CPU usage.

macOS — Activity Monitor, ps, Instruments

Activity Monitor is the GUI tool, but more precise data lives in CLI tools.

# Show thread count for a process
$ ps -M <PID>

# Detailed info
$ sample <PID> 5 -mayDie

The most powerful option is Instruments’ System Trace template. It shows a per-P/E-core execution timeline, context-switch events, and blocking causes. It’s especially useful on Apple Silicon — visualizing which threads ran on P-cores and which were pushed to E-cores.

Windows — Process Explorer, WPA

Process Explorer (Sysinternals) is a beefed-up Task Manager:

  • Visualized process tree
  • Thread list per process with stack traces
  • Handles, DLLs, memory details

Windows Performance Analyzer (WPA) is the Instruments equivalent, analyzing ETW events collected via Xperf.

Threads in C# — Code Example

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ThreadInspector {
    static void Main() {
        Console.WriteLine($"Current process ID: {Process.GetCurrentProcess().Id}");
        Console.WriteLine($"Managed thread ID: {Thread.CurrentThread.ManagedThreadId}");
        Console.WriteLine($"CPU core count: {Environment.ProcessorCount}");

        // Measure thread-creation cost
        var sw = Stopwatch.StartNew();
        var threads = new Thread[100];
        for (int i = 0; i < 100; i++) {
            threads[i] = new Thread(() => Thread.Sleep(1));
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        sw.Stop();
        Console.WriteLine($"100 thread create+join: {sw.ElapsedMilliseconds}ms");

        // ThreadPool.QueueUserWorkItem is much faster
        sw.Restart();
        var countdown = new CountdownEvent(100);
        for (int i = 0; i < 100; i++) {
            ThreadPool.QueueUserWorkItem(_ => {
                Thread.Sleep(1);
                countdown.Signal();
            });
        }
        countdown.Wait();
        sw.Stop();
        Console.WriteLine($"100 ThreadPool items: {sw.ElapsedMilliseconds}ms");
    }
}

Run output (approximate on my machine):

Current process ID: 12345
Managed thread ID: 1
CPU core count: 8
100 thread create+join: 85ms
100 ThreadPool items: 8ms

10× difference. That’s the practical case for thread pools. .NET’s ThreadPool, Java’s ExecutorService, C++’s std::async all share the idea — amortize creation cost by reusing threads.


Wrap-up

What this post covered:

Processes:

  • PCB (task_struct, EPROCESS, proc+task) — how the OS tracks a process
  • Address space layout: Text, Data, BSS, Heap, Stack, Kernel
  • State transitions: New, Ready, Running, Waiting, Terminated

Process creation:

  • Unix fork() + exec() — two steps, Copy-on-Write keeps it fast
  • Windows CreateProcess() — one step, many parameters
  • macOS posix_spawn() — iOS-compatible, more efficient
  • COW in fork() relies on hardware MMU support

Threads:

  • Process vs thread: whether the address space is shared is the key
  • Shared: text, data, heap, file descriptors
  • Private: stack, registers, TLS
  • Linux’s peculiar philosophy: same struct for process and thread (clone())

Thread mapping models:

  • 1:1 (Linux NPTL, Windows): standard, true parallelism
  • N:1 (old green threads): nearly obsolete
  • M:N (Go goroutines, Erlang): millions of concurrent threads, complex runtime

Context switching:

  • Direct cost: register save/restore ~1–10μs
  • Hidden cost: TLB flush, cache pollution, branch predictor pollution
  • Process switches are more expensive than thread switches (CR3 change)
  • “Thread count = core count” rule

Game engine execution models:

  • Unity: main thread constraint + Job System (data parallelism)
  • Unreal: multiple named threads + Task Graph
  • Naughty Dog engine: fiber-based cooperative scheduling

The next post is Part 9 Scheduling — when several threads are all Ready, who does the OS hand the CPU to? We look at Linux’s CFS → EEVDF, Windows’ priority boost, and macOS’s QoS-based scheduling. We also cover the 16.67 ms game frame budget and the priority-inversion problem.


References

Textbooks

  • Silberschatz, Galvin, Gagne — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.3 (Processes), Ch.4 (Threads)
  • Bovet, Cesati — Understanding the Linux Kernel, 3rd ed., O’Reilly, 2005 — task_struct and process management Ch.3
  • Mauerer — Professional Linux Kernel Architecture, Wrox, 2008 — modern Linux kernel internals
  • Russinovich, Solomon, Ionescu — Windows Internals, 7th ed., Microsoft Press, 2017 — EPROCESS/ETHREAD details
  • Singh — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — XNU’s task/proc dual structure
  • Butenhof — Programming with POSIX Threads, Addison-Wesley, 1997 — the classic pthreads reference
  • Stevens, Rago — Advanced Programming in the UNIX Environment, 3rd ed., Addison-Wesley, 2013 — fork/exec in practice
  • Gregory — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8 multiprocessor engine design

Papers

  • Anderson, Bershad, Lazowska, Levy — “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism”, SOSP 1991 — DOI — theoretical basis for the M:N model
  • Mogul, Borg — “The Effect of Context Switches on Cache Performance”, ASPLOS 1991 — measuring the hidden cost of context switches
  • Engelschall — “Portable Multithreading: The Signal Stack Trick for User-Space Thread Creation”, USENIX 2000 — implementing user-level threads
  • Kleiman, Smaalders — “The LWP Framework: Building and Debugging Mach Tasks and Threads”, Mach Workshop 1990 — Mach’s thread model

Blogs / Articles

  • Raymond Chen — The Old New Thing — internals of Win32 CreateProcess
  • Linus Torvalds — early comp.os.minix thread-related discussions (1992)
  • Dmitry Vyukov — 1024cores.net — lock-free concurrency reference (including Go scheduler internals)
  • Howard Oakley — The Eclectic Light Company — macOS thread-observability techniques

Tools

  • Linux: ps, top, htop, strace, perf, ftrace
  • macOS: Activity Monitor, ps, sample, Instruments (System Trace, Time Profiler)
  • Windows: Task Manager, Process Explorer, WPA, PerfView
  • Cross-platform: Tracy Profiler — great for embedding in games