CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
TL;DR — Key Takeaways
  • A process is "an isolated address space plus a bundle of resources," and a thread is "a flow of execution inside a process." Threads share code, heap, and globals but keep the stack and registers private
  • Unix creates processes in two steps — fork() clones the parent, then exec() overwrites the clone with a new program — while Windows' CreateProcess() builds a new process in one shot. The clone sounds expensive, but Copy-on-Write makes it fast in practice
  • Thread models split into 1:1 (Linux NPTL, Windows), N:1 (green threads), M:N (Go goroutines, Erlang) — each trades performance against implementation complexity differently
  • A context switch doesn't just save/restore registers; it also causes TLB flushes and cache pollution. Modern game engines therefore move toward "break work into Jobs/TaskGraph/Fibers and distribute over cores" rather than "spawn more threads"

Introduction: From the Map to the Body

The previous post surveyed the lineage and skeleton of the three operating systems: Linux’s monolithic kernel, Windows NT’s hybrid design, and macOS’s XNU with its Mach + BSD dual structure. If that was the map, this post is the body of the journey.

Let’s bring Stage 2’s key question back.

“When two threads use the same variable, why does the program crash only sometimes?”

To answer this, we first need to know “what a thread is” precisely. And to understand threads, we need to first understand their parent concept, the process. The distinction between processes and threads, how the two share and separate memory, and how the OS abstracts all of this — these are the starting points of every concurrency problem.

What we cover in this post:

  • Processes: PCBs and address-space layouts. Linux’s task_struct, Windows’ EPROCESS, macOS’s proc/task
  • Process creation: Unix’s two-step fork()+exec() model, Windows’ single-call CreateProcess(), and Copy-on-Write
  • Threads: why processes alone aren’t enough, TCBs, shared vs. private regions, TLS
  • Thread mapping models: 1:1, N:1, M:N — why Go’s goroutines are so cheap
  • Context switching: the real cost of registers, TLBs, and caches
  • Game engine execution models: Unity’s main thread, Unreal’s TaskGraph, Naughty Dog’s fibers

We keep the game-dev lens throughout, but this post carries more foundational theory than previous ones. The next post (scheduling) and the one after (synchronization) are built on top of it.


Part 1: Processes — The Execution Unit the OS Sees

What Is a Process?

Start with the textbook definition. A process is a program in execution. The .exe file on disk or the Mach-O binary is a program; the instance of it loaded into memory and running on the CPU is a process.

A process owns:

  1. A unique address space — memory isolated from other processes
  2. Execution state — CPU register values, program counter
  3. An open-files table — the list of file descriptors currently in use
  4. Ownership info — UID, GID, permissions
  5. Parent-child relationships — who created whom (the process tree)

The OS manages all of this via a single struct. That’s the PCB (Process Control Block), also called the process descriptor.

The PCB in Practice — Per-OS Structs

Linux — task_struct

In the Linux kernel, processes (and threads) are represented by struct task_struct. It’s defined in include/linux/sched.h and is a huge struct with hundreds of fields.

/* Linux kernel task_struct (kernel 6.x, heavily simplified) */
struct task_struct {
    /* State */
    unsigned int           __state;          /* TASK_RUNNING etc. */

    /* Identifiers */
    pid_t                  pid;              /* process id */
    pid_t                  tgid;             /* thread group id */
    struct task_struct    *parent;           /* parent process */
    struct list_head       children;         /* child list */

    /* Memory */
    struct mm_struct      *mm;               /* address space */

    /* Files */
    struct files_struct   *files;            /* open files table */

    /* Scheduling */
    int                    prio;
    struct sched_entity    se;               /* CFS scheduling entity */

    /* signals, resource limits, and hundreds more... */
};

The real struct is over 700 lines. Crucially, in Linux a process and a thread share the same struct. We come back to this peculiarity later.
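
You can see this unification from user space: getpid() returns the thread-group id (what we usually call the PID), while every thread also has its own task id. A minimal sketch of that, assuming glibc 2.30+ for gettid() (older systems need syscall(SYS_gettid)); build with -pthread:

/* Sketch: every Linux thread is a task with its own tid; all threads in a
 * process share one tgid, which is what getpid() reports. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker: pid (tgid) = %d, tid = %d\n", getpid(), gettid());
    return NULL;
}

int main(void) {
    printf("main  : pid (tgid) = %d, tid = %d\n", getpid(), gettid());
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}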

Windows — EPROCESS, KPROCESS

Windows NT splits across two layers:

  • KPROCESS (Kernel Process Block) — minimal scheduling-related info
  • EPROCESS (Executive Process Block) — wraps KPROCESS and adds more
/* Conceptual pseudocode — see WinDbg or leaked NT sources for the real thing */
typedef struct _EPROCESS {
    KPROCESS Pcb;                    /* kernel process block (inherited) */
    HANDLE UniqueProcessId;          /* PID */
    LIST_ENTRY ActiveProcessLinks;   /* global process list */
    PVOID SectionBaseAddress;        /* image load address */
    PVOID Token;                     /* security token */
    /* ... */
} EPROCESS;

macOS — proc + task

macOS’s dual structure shows up here too. The BSD layer holds the classic Unix struct proc; the Mach layer holds struct task.

/* BSD side — bsd/sys/proc_internal.h */
struct proc {
    pid_t                  p_pid;           /* POSIX process ID */
    struct proc           *p_pptr;          /* parent */
    struct task           *task;            /* link to Mach task */
    /* ... */
};

/* Mach side — osfmk/kern/task.h */
struct task {
    queue_head_t           threads;         /* threads belonging to this task */
    vm_map_t               map;             /* address space */
    ipc_space_t            itk_space;       /* Mach port space */
    /* ... */
};

So when fork() creates a process on macOS, a BSD proc and a Mach task are created as a pair. Unix tools (ps, top) look at the proc; Mach-based tools (lldb, Instruments) look at the task.

Process Address-Space Layout

How is a process’s memory laid out? Here’s the classic Unix/Linux 32-bit layout.

Diagram — process address space (conceptual), from the high address (0xFFFFFFFF) down to the low address (0x00400000): kernel space (no direct access from user processes); the stack (call frames, locals — grows downward); an unused gap the stack can grow into; the mmap region where shared libraries (libc, libdl, heap extensions, etc.) are mapped; the heap (malloc/new — grows upward via brk/sbrk); BSS (uninitialized data); initialized data; read-only data (.rodata); and the text segment holding executable machine code. Permissions: writable regions RW, .rodata R, text RX.

A tour of each region (from low address upward):

  • Text (.text): executable machine code. Allowed read + execute only; writes cause a segfault
  • Read-only data (.rodata): string literals ("Hello"), constant arrays. Read-only
  • Data (.data): initialized globals and statics (int x = 42;). The initial values sit in the file
  • BSS (Block Started by Symbol): zero-initialized globals (int x;, static char buf[1024];). The file only records the size; the OS zeroes out the memory at execution — a trick to shrink the binary on disk
  • Heap: dynamic allocations (malloc, new). Grows upward via the brk() syscall
  • Shared library region (mmap): libc.so, libstdc++.so etc. are mapped here via mmap()
  • Stack: call frames, locals, return addresses. Grows downward
  • Kernel space: kernel code and data. User processes have no direct access. On 32-bit Linux it’s the top 1 GB; on x86-64 it’s the top half

Windows uses different section names in PE but the structure is nearly the same (.text, .data, .rdata, .bss).
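
To make the tour above tangible, here’s a quick sketch (variable names are just for illustration) that prints one address per region on Linux or macOS. The exact values change every run because of ASLR; only the relative ordering matters:

#include <stdio.h>
#include <stdlib.h>

const char *ro_msg = "hello";            /* the literal itself lives in .rodata */
int initialized_global = 42;             /* .data                               */
int uninitialized_global;                /* .bss                                */

int main(void) {
    int local = 0;                       /* stack                               */
    int *heap_ptr = malloc(sizeof(int)); /* heap                                */

    printf("text  (main)          : %p\n", (void *)main);
    printf("rodata(string literal): %p\n", (void *)ro_msg);
    printf("data  (initialized)   : %p\n", (void *)&initialized_global);
    printf("bss   (uninitialized) : %p\n", (void *)&uninitialized_global);
    printf("heap  (malloc)        : %p\n", (void *)heap_ptr);
    printf("stack (local)         : %p\n", (void *)&local);

    free(heap_ptr);
    return 0;
}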

Process States

A process moves through multiple states. The standard model from Silberschatz:

Diagram — process state transitions (Silberschatz model): New → Ready (admitted); Ready → Running (scheduler dispatch); Running → Ready (interrupt); Running → Waiting (I/O or event wait); Waiting → Ready (I/O or event completion); Running → Terminated (exit).
  • New: process just created
  • Ready: runnable but waiting for a CPU
  • Running: actually executing on a CPU
  • Waiting (or Blocked): waiting for I/O or an event
  • Terminated: exited

Real OSes have many more states. Linux’s task_struct has TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_TRACED, TASK_DEAD, TASK_WAKEKILL, TASK_WAKING, TASK_PARKED, and more. The letters S, R, D, Z you see in ps are these states.

$ ps aux
USER  PID  %CPU %MEM  COMMAND
root   1   0.0  0.1   /sbin/init           <- S (sleeping)
www    1234 2.1  1.5   nginx: worker        <- R (running)
root   5678 0.0  0.0   [kworker/u8:2]       <- D (uninterruptible sleep)

D state (uninterruptible sleep) matters for game developers too — it means waiting on disk I/O or a driver request, and even kill -9 doesn’t work in this state. A lot of “unresponsive processes” are stuck in D.


Part 2: Process Creation — fork, exec, CreateProcess

Now for how processes are created. This is where the philosophical differences among the three OSes become sharpest.

Unix: fork() + exec() — The Two-Step Model

Unix’s idea is “duplicate the parent, then overwrite.”

#include <unistd.h>
#include <sys/wait.h>

int main() {
    pid_t pid = fork();   /* step 1: clone yourself */

    if (pid == 0) {
        /* child */
        execl("/bin/ls", "ls", "-l", NULL);   /* step 2: overwrite with a new program */
        /* not reached if execl succeeds */
    } else if (pid > 0) {
        /* parent */
        int status;
        waitpid(pid, &status, 0);             /* wait for the child */
    } else {
        perror("fork failed");
    }
    return 0;
}

A single call to fork() returns twice. It returns the child’s PID to the parent and 0 to the child. An odd API.

What fork() does (naive implementation):

  1. Create a new PCB (task_struct)
  2. Copy the parent’s entire address space (text, data, heap, and stack)
  3. Copy open file descriptors too
  4. Assign a new PID to the child
  5. Put the child on the ready queue

Step 2 is the problem. When a process’s address space is hundreds of MB, copying it every time is hugely expensive. And if exec() is called right after fork(), the address space is overwritten anyway — you copied only to throw away.

Copy-on-Write — “Actually Copy Only On Write”

The answer is Copy-on-Write (COW). At fork() time, only the page tables are copied, and the actual memory pages are shared between parent and child — but marked read-only.

When either side tries to write to a page, the hardware raises a page fault, and only then does the OS copy that one page.

Diagram — fork() + Copy-on-Write in practice:

  1. Right after fork(): only the page table is copied (fast); the physical pages stay shared between parent and child, marked read-only
  2. The child tries to write: the CPU raises a page fault and hands control to the OS
  3. The OS copies just the written page, restores read-write access on it, and leaves everything else shared

The upshot: fork() is not “copy” but “share + lazy copy.” The fork() call itself does only page-table-sized work (microseconds); if the child exec()s without writing most pages, the copy cost is near zero; the granularity is one page (typically 4 KB or 16 KB), so a single written byte copies the whole page. Linux uses this path for all task creation, which is why spawning processes is so fast.

COW requires hardware support — the CPU’s MMU (Memory Management Unit) must enforce per-page protection and raise page faults, otherwise the OS has no hook to intervene. Page-level MMU is the foundation for nearly every modern-OS trick (COW, swap, mmap, shared memory).
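
A small sketch to see that copy semantics still hold despite the sharing: the parent fills a buffer, the child writes a single byte after fork(), and the parent’s copy stays untouched — only that one page was ever duplicated. (The 64 MB size is arbitrary; fork() stays fast regardless.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 64 * 1024 * 1024;
    char *buf = malloc(size);
    memset(buf, 'A', size);

    pid_t pid = fork();                 /* copies page tables, not the 64 MB     */
    if (pid == 0) {
        buf[0] = 'B';                   /* page fault → OS copies just this page */
        printf("child : buf[0] = %c\n", buf[0]);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: buf[0] = %c\n", buf[0]);   /* still 'A' — copy semantics hold */
    free(buf);
    return 0;
}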

Windows: CreateProcess() — A Single Call

Windows took a different path. There is no parent-cloning concept; it builds a new process from scratch.

#include <windows.h>

int main() {
    STARTUPINFO si = { sizeof(si) };
    PROCESS_INFORMATION pi;

    BOOL ok = CreateProcess(
        "C:\\Windows\\System32\\notepad.exe",  /* executable */
        NULL,                                   /* command line */
        NULL, NULL,                             /* process/thread security */
        FALSE,                                  /* inherit handles? */
        0,                                      /* creation flags */
        NULL, NULL,                             /* environment, working dir */
        &si, &pi);

    if (ok) {
        WaitForSingleObject(pi.hProcess, INFINITE);
        CloseHandle(pi.hProcess);
        CloseHandle(pi.hThread);
    }
    return 0;
}

Unix’s fork() takes no parameters; CreateProcess() takes ten. That’s the Windows philosophy of “stuff every configurable option for process creation into one function.”

Trade-offs:

| Aspect | Unix fork()+exec() | Windows CreateProcess() |
| --- | --- | --- |
| API complexity | Two steps, each simple | One step, many parameters |
| Process creation cost | Very cheap via COW | Relatively expensive |
| Shell implementation | Natural (fork → set up redirections → exec) | Needs a separate API like ShellExecute |
| Security | Parent handles inherit automatically (error-prone) | Inheritance is explicit |
| Flexibility | Arbitrary code between fork and exec | Only at creation time |

macOS — Unix Inheritance Plus a Few Twists

macOS comes from BSD, so naturally it supports fork() and exec(). But XNU’s internal implementation is slightly distinctive.

When BSD’s fork() is mapped down to Mach, what actually happens is:

  1. Clone the current proc struct
  2. Clone the current task at the Mach level (task_create())
  3. Create an initial thread (thread_create())
  4. Clone the address space too (Mach’s vm_map, COW)

That is, a single BSD fork() call decomposes into several Mach-level operations. This is the practical face of the XNU dual structure.

Also interesting is macOS’s posix_spawn(). A POSIX standard that Apple actively promotes, it performs fork+exec in one call.

posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
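
Fleshed out into a runnable sketch (error handling kept minimal):

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    pid_t pid;
    char *argv[] = { "ls", "-l", NULL };

    /* no file actions, default spawn attributes */
    int err = posix_spawn(&pid, "/bin/ls", NULL, NULL, argv, environ);
    if (err != 0) {
        fprintf(stderr, "posix_spawn failed: %d\n", err);
        return 1;
    }
    waitpid(pid, NULL, 0);   /* the child is an ordinary child — wait as usual */
    return 0;
}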

Why prefer it? Because of iOS. On iOS, fork() is forbidden for security reasons, and only posix_spawn() is allowed. The internal implementation can also be more efficient (it may even skip the COW page-table clone).

Hold on, let’s clarify this

“Why is fork() banned on iOS?”

Three reasons overlap.

  1. Sandbox-escape risk: a forked child inherits its parent’s privileges, and in iOS’s strict app sandbox model this boundary becomes a potential avenue for vulnerabilities
  2. Objective-C runtime state duplication: iOS apps are usually written in Objective-C or Swift, whose runtimes initialize lots of state at startup (threads, GCD queues, IOKit connections, etc.). Post-fork, this state easily falls out of consistency
  3. Memory efficiency: iOS is memory-constrained, and even COW still needs page-table cloning. posix_spawn() can skip even that

On macOS, fork() is still allowed, but Apple recommends posix_spawn() where possible.


Part 3: Threads — Why Processes Alone Aren’t Enough

Limits of Process-Based Concurrency

In 1970s–80s Unix, one process meant one execution flow. To do multiple things at once, you fork()ed multiple processes. A web server would create one process per connection (classic Apache prefork).

Problems with this model:

  1. Process creation cost: cheaper thanks to COW, but still microseconds to milliseconds for page-table cloning, PCB allocation, etc.
  2. Context switch cost: switching between processes also changes the address space, so TLB flushes are required (details below)
  3. IPC cost: since processes have separate address spaces, exchanging data requires heavy machinery like pipes, sockets, or shared memory
  4. Expressing shared state is hard: when multiple flows need to share the same data structure, it gets complicated

By the 1990s a solution was needed, and that was the thread.

Definition of a Thread

A thread is an independent flow of execution within a process. When multiple threads exist in one process, they all share the same address space but can execute simultaneously on CPUs.

What threads share:

  • Text (code): naturally, they execute the same code
  • Heap: memory allocated with malloc
  • Data / BSS: globals and statics
  • Open file descriptors
  • Signal handlers

What threads keep private:

  • Stack: each thread has its own
  • CPU register state: PC, SP, general-purpose regs
  • TLS (Thread-Local Storage): per-thread globals
  • Error state: errno (in POSIX it’s per-thread)

Diagram — memory sharing, processes vs. threads: separate processes each carry their own text, data/BSS, heap, stack, registers, and file descriptors, so nothing is exchanged without IPC (pipes, sockets, shared memory); threads inside one process share text, data/BSS, heap, and file descriptors while keeping per-thread stacks, registers, and TLS — and that directly shared heap/data is the root of races.

Key takeaways from this diagram:

  1. Threads share heap and globals by default — “shared memory” exists naturally
  2. So two threads both doing counter++ on the same int counter creates a race condition
  3. Two processes, by contrast, are naturally isolated because their address spaces are separate

The answer to Stage 2’s key question — “why does the program crash only sometimes when two threads use the same variable?” — is hidden in this diagram. Threads intentionally share memory, so concurrency issues arise, and they need synchronization techniques to manage them. (We cover that thoroughly in Part 10: Synchronization Primitives.)
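
Here is a minimal sketch of that race with pthreads — two threads bump the same global a million times each, and the final count is almost always short of 2,000,000 and different on every run:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;             /* shared: lives in the data segment */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                   /* read-modify-write — not atomic    */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}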

TCB — The Thread Control Block

Just as processes have PCBs, threads have TCBs (Thread Control Blocks). A TCB holds:

  • Thread ID
  • CPU register state (saved context)
  • Thread state (Running, Ready, Waiting)
  • Stack pointer, stack base
  • Scheduling info (priority etc.)
  • Pointer to the owning process

Per-OS implementation:

  • Linux: task_struct — processes and threads use the same struct, distinguished by which fields they share
  • Windows: KTHREAD + ETHREAD
  • macOS: Mach’s struct thread

Linux’s Peculiar Philosophy — “Processes and Threads Are the Same”

Linus Torvalds made a bold decision in the 1990s: “Don’t make processes and threads separate concepts; unify them as a single ‘execution unit.’”

In Linux, instead of fork(), there’s the more general clone() syscall. clone() specifies “what to share with the parent” as a bit flag.

/* Linux clone() — concept */
clone(fn, stack, flags, arg);

/* Example flags: */
CLONE_VM       /* share address space (true → thread, false → process) */
CLONE_FS       /* share filesystem state */
CLONE_FILES    /* share file descriptors */
CLONE_SIGHAND  /* share signal handlers */
CLONE_THREAD   /* same thread group */
/* ... */
  • fork() = clone() with all share flags OFF
  • pthread_create() = clone() with all share flags ON
  • Any combination in between is possible

That’s Linux’s “process and thread are on a continuum” worldview. Android, for example, relies on partially shared process clones in practice (the Zygote process).
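
As a rough, Linux-only illustration of that continuum (simplified — a real pthread_create() passes several more flags), clone() with CLONE_VM creates a child that shares the parent’s address space, so the child’s write is visible to the parent:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared = 0;

static int child_fn(void *arg) {
    (void)arg;
    shared = 42;                     /* same address space as the parent */
    return 0;
}

int main(void) {
    size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);            /* clone() needs a child stack */

    /* the stack grows down on x86/ARM, so pass the top of the allocation */
    pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);

    printf("shared = %d\n", shared);             /* 42 — visible thanks to CLONE_VM */
    free(stack);
    return 0;
}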

TLS — Thread-Local Storage

Sometimes you need variables that look global but are actually independent per thread. That’s TLS.

The canonical example: errno. In POSIX, errno is “the error code of the last syscall,” but it must be per-thread (thread A’s failed read() must not be overwritten by thread B). So errno is implemented as TLS.

TLS declarations by language:

/* C11 */
_Thread_local int counter = 0;

/* GCC/Clang extension */
__thread int counter = 0;
// C++11
thread_local int counter = 0;
// C#
[ThreadStatic]
static int counter;

// Or the more flexible ThreadLocal<T>
static ThreadLocal<int> counter = new ThreadLocal<int>(() => 0);

Practical uses in game development:

  • Logging systems use TLS to store each thread’s name for inclusion in log lines (see the sketch after this list)
  • Rendering assigns per-thread command buffers that are merged later
  • Profilers track the current scope stack per thread
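
A minimal sketch of the logging case — each thread stamps its own name into log lines through a C11 _Thread_local variable (the thread names are made up for the example):

#include <pthread.h>
#include <stdio.h>

static _Thread_local const char *thread_name = "main";   /* one copy per thread */

static void log_msg(const char *msg) {
    printf("[%s] %s\n", thread_name, msg);   /* reads the calling thread's copy */
}

static void *worker(void *arg) {
    thread_name = (const char *)arg;         /* set once, visible only in this thread */
    log_msg("loading assets");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "render");
    pthread_create(&t2, NULL, worker, "audio");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    log_msg("done");                         /* still prints [main] */
    return 0;
}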

Part 4: Thread Models — 1:1, N:1, M:N

A deeper question: when you call pthread_create() or new Thread(), how does the kernel manage that thread?

Why the Question Matters

The unit that actually runs on a CPU is a kernel-level thread (KLT). Only the kernel schedules the CPU.

In contrast, the “thread” your program creates can be just a user-space abstraction. That’s called a user-level thread (ULT).

The mapping between user threads and kernel threads falls into three categories.

Diagram — user threads ↔ kernel threads mapping models:

| Model | Examples | Pros | Cons |
| --- | --- | --- | --- |
| 1:1 (one-to-one) | Linux NPTL, Windows | Simple to implement; true multicore parallelism; uses the kernel scheduler | Thread creation is costly; thousands exhaust kernel resources; context switches are heavy |
| N:1 (many-to-one) | Old green threads, GNU Pth | Extremely cheap creation; supports hundreds of thousands; user-level scheduler freedom | No parallelism (one core only); a blocking syscall stops everyone; rarely used today |
| M:N (many-to-many) | Go, Erlang, old Solaris | Cheap threads plus real parallelism; millions of goroutines | Complex runtime; scheduling fairness issues; harder to debug |

1:1 Model — The Current Linux/Windows Choice

In 1:1, each user-created thread maps to exactly one kernel thread. pthread_create() internally calls the clone() syscall and directly creates a kernel-managed task.

Linux NPTL (Native POSIX Thread Library): Since Linux 2.6, the glibc pthread implementation uses NPTL, a 1:1 model. Before that there was LinuxThreads, a nonstandard 1:1 implementation, which NPTL replaced on POSIX compliance and performance grounds.

Windows: CreateThread() creates a KTHREAD directly in the kernel. 1:1 again.

Pros: if one thread blocks, others keep running. Natural distribution across cores.

Cons: thread creation is relatively expensive; tens of thousands or more strain kernel memory.

N:1 Model — A Legacy

In N:1, multiple user threads map to one kernel thread. The kernel doesn’t know the process has multiple threads — it sees just one process.

This model was used in early Java “green threads,” in GNU Pth, and others. It was standard in the early 1990s but fatal drawbacks nearly drove it extinct:

  • A blocking syscall stops everyone: one user thread blocking in read() freezes all others sharing the kernel thread
  • No multicore use: one kernel thread lives on one core

M:N Model — Go’s Choice

M:N combines the two. M user threads map dynamically to a pool of N kernel threads (typically N = number of CPU cores).

Representative implementations:

  • Go goroutines: the Go runtime has an M:N scheduler, running millions of goroutines over a handful of OS threads
  • Erlang/Elixir: the BEAM VM implements its own scheduler
  • Old Solaris (Solaris 2–8): implemented POSIX pthreads M:N, but Solaris 9 switched to 1:1 for complexity reasons

Theoretical grounding — Anderson et al.’s SOSP 1991 Scheduler Activations paper tackles “what kernel support is needed so a user-level thread library can efficiently implement M:N.” The key is that on blocking syscalls the kernel should wake the user scheduler so it can assign another user thread to another kernel thread.

The Go runtime implements a similar idea. When a goroutine tries a blocking syscall, the runtime detects it and either migrates that goroutine to another kernel thread or spawns a new one. So one net.Listen blocking doesn’t affect other goroutines.

From a Game Development Perspective

The threads used in Unity and Unreal are 1:1 at the C++/C# layer. new Thread() or std::thread creates kernel threads directly.

However, engines’ internal Job systems or Task graphs are effectively M:N schedulers. The programmer can queue thousands of “Jobs,” yet they run on the engine’s handful of worker threads. This ties directly into the Unity Job System design we address in detail in Part 13 (Lock-free and Structural Solutions).


Part 5: Three-OS Thread APIs Compared

Linux — pthreads

#include <pthread.h>
#include <stdio.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;

    pthread_create(&t1, NULL, worker, &id1);
    pthread_create(&t2, NULL, worker, &id2);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

POSIX-standard API. Internally it calls the clone() syscall. Officially it’s “pthread,” but Linux man pages are really documenting NPTL (the glibc implementation).

Windows — CreateThread / _beginthreadex

#include <windows.h>
#include <process.h>

unsigned __stdcall worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return 0;
}

int main() {
    HANDLE t1, t2;
    int id1 = 1, id2 = 2;

    t1 = (HANDLE)_beginthreadex(NULL, 0, worker, &id1, 0, NULL);
    t2 = (HANDLE)_beginthreadex(NULL, 0, worker, &id2, 0, NULL);

    WaitForSingleObject(t1, INFINITE);
    WaitForSingleObject(t2, INFINITE);
    CloseHandle(t1);
    CloseHandle(t2);
    return 0;
}

Why not CreateThread? CreateThread skips CRT (C Runtime Library) initialization — so thread-local state like errno and strtok isn’t set up, causing subtle bugs. _beginthreadex initializes the CRT, so for C/C++ code you should use it.

macOS — pthreads + libdispatch

/* POSIX style — same as Linux */
#include <pthread.h>
/* ... */

/* libdispatch (GCD) style — Apple's preferred */
#include <dispatch/dispatch.h>

int main() {
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
        printf("Running in background\n");
        dispatch_async(dispatch_get_main_queue(), ^{
            printf("Back to main thread\n");
        });
    });

    dispatch_main();
    return 0;
}

macOS supports pthreads too, but Apple recommends GCD (Grand Central Dispatch). We covered the rationale in Part 7 — no manual thread lifetime management, QoS-based routing to P/E cores, a predictable queue abstraction.

C# — Language-Level Abstraction

C# works on all three OSes. The .NET runtime (CLR or CoreCLR) hides OS differences.

using System;
using System.Threading;
using System.Threading.Tasks;

// 1) Most primitive — rarely used today
Thread t = new Thread(() => Console.WriteLine("Hello"));
t.Start();
t.Join();

// 2) ThreadPool — thread reuse
ThreadPool.QueueUserWorkItem(_ => Console.WriteLine("Hello"));

// 3) Task / async-await — modern default
await Task.Run(() => HeavyComputation());

// 4) Parallel — data parallelism
Parallel.For(0, 100, i => ProcessItem(i));

Underneath:

  • Linux: libcoreclr uses pthread_create()
  • Windows: uses CreateThread()
  • macOS: uses pthread_create() (doesn’t use GCD directly)

Unity’s quirk: Unity discourages raw Thread usage and nudges you toward the Job System, UniTask, and coroutines instead, because most UnityEngine APIs crash or throw if called outside the main thread. (See Part 13 for details.)


Part 6: Context Switching — Why It’s Expensive

What a Context Switch Is

To run multiple threads alternately on one CPU core, the OS saves the current thread’s state and restores the next thread’s state. That’s context switching.

What must be saved:

  • CPU registers: RAX, RBX, …, RIP (program counter), RSP (stack pointer), flags
  • Floating-point/SIMD registers: XMM, YMM, ZMM (a few KB with AVX-512)
  • MMU state: on a process switch, the page table pointer (CR3 on x86) changes too

The “Hidden Cost” of Context Switching

Saving and restoring registers is just the tip of the iceberg. The real cost is indirect effects.

Diagram — context switch, direct vs. hidden costs:

Direct cost (visible): saving ~30 registers (hundreds of bytes) plus SIMD registers (several KB with AVX-512), entering the kernel, running the scheduler, returning, and swapping the MMU pointer on a process switch — typically 1–10 microseconds, hardware-dependent.

Hidden cost (invisible): the TLB flush invalidates the address-translation cache (hundreds to thousands of cycles to rebuild), thread A’s L1/L2 cache lines are evicted by thread B, branch-predictor history gets mingled, and prefetcher state resets — in total, tens of microseconds to milliseconds of post-switch slowdown. Spawn too many threads and the CPU spends its time context-switching instead of doing real work. Mitigations: (1) keep the thread count near the core count, (2) queue fine-grained Jobs/Tasks onto a fixed worker pool.

TLBs and Process-to-Process Switches

The TLB (Translation Lookaside Buffer) is a small CPU cache that stores “virtual address → physical address” lookups. Typical L1 TLBs have 64–128 entries.

When a process switch occurs, the CR3 register (the page-table base) changes, and the TLB is fully flushed (absent PCID/ASID optimizations). Every subsequent memory access then has to walk the page tables again.

Thread-to-thread switches are cheaper — threads share an address space, so CR3 doesn’t change and the TLB isn’t flushed. That’s one concrete reason “threads are lighter than processes.”

Measuring It

On Linux you can measure with perf stat:

$ perf stat -e context-switches,cpu-migrations,cache-misses -p <PID> sleep 10

Performance counter stats for process id '1234':

     12,345      context-switches
        567      cpu-migrations
 10,234,567      cache-misses

On macOS, Instruments’s System Trace template lets you observe thread scheduling and context switches at microsecond resolution.

On Windows, Xperf and Windows Performance Analyzer fill the same role.
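
If you want a rough number of your own, a classic trick is a pipe ping-pong between two processes: every round trip forces at least two switches. A minimal sketch (the result also includes pipe/syscall overhead, so read it as an upper bound on the per-switch cost):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int ping[2], pong[2];
    pipe(ping);
    pipe(pong);
    const int N = 100000;
    char byte = 'x';

    pid_t pid = fork();
    if (pid == 0) {                          /* child: echo every byte back */
        for (int i = 0; i < N; i++) {
            read(ping[0], &byte, 1);
            write(pong[1], &byte, 1);
        }
        _exit(0);
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < N; i++) {            /* parent: send, then wait for the echo */
        write(ping[1], &byte, 1);
        read(pong[0], &byte, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    waitpid(pid, NULL, 0);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("~%.0f ns per switch (upper bound)\n", ns / (2.0 * N));
    return 0;
}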

A LaMarca & Ladner Observation

As LaMarca & Ladner’s 1996 paper “The Influence of Caches on the Performance of Heaps” argues, theoretical asymptotic complexity alone cannot predict real performance. By the same token, the naive expectation that “more threads = faster” breaks down because of cache and TLB costs.

The rule “optimal thread count = core count” comes from this observation. Beyond that, context switching eats the gains.


Part 7: Game Engine Execution Models

Now we link theory to game engines.

Unity — The Hard Main-Thread Constraint

If you’ve used Unity, you’ve likely seen the warning “this API can only be called on the main thread.” Most Unity Engine APIs — Transform.position, GameObject.Instantiate(), Renderer.sharedMaterial, etc. — are main-thread only.

Why?

The Unity Engine is written in C++ and its internal data structures have no locks. Unity’s team assumed “all engine calls come from the main thread” by design, eliminating lock-acquisition overhead.

This is a deliberate trade-off:

  • ✅ Engine calls are very fast (no locks)
  • ❌ Multithreaded use is awkward

Unity’s answer: Job System + Burst + Native Containers. Leave the main thread alone and provide a separate layer that parallelizes only the data processing. (Details in Part 13.)

Unreal Engine — The Task Graph

Unreal Engine uses a Task Graph system. “Tasks” submitted by game code form a dependency DAG that the engine spreads across a worker thread pool.

Unreal’s worker threads:

  • Game Thread: game logic (Unity’s main thread equivalent)
  • Render Thread: build rendering commands
  • RHI Thread: GPU driver calls
  • Worker Threads: general-purpose work

Tasks specify their target via ENamedThreads. Examples: ENamedThreads::GameThread, ENamedThreads::AnyBackgroundHiPriTask.

Fiber — Naughty Dog’s Approach

Christian Gyrling’s GDC 2015 talk “Parallelizing the Naughty Dog Engine Using Fibers” is famous for its fiber-based engine design.

A fiber is a cooperative user-level thread. The OS isn’t involved; the application switches them itself. If kernel threads are workers, fibers are the tasks the workers are carrying at the moment.

  • Fiber creation cost: extremely cheap (nanoseconds)
  • Fiber switch: save/restore registers only, no kernel involvement
  • Can dispatch thousands

Naughty Dog’s The Last of Us Part II used this system to reliably exploit the PS4’s 7 cores. Fibers can be viewed as one form of the M:N model (fibers = user threads, kernel threads = workers).

Windows fiber API: CreateFiber, SwitchToFiber. On macOS/Linux you’d use ucontext.h’s makecontext/swapcontext (legacy, discouraged) or libraries like Boost.Context and libco.
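
A minimal sketch of cooperative switching with that legacy ucontext API (works on Linux; the calls are deprecated on macOS): two contexts hand control back and forth via swapcontext(), and the kernel never schedules anything.

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_fn(void) {
    printf("fiber: step 1\n");
    swapcontext(&fiber_ctx, &main_ctx);   /* yield back to main */
    printf("fiber: step 2\n");
    /* returning ends the fiber; uc_link routes control back to main_ctx */
}

int main(void) {
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp   = malloc(64 * 1024);   /* the fiber gets its own stack */
    fiber_ctx.uc_stack.ss_size = 64 * 1024;
    fiber_ctx.uc_link          = &main_ctx;
    makecontext(&fiber_ctx, fiber_fn, 0);

    printf("main : run fiber\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run until the fiber yields  */
    printf("main : fiber yielded, resume\n");
    swapcontext(&main_ctx, &fiber_ctx);   /* run until the fiber returns */
    printf("main : done\n");

    free(fiber_ctx.uc_stack.ss_sp);
    return 0;
}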

Engine Execution Models Compared

Diagram — execution models of major engines:

  • Unity: a fixed main thread owns most engine APIs (Transform, GameObject, etc.); the Job System is a separate layer of worker threads running IJob/IJobParallelFor with Burst and Native Containers — keep the engine single-threaded, parallelize the data
  • Unreal Engine: several named threads (Game, Render, RHI, Audio) plus a worker thread pool; the Task Graph forms a dependency DAG and ENamedThreads picks where each task runs
  • Fiber (Naughty Dog): worker threads (roughly one per core) pull from a pool of thousands of fibers; a job runs on a fiber, switches are cooperative with no kernel involvement, and a fiber that has to wait is simply swapped out

Part 8: Hands-On — How Are My Threads Actually Running?

Once you know the theory, it’s time to actually look. All three OSes ship rich tools for observing processes and threads.

Linux — /proc, ps, top

On Linux everything is exposed in the /proc virtual filesystem.

# List threads of a specific process
$ ls /proc/<PID>/task/
1234  1235  1236  ...

# State of each thread
$ cat /proc/1234/task/1234/status
Name:   myapp
State:  R (running)
Tgid:   1234
Pid:    1234
Threads: 8

# Address-space mappings
$ cat /proc/1234/maps
00400000-00452000 r-xp 00000000 08:01 12345 /usr/bin/myapp
00651000-00652000 r--p 00051000 08:01 12345 /usr/bin/myapp
7f1234000000-7f1234021000 r-xp 00000000 08:01 54321 /lib/x86_64-linux-gnu/libc.so.6
...

top -H shows per-thread CPU usage.

macOS — Activity Monitor, ps, Instruments

Activity Monitor is the GUI tool, but more precise data lives in CLI tools.

# Show thread count for a process
$ ps -M <PID>

# Detailed info
$ sample <PID> 5 -mayDie

The most powerful option is Instruments’ System Trace template. It shows a per-P/E-core execution timeline, context-switch events, and blocking causes. It’s especially useful on Apple Silicon — visualizing which threads ran on P-cores and which were pushed to E-cores.

Windows — Process Explorer, WPA

Process Explorer (Sysinternals) is a beefed-up Task Manager:

  • Visualized process tree
  • Thread list per process with stack traces
  • Handles, DLLs, memory details

Windows Performance Analyzer (WPA) is the Instruments equivalent, analyzing ETW events collected via Xperf.

Threads in C# — Code Example

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class ThreadInspector {
    static void Main() {
        Console.WriteLine($"Current process ID: {Process.GetCurrentProcess().Id}");
        Console.WriteLine($"Managed thread ID: {Thread.CurrentThread.ManagedThreadId}");
        Console.WriteLine($"CPU core count: {Environment.ProcessorCount}");

        // Measure thread-creation cost
        var sw = Stopwatch.StartNew();
        var threads = new Thread[100];
        for (int i = 0; i < 100; i++) {
            threads[i] = new Thread(() => Thread.Sleep(1));
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        sw.Stop();
        Console.WriteLine($"100 thread create+join: {sw.ElapsedMilliseconds}ms");

        // ThreadPool.QueueUserWorkItem is much faster
        sw.Restart();
        var countdown = new CountdownEvent(100);
        for (int i = 0; i < 100; i++) {
            ThreadPool.QueueUserWorkItem(_ => {
                Thread.Sleep(1);
                countdown.Signal();
            });
        }
        countdown.Wait();
        sw.Stop();
        Console.WriteLine($"100 ThreadPool items: {sw.ElapsedMilliseconds}ms");
    }
}

Run output (approximate on my machine):

Current process ID: 12345
Managed thread ID: 1
CPU core count: 8
100 thread create+join: 85ms
100 ThreadPool items: 8ms

10× difference. That’s the practical case for thread pools. .NET’s ThreadPool, Java’s ExecutorService, C++’s std::async all share the idea — amortize creation cost by reusing threads.


Wrap-up

What this post covered:

Processes:

  • PCB (task_struct, EPROCESS, proc+task) — how the OS tracks a process
  • Address space layout: Text, Data, BSS, Heap, Stack, Kernel
  • State transitions: New, Ready, Running, Waiting, Terminated

Process creation:

  • Unix fork() + exec() — two steps, Copy-on-Write keeps it fast
  • Windows CreateProcess() — one step, many parameters
  • macOS posix_spawn() — iOS-compatible, more efficient
  • COW in fork() relies on hardware MMU support

Threads:

  • Process vs thread: whether the address space is shared is the key
  • Shared: text, data, heap, file descriptors
  • Private: stack, registers, TLS
  • Linux’s peculiar philosophy: same struct for process and thread (clone())

Thread mapping models:

  • 1:1 (Linux NPTL, Windows): standard, true parallelism
  • N:1 (old green threads): nearly obsolete
  • M:N (Go goroutines, Erlang): millions of concurrent threads, complex runtime

Context switching:

  • Direct cost: register save/restore ~1–10μs
  • Hidden cost: TLB flush, cache pollution, branch predictor pollution
  • Process switches are more expensive than thread switches (CR3 change)
  • “Thread count = core count” rule

Game engine execution models:

  • Unity: main thread constraint + Job System (data parallelism)
  • Unreal: multiple named threads + Task Graph
  • Naughty Dog engine: fiber-based cooperative scheduling

The next post is Part 9 Scheduling — when several threads are all Ready, who does the OS hand the CPU to? We look at Linux’s CFS → EEVDF, Windows’ priority boost, and macOS’s QoS-based scheduling. We also cover the 16.67 ms game frame budget and the priority-inversion problem.


References

Textbooks

  • Silberschatz, Galvin, Gagne — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.3 (Processes), Ch.4 (Threads)
  • Bovet, Cesati — Understanding the Linux Kernel, 3rd ed., O’Reilly, 2005 — task_struct and process management Ch.3
  • Mauerer — Professional Linux Kernel Architecture, Wrox, 2008 — modern Linux kernel internals
  • Russinovich, Solomon, Ionescu — Windows Internals, 7th ed., Microsoft Press, 2017 — EPROCESS/ETHREAD details
  • Singh — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — XNU’s task/proc dual structure
  • Butenhof — Programming with POSIX Threads, Addison-Wesley, 1997 — the classic pthreads reference
  • Stevens, Rago — Advanced Programming in the UNIX Environment, 3rd ed., Addison-Wesley, 2013 — fork/exec in practice
  • Gregory — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8 multiprocessor engine design

Papers

  • Anderson, Bershad, Lazowska, Levy — “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism”, SOSP 1991 — DOI — theoretical basis for the M:N model
  • Mogul, Borg — “The Effect of Context Switches on Cache Performance”, ASPLOS 1991 — measuring the hidden cost of context switches
  • Engelschall — “Portable Multithreading: The Signal Stack Trick for User-Space Thread Creation”, USENIX 2000 — implementing user-level threads
  • Kleiman, Smaalders — “The LWP Framework: Building and Debugging Mach Tasks and Threads”, Mach Workshop 1990 — Mach’s thread model

Blogs / Articles

  • Raymond Chen — The Old New Thing — internals of Win32 CreateProcess
  • Linus Torvalds — early comp.os.minix thread-related discussions (1992)
  • Dmitry Vyukov — 1024cores.net — lock-free concurrency reference (including Go scheduler internals)
  • Howard Oakley — The Eclectic Light Company — macOS thread-observability techniques

Tools

  • Linux: ps, top, htop, strace, perf, ftrace
  • macOS: Activity Monitor, ps, sample, Instruments (System Trace, Time Profiler)
  • Windows: Task Manager, Process Explorer, WPA, PerfView
  • Cross-platform: Tracy Profiler — great for embedding in games