CS Roadmap Part 9 — Scheduling: Whom Does the OS Give the CPU To?
- CS Roadmap (0) — Why CS Knowledge Matters More Than Ever in the AI Era
- CS Roadmap (1) — Arrays and Linked Lists: Reading the Terrain of Memory
- CS Roadmap (2) — Stack, Queue, Deque: Powerful Abstractions Born from Restriction
- CS Roadmap (3) — Hash Tables: Conditions and Limits of O(1) Lookup
- CS Roadmap (4) — Trees: Order, Balance, and Guaranteeing O(log n)
- CS Roadmap (5) — Graphs: The Network of Relationships, the Science of Paths
- CS Roadmap (6) — Memory Management: Stack & Heap, GC, and the Things That Eat Your Frames
- CS Roadmap (Bonus) — Heaps and Priority Queues: The Economics of Partial Order
- CS Roadmap Part 7 — OS Architecture: The Forking Paths of Unix, NT, and XNU
- CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
- CS Roadmap Part 9 — Scheduling: Whom Does the OS Give the CPU To?
- The scheduler makes two decisions — "whom to give the CPU" and "for how long" — evaluated against Throughput, Latency, Fairness, and Response time
- Linux evolved O(n) → O(1) → CFS (2007) → EEVDF (2024). CFS always picks the thread with the smallest vruntime from an RB-tree; EEVDF adds eligibility and deadline axes to handle latency-sensitive work better
- Windows boosts responsiveness with 32 priority levels plus dynamic boost (foreground window, I/O completion, GUI input). macOS uses 5 QoS classes that simultaneously decide priority, P/E core placement, and power management
- 60fps gives 16.67ms and 120fps gives 8.33ms to finish input → logic → physics → render → present. A single priority inversion is enough to drop a frame
- Unity Job priority, Unreal TaskGraph named threads, and SetThreadPriority / pthread / dispatch_qos are different abstractions sitting on the same OS scheduler
Introduction: The Question of “Who, and How Much”
The previous post showed what processes and threads are and how the OS abstracts them. A natural follow-up question arises.
If 100 threads are ready and there are only 8 cores, whom does the OS give the CPU to? And for how long?
Answering these two questions is the job of the scheduler. The answer determines two things:
- Perceived responsiveness: does the click react within 0.1 seconds, or take a full second?
- Frame stability: does the game hold 60fps, or occasionally exceed 17ms?
This post covers:
- Scheduling basics: Preemptive vs Cooperative, the trade-offs between Throughput, Latency, Fairness, Response
- Classic algorithms: FCFS, SJF, RR, Priority, MLFQ
- Linux: the evolution from O(n) → O(1) → CFS → EEVDF
- Windows: 32 priority levels, dynamic boost, foreground quantum stretch
- macOS: QoS-based scheduling and Apple Silicon P/E core placement
- Game frame budget: what fills 16.67ms and how priority inversion ruins it
- How game engines use priority and affinity: Unity, Unreal, and the OS API level
The core question for Stage 2 — “Why does a program with two threads writing to the same variable only crash sometimes?” — hinges on the word “sometimes”, and that word ultimately traces back to how the scheduler operates. The order and timing in which executions interleave is decided somewhere you cannot see.
Part 1: Why Scheduling Is Necessary
The Illusion of Multitasking
A desktop has a browser, IDE, Slack, Spotify, and Discord open at once. There are 8 cores, while processes and threads typically number in the hundreds. Yet they all appear to run “simultaneously.”
In reality, the OS is swapping threads in and out very rapidly. One thread holds a core for a few milliseconds, then yields to another, then to another. Swap faster than the human perception threshold (~50ms) and it feels concurrent.
The scheduler decides this swapping and answers two questions:
- Whom to give it to — among threads in the Ready state, which one should be placed on a core?
- For how long — once placed, when can it be taken back, or can it be?
Preemptive vs Cooperative
Preemptive: the scheduler can forcibly take a thread off the CPU. Periodic timer interrupts wake the kernel, which decides “who is next.” Almost every modern OS — Linux, Windows, macOS — uses this approach.
Cooperative: a thread runs until it voluntarily yields (yield). 80s Macs, pre-95 Windows, and some current coroutine/Fiber systems work this way. A single thread stuck in an infinite loop freezes the system — one cause of the old “Mac bomb icon.”
Hold on, let’s clarify this. Are goroutines and async/await cooperative?
Go’s goroutines are partially cooperative. Yield points exist only at function calls, channel operations, and GC safe points. Without a function call inside an infinite loop, other goroutines could starve, which is why Go 1.14 introduced asynchronous preemption. async/await is similar — yield points exist only at await. But the underlying threads still run on the OS’s preemptive scheduler, so the two layers are stacked.
Scheduler Evaluation Criteria
A scheduler design must consider several metrics, and they conflict with each other.
| Metric | Meaning | Who likes it |
|---|---|---|
| Throughput | Tasks completed per unit time | Batch processing, build servers |
| Turnaround time | Total elapsed time per job | Compilers, data processing |
| Waiting time | Time spent in the Ready queue | All jobs |
| Response time | Input → first reaction | Desktops, games |
| Fairness | Even resource distribution | Multi-user systems |
| Predictability | Predictable latency | Real-time systems, games |
| Energy | Power efficiency | Mobile, laptops |
Desktop and mobile typically prioritize Response time + Energy. Servers prioritize Throughput + Fairness. Real-time and games prioritize Predictability. This is why no single algorithm is optimal for every environment.
Part 2: Classic Scheduling Algorithms
Before looking at full-fledged OS schedulers, here are five textbook algorithms. Modern schedulers are all combinations or evolutions of these ideas.
FCFS (First-Come, First-Served)
The simplest. Run jobs in arrival order, and once a job starts, never preempt it (non-preemptive).
The problem is the convoy effect. If a 100-second job arrives first, all the 0.1-second jobs behind it wait 100 seconds. Average waiting time explodes.
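The effect is easy to reproduce with numbers. Below is a small sketch (the job lengths are hypothetical) that computes FCFS average waiting time for the same four jobs in two arrival orders:

```python
def fcfs_waits(bursts):
    """FCFS: each job waits for the sum of all bursts ahead of it."""
    waits, clock = [], 0.0
    for burst in bursts:
        waits.append(clock)   # time spent sitting in the ready queue
        clock += burst
    return waits

avg = lambda xs: sum(xs) / len(xs)

long_first  = [100, 0.1, 0.1, 0.1]  # the 100-second job arrives first
short_first = [0.1, 0.1, 0.1, 100]  # same jobs, short ones first

print(avg(fcfs_waits(long_first)))   # ~75.1s: everything convoys behind the long job
print(avg(fcfs_waits(short_first)))  # ~0.15s: the long job running last hurts nobody
```

Same four jobs, same total work — a 500× difference in average waiting time, purely from ordering.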
SJF / SRTF (Shortest Job First / Shortest Remaining Time First)
Run the shortest job first. Mathematically proven to be optimal in average waiting time.
Problem 1: You must know the job length in advance. In practice you can’t, so you estimate from past execution history. Problem 2: starvation — if short jobs keep arriving, long ones may never start.
Round Robin (RR)
Cycle through the ready queue and give each thread a time quantum of CPU. When the quantum expires, send it to the back of the queue and move on.
The quantum size is the key parameter.
- Too large: behaves like FCFS, responsiveness drops
- Too small: context-switch overhead eats up actual work time
Typical values are 10~100ms. Linux sizes slices dynamically (CFS/EEVDF); Windows uses a quantum of about 2 clock intervals on desktops and 12 on servers (details in Part 4).
The diagram below compares FCFS and RR for the same workload (A=8ms, B=4ms, C=2ms) arriving simultaneously.
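The same comparison can be reproduced with a tiny round-robin simulator (a sketch, assuming all jobs arrive at t=0 and a 2ms quantum):

```python
from collections import deque

def rr_completion_times(bursts, quantum):
    """Round Robin: run each job for up to `quantum`, requeue if unfinished."""
    queue = deque(range(len(bursts)))
    remaining = list(bursts)
    done = [0] * len(bursts)
    t = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)   # quantum expired → back of the queue
        else:
            done[i] = t       # job finished at time t
    return done

# A=8ms, B=4ms, C=2ms
print(rr_completion_times([8, 4, 2], quantum=2))    # [14, 10, 6]
# A huge quantum degenerates into FCFS:
print(rr_completion_times([8, 4, 2], quantum=100))  # [8, 12, 14]
# RR avg turnaround (14+10+6)/3 = 10ms vs FCFS (8+12+14)/3 ≈ 11.3ms
```

RR finishes B and C far earlier at the cost of delaying A — exactly the responsiveness-for-throughput trade the quantum size controls.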
Priority Scheduling
Assign each thread a priority and run the highest-priority one first. Within the same priority, use RR.
The problem is again starvation. Low-priority threads may never run. One remedy is aging, gradually raising the priority of threads that have waited a long time.
Another deeper problem is priority inversion. If a low-priority thread holds a lock that a high-priority thread is waiting on, while a medium-priority thread keeps preempting the low one, the result is a paradox — the high one is blocked by the medium one. The next post (synchronization) takes this on directly.
MLFQ (Multi-Level Feedback Queue)
Several queues are arranged by priority, and threads move between queues based on observed behavior.
The basic rules are:
- New threads enter the highest queue
- Using up the time quantum drops them down one level
- Yielding to I/O before quantum expiry keeps them at the same level or moves them up
The interesting consequence:
- I/O-bound work (interactive GUI, game input): short bursts then yield → stays in higher queue → fast response
- CPU-bound work (compilers, encoders): long bursts → drops to lower queues → does not interfere with responsive work
The core idea is classifying jobs by behavior even when the algorithm has no idea what kind of work they are. Windows’ dynamic boost, macOS’ QoS adjustments, and Linux’s sleeper bonus are all variants of the same idea.
MLFQ was used directly by 4.3BSD, Solaris, and Windows NT. Modern OSes ostensibly use different algorithms (e.g., CFS), but their internal heuristics resemble MLFQ.
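The demotion rule alone already separates the two behavior classes. A minimal sketch (assumptions: two levels, a 10ms quantum, and a thread described only by its CPU-burst lengths):

```python
def mlfq_final_level(cpu_bursts, quantum=10, levels=2):
    """Where does a thread end up? Burning the full quantum demotes it one
    level; yielding for I/O before the quantum expires keeps its level."""
    level = 0                         # 0 = highest-priority queue
    for burst in cpu_bursts:
        if burst >= quantum:          # ran until the timer fired → CPU-bound
            level = min(level + 1, levels - 1)
        # else: blocked for I/O early → interactive, stays up top
    return level

print(mlfq_final_level([2, 1, 3]))     # 0 — GUI-like thread keeps fast response
print(mlfq_final_level([10, 10, 10]))  # 1 — compiler-like thread sinks down
```

No job ever declares what it is; the queue level is inferred purely from how it behaved.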
Part 3: The Linux Scheduler — O(n) → O(1) → CFS → EEVDF
The evolution of the Linux scheduler is unbeatable as study material. The same problem has been solved four times over, and what was wrong each time, and how it was fixed, is fully public.
O(n) Scheduler — Pre-2.4
The early Linux scheduler walked the entire ready queue for each decision. In an era of few cores and few processes, this was fine. But once servers began running thousands of processes, the scheduler itself became the bottleneck. Adding cores didn’t help due to lock contention.
O(1) Scheduler — 2.6 (Ingo Molnár, 2003)
Introduced by Ingo Molnár in late 2002.
Core ideas:
- 140 priority queues (real-time 0–99, normal 100–139)
- A pair of active and expired queues per priority
- The next thread is chosen in constant time — just find the highest set bit in a bitmap
It also added an interactive task bonus as a heuristic. Longer sleep raised priority slightly, improving desktop responsiveness. But the heuristic grew increasingly complex, and workloads were found that gamed the bonus calculation, leaving the code a patchwork.
CFS — Completely Fair Scheduler (2.6.23, 2007)
Built again by Ingo Molnár. He credited inspiration to Con Kolivas’s RSDL (Rotating Staircase Deadline) patch.
The core idea was to define “fairness” not as simple rotation but as balancing accumulated execution time. Every thread has a virtual CPU time it should have received — vruntime — and the scheduler always picks the thread with the smallest vruntime.
Virtual runtime (vruntime) is the actual runtime corrected by weight:

\[\text{vruntime} \mathrel{+}= \Delta t \times \frac{w_0}{w}\]

Here $w$ is the thread’s weight (set by nice), and $w_0$ is the reference weight for nice 0 (1024). Lower nice (higher priority) means a larger weight, so vruntime grows more slowly and the thread gets picked more often.
The data structure is a Red-Black Tree keyed on vruntime. The leftmost node (smallest vruntime) is the next runner, and insert / remove / select are all $O(\log n)$. Asymptotically slower than O(1), but in practice n is small enough that it doesn’t matter, and removing the heuristics made the code far cleaner.
The diagram below shows CFS’s core cycle. Pick the thread with the smallest vruntime from the RB-tree, run it, then re-insert it with the updated vruntime.
(Diagram: the leftmost node T0 is picked → runs for Δ → vruntime += Δ × w₀ / w_T0 → re-inserted into the RB-tree at its new position)
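The weight correction is worth feeling with numbers. A sketch of the pick-smallest-vruntime loop (1024 and 335 are the kernel’s weights for nice 0 and nice 5; the dict `min` stands in for the RB-tree’s leftmost lookup):

```python
W0 = 1024  # reference weight for nice 0

def simulate_cfs(weights, ticks, slice_ms=1.0):
    """Each tick, run the thread with the smallest vruntime for one slice."""
    vruntime = {name: 0.0 for name in weights}
    runs = {name: 0 for name in weights}
    for _ in range(ticks):
        name = min(vruntime, key=vruntime.get)         # "leftmost node"
        vruntime[name] += slice_ms * W0 / weights[name]  # weighted growth
        runs[name] += 1
    return runs

# nice 0 (weight 1024) vs nice 5 (weight 335): CPU split ≈ 1024:335 ≈ 3:1
print(simulate_cfs({"nice0": 1024, "nice5": 335}, ticks=100))
```

The higher-weight thread accumulates vruntime three times more slowly, so it gets picked roughly three times as often — fairness measured by accumulated virtual time, not by turn-taking.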
CFS’s key parameters are:
- `sched_latency_ns` — the target period in which every ready thread runs once (default 6ms, scaled up with core count)
- `sched_min_granularity_ns` — the minimum slice for one thread (default 0.75ms)
- `sched_wakeup_granularity_ns` — the vruntime difference a waking thread needs before it may preempt the current one
You can inspect them via sysctl -a | grep sched, and they auto-adjust based on core count.
/* Simplified CFS pick logic */
struct task_struct *pick_next_task_fair(struct rq *rq) {
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se = __pick_first_entity(cfs_rq); /* RB-tree leftmost */
return container_of(se, struct task_struct, se);
}
/* Called every tick — update vruntime and rebalance */
void update_curr(struct cfs_rq *cfs_rq) {
struct sched_entity *curr = cfs_rq->curr;
u64 delta_exec = now - curr->exec_start;
curr->vruntime += calc_delta_fair(delta_exec, curr);
/* If preemption is warranted, resched_curr() */
}
EEVDF — Earliest Eligible Virtual Deadline First (6.6, 2023~2024)
CFS served well for 16 years, but one structural issue remained: expressing latency-sensitive work was hard.
CFS could control how often a thread runs through nice, but not how quickly it must respond. The two concepts were collapsed onto the same axis.
Peter Zijlstra brought EEVDF into mainline starting in 2023, and it became the default scheduler in the 6.6 LTS kernel. The academic background is the 1996 paper by Stoica, Abdel-Wahab, Jeffay, Baruah.
EEVDF’s two axes are:
- Eligibility — has this thread already run its share? If not, it’s eligible
- Virtual Deadline — among eligible threads, pick the one whose deadline comes soonest
Deadline is computed as:
\[\text{deadline} = \text{eligible time} + \frac{\text{request size}}{\text{weight}}\]

A smaller request size (a per-task slice request, discussed upstream under the name latency-nice) yields an earlier virtual deadline, so the thread preempts more often but runs in shorter bursts. In other words, work like a game’s main thread that “doesn’t run that often, but must respond instantly when it does” can finally be expressed directly.
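A minimal model of the two-axis pick (simplifying assumptions: eligibility means vruntime at or below the weight-averaged vruntime, and the deadline formula above):

```python
def pick_eevdf(threads):
    """threads: name -> {"vruntime": v, "request": r, "weight": w}.
    Eligible = not ahead of its fair share; among eligible threads,
    pick the earliest virtual deadline = vruntime + request / weight."""
    avg_v = (sum(t["vruntime"] * t["weight"] for t in threads.values())
             / sum(t["weight"] for t in threads.values()))
    eligible = {n: t for n, t in threads.items() if t["vruntime"] <= avg_v}
    return min(eligible,
               key=lambda n: eligible[n]["vruntime"]
                             + eligible[n]["request"] / eligible[n]["weight"])

threads = {
    "main":   {"vruntime": 10.0, "request": 1.0,  "weight": 1024},  # tiny slice request
    "worker": {"vruntime": 10.0, "request": 10.0, "weight": 1024},  # bulk work
}
print(pick_eevdf(threads))  # main — same fairness, but an earlier deadline
```

Equal vruntime means CFS would treat both threads identically; under EEVDF the smaller request alone wins the pick — exactly the latency axis CFS lacked.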
/* EEVDF: request a short slice independently of nice.
 * Under SCHED_NORMAL the sched_runtime field is reused as the slice request;
 * sched_deadline / sched_period are only meaningful for SCHED_DEADLINE. */
struct sched_attr attr = {
    .size          = sizeof(attr),
    .sched_policy  = SCHED_NORMAL,
    .sched_nice    = 0,
    .sched_runtime = 1 * 1000 * 1000,  /* 1ms slice request → earlier deadline */
};
sched_setattr(pid, &attr, 0);
Even with EEVDF, vruntime-based fairness remains. EEVDF is less a replacement for CFS than a refinement of the selection policy, and external interfaces (nice, cgroup cpu.weight) are mostly unchanged.
Linux’s Other Scheduling Classes
Linux uses not a single algorithm but a layered set of classes. Each class has a priority, and lower classes don’t run while higher ones have work.
| Class | Policy | Use |
|---|---|---|
| stop | (kernel only) | CPU hotplug, RCU, etc. |
| dl | SCHED_DEADLINE | Real-time (period + runtime + deadline guaranteed) |
| rt | SCHED_FIFO, SCHED_RR | Real-time priorities 1–99 |
| fair | SCHED_NORMAL/BATCH/IDLE | General, CFS/EEVDF |
| idle | (when all idle) | swapper |
Why you should not use SCHED_FIFO/RR carelessly in games: misuse can freeze the whole system — a single priority-99 infinite loop renders that core unresponsive from then on. Even genuinely real-time audio threads are safer created through the OS’s higher-level facilities (QoS classes, the platform audio APIs’ realtime thread support) than through direct SCHED_FIFO.
Part 4: The Windows Scheduler — Priority + Boost
The Windows NT scheduler is a 32-level priority system designed by Dave Cutler based on his VMS experience. The core skeleton has been essentially unchanged since NT 3.1 (1993); over time, only heuristics and hardware-adaptive code have been layered on top.
32 Priority Levels
| Level | Meaning |
|---|---|
| 0 | Zero page thread (for zeroing memory) |
| 1~15 | Variable priority (regular processes, subject to dynamic adjustment) |
| 16~31 | Real-time priority (admin needed, no dynamic adjustment) |
Each process has a Priority Class, and within it threads are fine-tuned via Thread Priority.
/* Windows priority = Process Class + Thread Priority offset */
SetPriorityClass(hProcess, NORMAL_PRIORITY_CLASS); /* base 8 */
SetThreadPriority(hThread, THREAD_PRIORITY_NORMAL); /* offset 0 */
/* HIGH_PRIORITY_CLASS = 13, THREAD_PRIORITY_HIGHEST = +2 → effective 15 */
/* REALTIME_PRIORITY_CLASS = 24, ... */
Quantum
Windows’ time quantum is measured as a multiple of the clock interval. The clock interval is typically about 15.6ms by default, or 1ms when an application raises the timer resolution (e.g. via multimedia timers).
- Workstation: 2 clock intervals (default ~30ms but typically shorter after post-boot adjustments)
- Server: 12 clock intervals (long quantum favors throughput)
Furthermore, the quantum is stretched for foreground processes. The thread of the window the user is currently looking at gets longer time, improving responsiveness (Control Panel → System → Advanced → Performance Options → Advanced has a “Programs” / “Background services” toggle that turns this on and off).
Priority Boost — Windows’ Core Heuristic
Threads in the variable priority range (1~15) get temporarily boosted by various events. The boost decreases by 1 each quantum until it returns to the base.
| Event | Boost amount |
|---|---|
| Disk I/O complete | +1 |
| Network / Mailslot | +2 |
| Mouse / Keyboard input | +6 |
| Sound card | +8 |
| GUI thread receives a message | +2 (foreground extra) |
| Semaphore wait release | +1 |
| Mutex/Event/Timer wait release | +1 |
This heuristic is the mechanism behind Windows’ responsiveness. Move the mouse and the GUI thread gets +6; key input gets +6 too. Even when CPU-bound work is running, input remains responsive.
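The decay side is mechanical and easy to sketch (assuming the simplest model: the boost drops by 1 per expired quantum, never below the base):

```python
def priority_over_quanta(base, boost, quanta):
    """Windows-style variable-priority decay after a boost event."""
    return [max(base, base + boost - q) for q in range(quanta)]

# Normal thread (base priority 8) receives keyboard input (+6 boost):
print(priority_over_quanta(base=8, boost=6, quanta=9))
# [14, 13, 12, 11, 10, 9, 8, 8, 8] — snappy right after input, fair soon after
```

The boost buys the GUI thread a few quanta of precedence, then fairness reasserts itself automatically.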
Hold on, let’s clarify this. If GUI thread boost is +6, how does priority get distributed across multiple windows?
Only the focused window’s thread receives the additional foreground boost. The moment Alt-Tab changes the active window, the boost distribution changes too. Turn on the priority column in Process Explorer and click another window — you’ll see the priority of that window’s thread briefly tick up.
The Realtime Priority Trap
Levels 16~31 have no dynamic boost — they always run at that priority. In theory this means “never yields.” Audio, video capture, and some game threads use 16~22.
But using REALTIME_PRIORITY_CLASS (24~31) in regular code is dangerous. A single infinite loop at 24 or above can freeze even the mouse cursor — mouse handling is also just a thread.
NUMA, SMT, Heterogeneous
The modern Windows scheduler considers NUMA nodes, SMT (hyperthreading), and Intel Thread Director (P/E core hints) together. Windows 11’s scheduler consumes Thread Director hints to steer threads between P- and E-cores — what Apple Silicon does purely at the OS level, Intel solves through OS and CPU collaboration.
/* Windows: thread priority adjustment + affinity */
HANDLE h = GetCurrentThread();
SetThreadPriority(h, THREAD_PRIORITY_TIME_CRITICAL); /* +15 */
/* Pin to cores 0 and 1 */
DWORD_PTR mask = 0x3;
SetThreadAffinityMask(h, mask);
/* Windows 10+ : E-core preference hint */
THREAD_POWER_THROTTLING_STATE state = {};
state.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
state.StateMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
SetThreadInformation(h, ThreadPowerThrottling, &state, sizeof(state));
The final THREAD_POWER_THROTTLING_EXECUTION_SPEED is an API that hints to the OS, “this thread is fine running slowly on an E-core.” Apply it to background work and the P-cores stay free for the game’s main thread.
Part 5: The macOS Scheduler — QoS + P/E Cores
macOS runs on a Mach-derived kernel, but what developers see is not raw priorities: the scheduler’s priority bands are topped with the higher-level QoS (Quality of Service) abstraction. Developers almost always express priority through QoS, and the kernel translates it into priority + core placement + power management.
5 QoS Classes
| QoS Class | Meaning | Examples | Mapped priority |
|---|---|---|---|
| User Interactive | Immediate response, what the user is directly seeing | Main thread, animation, input | 47 |
| User Initiated | User-started work the user is waiting for | File open, search | 37 |
| Default | Unspecified | General work | 31 |
| Utility | User doesn’t need the result immediately (progress shown) | Downloads, imports | 20 |
| Background | Invisible to the user | Indexing, backup | 5 |
This QoS value determines:
- CPU priority — the number above
- CPU scheduling latency — User Interactive wakes fast, Background batches work
- I/O priority — disk queue priority
- CPU core placement (Apple Silicon) — User Interactive/Initiated prefer P-cores, Utility/Background prefer E-cores
- Timer coalescing — Background batches timer firings
- GPU priority — affects some graphics workloads
A single line of QoS decides all six at once.
QoS API
/* C/Objective-C — set the current thread's QoS */
pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
/* When creating a GCD queue */
dispatch_queue_t q = dispatch_queue_create_with_target(
"com.example.render",
DISPATCH_QUEUE_SERIAL,
dispatch_get_global_queue(QOS_CLASS_USER_INTERACTIVE, 0));
/* Attach QoS to a dispatch_async */
dispatch_async(q, ^{
/* Runs at User Interactive */
});
// Swift
DispatchQueue.global(qos: .userInteractive).async {
// Quick work to relieve the main thread
}
// Operation API
let op = BlockOperation { /* ... */ }
op.qualityOfService = .userInitiated
queue.addOperation(op)
QoS Inheritance — Priority Inversion Prevention
QoS propagates automatically. If work dispatched on a User Interactive queue does dispatch_sync on another queue, the target queue’s QoS is temporarily boosted to User Interactive. This mechanism is at the heart of macOS’s priority inversion mitigation.
The same applies to locks. os_unfair_lock raises the lock holder’s QoS to the QoS of any waiter — POSIX’s PTHREAD_PRIO_INHERIT done automatically at the OS level.
QoS → priority → P/E core mapping
The diagram below shows how a single line of QoS translates into Mach priority and Apple Silicon’s P/E core placement.
Game Mode (macOS 14+)
Game Mode, added in macOS Sonoma, is a mode the OS auto-enables when a game is fullscreen. Effects:
- More aggressive QoS suppression for background work (Spotlight indexing, Time Machine, etc.)
- Stronger P-core priority for the game process
- Doubled audio/input polling rates for AirPods and PS5 controllers
The idea is similar to iOS’s Sustained Performance API — telling the OS, “this app cannot afford to miss 16.67ms right now,” so system-wide resource allocation adjusts.
Part 6: Game Frame Budget — 16.67ms
Now let’s transfer the OS theory above into a game context. For game developers, scheduling is ultimately about the frame budget.
The Math of the Frame Budget
\[\text{frame budget} = \frac{1000\,\text{ms}}{\text{target FPS}}\]

| Target FPS | Frame budget | Who uses it |
|---|---|---|
| 30 | 33.33ms | Console cinematics, some mobile |
| 60 | 16.67ms | Standard games |
| 90 | 11.11ms | VR minimum |
| 120 | 8.33ms | High-FPS PC, PS5 Performance |
| 144 | 6.94ms | High-refresh monitors |
| 240 | 4.17ms | Competitive FPS, e-sports |
All of the following must finish within this time:
- Input handling — keyboard, mouse, gamepad, touch
- Game Logic — AI, behavior, state updates
- Physics / Collision — one step of discrete simulation
- Animation — bone matrix computation, blending
- Particle / VFX — particle updates
- Render command build — draw call sorting, culling
- GPU submit — command buffer queueing
- Present — back buffer → screen (including VSync wait)
Game engines distribute this across multiple threads. Drawn on a timeline, what happens in one frame looks like this.
One-Frame Pipeline
Most engines stagger Main and Render by one frame. While Main builds the game state for frame N, Render is dispatching GPU work for the state of frame N-1. The two threads don’t touch the same data simultaneously, reducing locks, but input lag increases by one frame.
VR and e-sports titles are extremely sensitive to this trade-off. Features like NVIDIA Reflex and AMD Anti-Lag try to reduce this pipeline depth.
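The cost of the stagger can be put in numbers under a simplified model (assumption: input sampled at the start of a frame becomes visible one present later per pipeline stage):

```python
def input_to_present_ms(fps, pipeline_stages):
    """Simplified input latency: (1 + stages) × frame budget."""
    return (1 + pipeline_stages) * 1000.0 / fps

print(round(input_to_present_ms(60, 0), 2))   # 16.67 — no Main/Render stagger
print(round(input_to_present_ms(60, 1), 2))   # 33.33 — one-frame pipeline
print(round(input_to_present_ms(120, 1), 2))  # 16.67 — doubling fps buys it back
```

This is why high-refresh modes reduce perceived input lag even when the pipeline depth stays the same, and why latency-reduction features attack the `pipeline_stages` term directly.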
Frame Spike Causes — A Scheduling View
When average frametime is 11ms but occasionally spikes to 23ms (a “frame spike”), the cause is often OS scheduling itself, in addition to GC, disk I/O, and syscalls.
- Context switch storm: more threads than cores → frequent swaps, cache pollution slows the main thread
- Priority inversion: main thread waits for a worker’s lock while an unrelated thread keeps preempting the worker
- NUMA miss: a thread migrates to a different NUMA node, and cache/memory latency explodes
- P/E core demotion: when macOS Game Mode isn’t applied, the game’s main thread can briefly land on an E-core, doubling frametime
Mitigations:
- Pin the main and render threads to fixed cores (affinity)
- Cap workers at about `core count - 2` to keep cores free for main and render
- macOS: use `QOS_CLASS_USER_INTERACTIVE`; Windows: `THREAD_PRIORITY_HIGHEST` (avoid TIME_CRITICAL when possible)
- Mark background threads explicitly with BACKGROUND / lower priority — the OS will route them appropriately on P/E setups
A Priority Inversion Scenario
A common shape in games:
time Main (qos=USER_INTERACTIVE) Worker (qos=UTILITY) Other (qos=DEFAULT)
0ms enqueue logic idle running
1ms AI result needed → mutex_lock(M) wait holds M -
2ms (waiting) preempted by Other running ← problem
... 6ms (waiting) preempted by Other running
7ms (waiting) releases M -
7.1ms unblocked → resumes -
Main is stuck from 1ms to 7ms while Worker can’t run either; only Other runs. macOS automatically boosts the M holder (Worker)’s QoS to USER_INTERACTIVE in this case, so Other can no longer preempt Worker. POSIX’s PTHREAD_PRIO_INHERIT and Windows’ ALPC auto-boost are similar solutions. The next post (synchronization) covers this in depth.
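The timeline above can be reproduced with a toy scheduler (a sketch: 1ms ticks, highest effective priority wins, and `pi=True` models priority inheritance — the lock holder temporarily runs at the waiter’s priority):

```python
def simulate(pi, horizon=20):
    """Three threads: Main (prio 3) blocks on a mutex after 1 tick,
    Worker (prio 1) holds it and needs 2 ticks of CPU to release it,
    Other (prio 2) is pure CPU-bound. Returns the tick at which Main
    acquires the mutex, or None if it starves within the horizon."""
    worker_work = 2
    main_blocked = False
    for t in range(horizon):
        if not main_blocked:
            main_blocked = True          # t=0: Main runs, then hits mutex_lock(M)
            continue
        if worker_work == 0:
            return t                     # mutex released → Main (prio 3) runs
        worker_prio = 3 if pi else 1     # inheritance boosts the holder
        if worker_prio > 2:
            worker_work -= 1             # Worker outranks Other, makes progress
        # else: Other (prio 2) preempts Worker every tick — the inversion
    return None

print(simulate(pi=True))    # 3 — Worker inherits prio 3 and finishes quickly
print(simulate(pi=False))   # None — Other starves Worker; Main never unblocks
```

macOS’s QoS donation and POSIX’s PTHREAD_PRIO_INHERIT implement the `pi=True` branch at the OS level.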
Part 7: How Game Engines Use Priority and Affinity
Unity — Job System Priority
Unity’s C# Job System manages an internal worker thread pool (default worker count = ProcessorCount - 1), and is scheduled via JobHandle.
// Unity 2022+
using Unity.Jobs;
using Unity.Collections;
[Unity.Burst.BurstCompile]
struct PhysicsStepJob : IJobParallelFor {
public NativeArray<float3> positions;
public NativeArray<float3> velocities;
public float dt;
public void Execute(int i) {
positions[i] += velocities[i] * dt;
}
}
void Update() {
var job = new PhysicsStepJob {
positions = positionsArray,
velocities = velocitiesArray,
dt = Time.deltaTime
};
JobHandle h = job.Schedule(positionsArray.Length, 64);
    h.Complete(); // Sync with main thread — must finish in this frame
}
Worker count can be controlled via JobsUtility.JobWorkerCount (capped by JobWorkerMaximumCount), and JobHandle.ScheduleBatchedJobs() flushes batched jobs out to the workers. Player Settings → Other Settings → Use job worker count also lets you set this explicitly — on an 8 P-core + 4 E-core machine, lowering workers to 8 lets Main hold a P-core more reliably.
Unity — Application.targetFrameRate, vSyncCount
// Mobile, locked 60fps
QualitySettings.vSyncCount = 0;
Application.targetFrameRate = 60;
// Desktop, follow monitor refresh rate
QualitySettings.vSyncCount = 1;
Application.targetFrameRate = -1;
Unreal — TaskGraph and Named Threads
// Unreal: dispatch work onto a specific thread
ENamedThreads::Type Target = ENamedThreads::GameThread; /* or RenderThread, AnyThread */
FFunctionGraphTask::CreateAndDispatchWhenReady(
[](){ /* runs on GameThread */ },
TStatId(),
nullptr,
Target);
// Parallel worker pool
ParallelFor(NumElements, [&](int32 i) {
Process(i);
}, EParallelForFlags::None);
Unreal forces explicit serialization through named threads like GameThread, RenderThread, and RHIThread. Tasks queued to the worker pool go into a priority queue, and Unreal Insights visualizes where each task ran.
Calling OS APIs Directly
Sometimes you need to set priority directly via OS APIs, bypassing the engine.
// Cross-platform thread priority — common in game engine cores
void SetThreadHighPriority(std::thread& t) {
#if defined(_WIN32)
    SetThreadPriority(t.native_handle(), THREAD_PRIORITY_HIGHEST);
#elif defined(__APPLE__)
    /* Caveat: pthread_set_qos_class_self_np applies to the CALLING thread,
       so invoke this at the top of the target thread's entry function */
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
    /* or thread_policy_set + thread_extended_policy_data_t */
#elif defined(__linux__)
    struct sched_param p = {};
    p.sched_priority = 0;  /* SCHED_NORMAL ignores static priority */
    pthread_setschedparam(t.native_handle(), SCHED_NORMAL, &p);
    /* nice is per thread on Linux — call setpriority(PRIO_PROCESS, gettid(), -5)
       from the target thread itself */
#endif
}
Thread Affinity — Pinning to Cores
// Linux
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set); /* core 0 */
CPU_SET(1, &set);
pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
// Windows
SetThreadAffinityMask(GetCurrentThread(), 0x3);
// macOS — affinity API is deprecated, only hints possible
thread_affinity_policy_data_t policy = { 1 /* tag */ };
thread_policy_set(pthread_mach_thread_np(pthread_self()),
THREAD_AFFINITY_POLICY,
(thread_policy_t)&policy, 1);
Hold on, let’s clarify this. Why doesn’t macOS have hard affinity?
Apple’s stance is consistent — “developers don’t know better than the OS.” The OS factors in heterogeneous P/E cores, power states, thermal limits, and core parking, so an app forcibly pinning a core often loses more than it gains. Instead, THREAD_AFFINITY_POLICY lets you hint “keep these threads in the same cache/affinity group,” and QoS lets you express P/E preference.
Naughty Dog’s Fiber Approach (Revisited)
Part 8 (Processes and Threads) briefly introduced Naughty Dog’s Fiber model. From a scheduling viewpoint, Naughty Dog hardly uses the OS scheduler at all.
- One worker thread per core, pinned with affinity
- Every thread pulls the next fiber from a fiber pool and runs it (cooperative)
- Fiber switching takes tens of nanoseconds (1/100 of an OS context switch’s few microseconds)
- From the OS’s view, it’s effectively “7 threads pinned to 7 cores — don’t wake them”
This is the heart of Christian Gyrling’s GDC 2015 talk. For ordinary games, this is overkill, but for AAA console titles where razor-sharp frame consistency is required, the choice is to bypass the OS and control the schedule directly.
Part 8: Hands-On Observation — Which Threads Run Where
Linux — chrt, nice, perf sched
# Change current shell's nice (smaller value = higher priority)
$ nice -n -5 ./mygame
# Inspect a running process's policy/priority
$ chrt -p $(pidof mygame)
pid 12345's current scheduling policy: SCHED_OTHER
pid 12345's current scheduling priority: 0
# Switch to SCHED_RR (root needed)
$ sudo chrt -r -p 50 $(pidof mygame)
# Trace scheduling events
$ sudo perf sched record -a sleep 10
$ sudo perf sched latency
# Task | Runtime ms | Switches | Avg delay ms | Max delay ms |
# mygame:12345 | 2543.123 | 8421 | 0.045 | 2.103 |
If Max delay exceeds 16ms, that frame likely had a frame spike.
macOS — Activity Monitor, Instruments, sample
The Instruments System Trace template is the most accurate. What it measures:
- Which thread runs on each core (P0~P7, E0~E3)
- Color coded by QoS class
- Context switch events and their reason (preemption, voluntary block, etc.)
- Thread state transitions (run / runnable / waiting / stopped)
# CPU usage per thread
$ top -F -R -o cpu -stats pid,command,cpu,th,state
# Sample call stacks of all threads in a process for 5 seconds at 1ms intervals
$ sample <pid> 5 1 -mayDie
# powermetrics for per-core utilization (P/E split)
$ sudo powermetrics --samplers cpu_power -i 1000
Windows — Process Explorer, WPA, Xperf
What the Threads tab in Process Explorer shows:
- Each thread’s base/dynamic priority columns
- “Stack” button for call stack
- “I/O Priority”, “Memory Priority” columns (Win10+)
Xperf / Windows Performance Recorder:
# 1: Start profile
wpr -start GeneralProfile -filemode
# 2: Run the game, exercise the workload
# 3: Stop → collect ETL
wpr -stop trace.etl
# 4: Analyze with WPA (CPU usage by Thread, Generic Events, etc.)
wpa.exe trace.etl
The two graphs in WPA, “CPU Usage (Sampled)” and “CPU Usage (Precise)”, differ importantly. Sampled is averaged; Precise is based on context-switch events and accurate for frame spike analysis.
The Habit of Measuring
Forming the habit of measuring what the scheduler does instead of guessing is essential. Tracy Profiler embeds in your engine and visualizes every thread’s frame-by-frame activity at nanosecond granularity — both Unity and Unreal have integrated plugins.
// Tracy usage example
#include "Tracy.hpp"
void GameLoop() {
ZoneScoped; /* Auto-instrumented per function */
{
ZoneScopedN("AI Update");
UpdateAI();
}
{
ZoneScopedN("Physics Step");
StepPhysics();
}
}
Tracy also offers macros like LockableBase and FrameMark for synchronization and frame-boundary markers, which makes priority inversion easier to spot visually.
Wrap-Up
This post covered:
Scheduling basics:
- Two decisions: “whom to” + “for how long”
- Preemptive vs Cooperative
- Evaluation criteria: Throughput, Latency, Fairness, Response, Energy
Classic algorithms:
- FCFS — convoy effect
- SJF — optimal average wait time, starvation
- RR — quantum trade-off
- Priority — starvation, foreshadows priority inversion
- MLFQ — automatic priority adjustment by observed behavior
Linux:
- O(n) → O(1) (Ingo Molnár, 2003)
- CFS (2007) — vruntime, RB-tree, completely fair
- EEVDF (2024) — eligibility + virtual deadline, latency-nice added
- Class hierarchy: stop > dl > rt > fair > idle
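As a back-of-envelope illustration of the vruntime rule recapped above (a sketch, not kernel code; vruntime_delta is a made-up helper, while the weights 1024 and 3121 are the kernel’s sched_prio_to_weight values for nice 0 and nice -5):

```cpp
#include <cstdint>

// CFS advances a thread's vruntime more slowly the heavier its weight:
//   delta_vruntime = delta_exec * NICE_0_WEIGHT / weight
constexpr uint64_t NICE_0_WEIGHT = 1024;

uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight) {
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

Over the same 10 ms of real CPU time, a nice 0 thread accrues 10,000,000 ns of vruntime while a nice -5 thread (weight 3121) accrues only about 3.28 ms worth, so the higher-priority thread keeps the smallest vruntime longer and gets picked from the RB-tree more often.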
Windows:
- 32 priority levels (Variable 1–15, Realtime 16–31)
- Foreground quantum stretch
- Dynamic boost: I/O complete +1, Mouse/Keyboard +6, Sound +8
- Realtime has no dynamic boost — dangerous territory
macOS:
- 5 QoS classes (User Interactive ↔ Background)
- A single line decides priority + scheduling latency + I/O priority + P/E core + timer coalescing + GPU priority simultaneously
- QoS inheritance auto-mitigates priority inversion
- Game Mode (macOS 14+)
Game frame budget:
- 60fps = 16.67ms, 120fps = 8.33ms, VR 90fps = 11.11ms
- One frame: input → logic → physics → animation → render build → submit → present
- One-frame pipeline: parallelism via Main/Render lag, +1 frame input lag trade-off
- Scheduling causes of frame spike: context-switch storm, priority inversion, NUMA miss, P/E demotion
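The budget figures above are simply 1000 ms divided by the target frame rate; a trivial helper (my naming) makes the arithmetic explicit:

```cpp
// Frame budget in milliseconds for a target frame rate:
// everything from input sampling to present must fit inside it.
double frame_budget_ms(double fps) {
    return 1000.0 / fps;  // 60 -> ~16.67, 90 -> ~11.11, 120 -> ~8.33
}
```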
Game engine usage:
- Unity Job System, Unreal TaskGraph + Named Threads
- OS APIs: SetThreadPriority / pthread_setschedparam / pthread_set_qos_class_self_np
- Affinity: hard on Linux/Windows, hint-only on macOS
- Naughty Dog Fiber — bypasses the OS scheduler
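The “hard on Linux/Windows, hint-only on macOS” contrast can be sketched with the Linux side (pin_current_thread_to_core is a hypothetical helper; sched_setaffinity(2) with pid 0 applies to the calling thread):

```cpp
#include <sched.h>  // sched_setaffinity; glibc exposes it under _GNU_SOURCE (g++ defines it by default)

// Linux-only: restrict the calling thread to a single core. After this call
// the scheduler will never run the thread elsewhere — a hard guarantee,
// unlike macOS, where thread affinity is merely a hint to the kernel.
bool pin_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = calling thread
}
```

On Windows the equivalent hard API is SetThreadAffinityMask; the closest macOS API, thread_policy_set with THREAD_AFFINITY_POLICY, only tags threads as a grouping hint.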
Observation tools:
- Linux: chrt, nice, perf sched
- macOS: Instruments System Trace, sample, powermetrics
- Windows: Process Explorer, WPA / Xperf
- Cross-platform: Tracy Profiler
The next post is Part 10: Synchronization Primitives. We brushed up against priority inversion this time; answering it properly means first looking at what a lock really is. What’s the difference between Mutex, Semaphore, and SpinLock, and why does each OS bring its own primitives — futex / SRWLock / os_unfair_lock — to the table? With those answered, we can finally face Stage 2’s core question head-on: “Why does a program with two threads writing to the same variable only crash sometimes?”
References
Textbooks
- Silberschatz, Galvin, Gagne — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.5 (CPU Scheduling), Ch.6 (Synchronization)
- Tanenbaum, Bos — Modern Operating Systems, 4th ed., Pearson, 2014 — Ch.2.4 (Process Scheduling)
- Bovet, Cesati — Understanding the Linux Kernel, 3rd ed., O’Reilly, 2005 — Ch.7 (Process Scheduling, O(1) era)
- Mauerer — Professional Linux Kernel Architecture, Wrox, 2008 — Ch.2 (Process Management and Scheduling, post-CFS)
- Russinovich, Solomon, Ionescu — Windows Internals, 7th ed., Microsoft Press, 2017 — Ch.4 (Thread Scheduling)
- Singh — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — Ch.7 (Processes), Mach scheduler
- Gregory — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8 (Multiprocessor Game Loops)
Papers
- Stoica, Abdel-Wahab, Jeffay, Baruah, Plaxton, Tan — “A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems”, RTSS 1996 — theoretical origin of EEVDF — DOI
- Pabla — “Completely Fair Scheduler”, Linux Journal, 2009 — CFS introduction — linuxjournal.com
- Molnár, Ingo — “Modular Scheduler Core and Completely Fair Scheduler [CFS]”, LKML patch series, 2007 — CFS introduction announcement
- Zijlstra, Peter — “EEVDF Scheduler”, LWN articles, 2023 — lwn.net/Articles/925371
- Anderson, Bershad, Lazowska, Levy — “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism”, SOSP 1991 — M:N model (revisited from a scheduling angle)
- Mogul, Borg — “The Effect of Context Switches on Cache Performance”, ASPLOS 1991 — frame spike fundamentals
Official Documentation
- Linux man pages — sched(7), chrt(1), sched_setattr(2), nice(1) — man7.org
- Linux Kernel Documentation — Documentation/scheduler/sched-design-CFS.rst, sched-eevdf.rst
- Microsoft Docs — Scheduling Priorities — learn.microsoft.com
- Microsoft Docs — Priority Boosts — learn.microsoft.com
- Apple Developer — Energy Efficiency Guide for Mac Apps — Prioritize Work with Quality of Service Classes — developer.apple.com
- Apple Developer — Tuning Your Code’s Performance for Apple Silicon — developer.apple.com
- Apple Developer — Game Mode (macOS 14+) — WWDC23 “Bring your game to Mac” session
Game Development / GDC
- Gyrling, C. — Parallelizing the Naughty Dog Engine Using Fibers, GDC 2015 — gdcvault.com
- Acton, M. — Data-Oriented Design and C++, CppCon 2014 — Insomniac Games’ cache·schedule mindset
- Schreiber, B. — Multithreading the Entire Destiny Engine, GDC 2015 — Bungie’s threading model
- Boulton, M. — Threading the Frostbite Engine, GDC 2009 — DICE’s Job system
- Unity Technologies — C# Job System, Burst Compiler manual — docs.unity3d.com
- Epic Games — Task Graph System, Async Tasks in Unreal — dev.epicgames.com
- Tracy Profiler — github.com/wolfpld/tracy
Blogs / Articles
- Brendan Gregg — Linux Performance, perf sched — brendangregg.com
- Howard Oakley — The Eclectic Light Company — macOS QoS / P-E core observation series
- Fabian Giesen — Reading List on Multithreading and Synchronization — fgiesen.wordpress.com
- Raymond Chen — The Old New Thing — Windows priority boost reminiscences
- LWN.net — EEVDF, CFS group scheduling, sched_ext series
- Dmitry Vyukov — 1024cores.net — Go scheduler internals
Tools
- Linux: chrt, nice, taskset, perf sched, ftrace, bpftrace
- macOS: Instruments (System Trace, Time Profiler, CPU Counters), sample, powermetrics, dispatch_introspection
- Windows: Process Explorer, Windows Performance Recorder + Analyzer, ETW, PerfView
- Cross-platform: Tracy Profiler, Optick, Superluminal
