CS Roadmap Part 9 — Scheduling: Whom Does the OS Give the CPU To?
- CS Roadmap (0) — Why CS Knowledge Matters More Than Ever in the AI Era
- CS Roadmap (1) — Arrays and Linked Lists: Reading the Terrain of Memory
- CS Roadmap (2) — Stack, Queue, Deque: Powerful Abstractions Born from Restriction
- CS Roadmap (3) — Hash Tables: Conditions and Limits of O(1) Lookup
- CS Roadmap (4) — Trees: Order, Balance, and Guaranteeing O(log n)
- CS Roadmap (5) — Graphs: The Network of Relationships, the Science of Paths
- CS Roadmap (6) — Memory Management: Stack & Heap, GC, and the Things That Eat Your Frames
- CS Roadmap (Bonus) — Heaps and Priority Queues: The Economics of Partial Order
- CS Roadmap Part 7 — OS Architecture: The Forking Paths of Unix, NT, and XNU
- CS Roadmap Part 8 — Processes and Threads: How the OS Abstracts Execution Units
- CS Roadmap Part 9 — Scheduling: Whom Does the OS Give the CPU To?
- The scheduler makes two decisions — "whom to give the CPU" and "for how long" — evaluated against Throughput, Latency, Fairness, and Response time
- Linux evolved O(n) → O(1) → CFS (2007) → EEVDF (2024). CFS always picks the thread with the smallest vruntime from an RB-tree; EEVDF adds eligibility and deadline axes to handle latency-sensitive work better
- Windows boosts responsiveness with 32 priority levels plus dynamic boost (foreground window, I/O completion, GUI input). macOS uses 5 QoS classes that simultaneously decide priority, P/E core placement, and power management
- 60fps gives 16.67ms and 120fps gives 8.33ms to finish input → logic → physics → render → present. A single priority inversion is enough to drop a frame
- Unity Job priority, Unreal TaskGraph named threads, and SetThreadPriority / pthread / dispatch_qos are different abstractions sitting on the same OS scheduler
Introduction: The Question of “Who, and How Much”
The previous post showed what processes and threads are and how the OS abstracts them. A natural follow-up question arises.
If 100 threads are ready and there are only 8 cores, whom does the OS give the CPU to? And for how long?
Answering these two questions is the job of the scheduler. The answer determines two things:
- Perceived responsiveness: does the click react within 0.1 seconds, or take a full second?
- Frame stability: does the game hold 60fps, or occasionally exceed 17ms?
This post covers:
- Scheduling basics: Preemptive vs Cooperative, the trade-offs between Throughput, Latency, Fairness, Response
- Classic algorithms: FCFS, SJF, RR, Priority, MLFQ
- Linux: the evolution from O(n) → O(1) → CFS → EEVDF
- Windows: 32 priority levels, dynamic boost, foreground quantum stretch
- macOS: QoS-based scheduling and Apple Silicon P/E core placement
- Game frame budget: what fills 16.67ms and how priority inversion ruins it
- How game engines use priority and affinity: Unity, Unreal, and the OS API level
The core question for Stage 2 — “Why does a program with two threads writing to the same variable only crash sometimes?” — hinges on the word “sometimes”, and that word ultimately traces back to how the scheduler operates. The order and timing in which executions interleave is decided somewhere you cannot see.
Part 1: Why Scheduling Is Necessary
The Illusion of Multitasking
A desktop has a browser, IDE, Slack, Spotify, and Discord open at once. There are 8 cores, while processes and threads typically number in the hundreds. Yet they all appear to run “simultaneously.”
In reality, the OS is swapping threads in and out very rapidly. One thread holds a core for a few milliseconds, then yields to another, then to another. Swap faster than the human perception threshold (~50ms) and it feels concurrent.
The scheduler decides this swapping and answers two questions:
- Whom to give it to — among threads in the Ready state, which one should be placed on a core?
- For how long — once placed, when can it be taken back, or can it be?
Preemptive vs Cooperative
Preemptive: the scheduler can forcibly take a thread off the CPU. Periodic timer interrupts wake the kernel, which decides “who is next.” Almost every modern OS — Linux, Windows, macOS — uses this approach.
Cooperative: a thread runs until it voluntarily yields (yield). 80s Macs, pre-95 Windows, and some current coroutine/Fiber systems work this way. A single thread stuck in an infinite loop freezes the system — one cause of the old “Mac bomb icon.”
Hold on, let’s clarify this. Are goroutines and async/await cooperative?
Go’s goroutines are partially cooperative. Yield points exist only at function calls, channel operations, and GC safe points. Without a function call inside an infinite loop, other goroutines could starve, which is why Go 1.14 introduced asynchronous preemption. async/await is similar — yield points exist only at await. But the underlying threads still run on the OS’s preemptive scheduler, so the two layers are stacked.
Scheduler Evaluation Criteria
A scheduler design must consider several metrics, and they conflict with each other.
| Metric | Meaning | Who likes it |
|---|---|---|
| Throughput | Tasks completed per unit time | Batch processing, build servers |
| Turnaround time | Total elapsed time per job | Compilers, data processing |
| Waiting time | Time spent in the Ready queue | All jobs |
| Response time | Input → first reaction | Desktops, games |
| Fairness | Even resource distribution | Multi-user systems |
| Predictability | Predictable latency | Real-time systems, games |
| Energy | Power efficiency | Mobile, laptops |
Desktop and mobile typically prioritize Response time + Energy. Servers prioritize Throughput + Fairness. Real-time and games prioritize Predictability. This is why no single algorithm is optimal for every environment.
Part 2: Classic Scheduling Algorithms
Before looking at full-fledged OS schedulers, here are five textbook algorithms. Modern schedulers are all combinations or evolutions of these ideas.
FCFS (First-Come, First-Served)
The simplest. Run jobs in arrival order, and once a job starts, never preempt it (non-preemptive).
The problem is the convoy effect. If a 100-second job arrives first, all the 0.1-second jobs behind it wait 100 seconds. Average waiting time explodes.
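The effect is easy to reproduce with numbers. Below is a small sketch (the job lengths are hypothetical) that computes FCFS average waiting time for the same four jobs in two arrival orders:

```python
def fcfs_waits(bursts):
    """FCFS: each job waits for the sum of all bursts ahead of it."""
    waits, clock = [], 0.0
    for burst in bursts:
        waits.append(clock)   # time spent sitting in the ready queue
        clock += burst
    return waits

avg = lambda xs: sum(xs) / len(xs)

long_first  = [100, 0.1, 0.1, 0.1]  # the 100-second job arrives first
short_first = [0.1, 0.1, 0.1, 100]  # same jobs, short ones first

print(avg(fcfs_waits(long_first)))   # ~75.1s: everything convoys behind the long job
print(avg(fcfs_waits(short_first)))  # ~0.15s: the long job running last hurts nobody
```

Same four jobs, same total work — a 500× difference in average waiting time, purely from ordering.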
SJF / SRTF (Shortest Job First / Shortest Remaining Time First)
Run the shortest job first. Mathematically proven to be optimal in average waiting time.
Problem 1: You must know the job length in advance. In practice you can’t, so you estimate from past execution history. Problem 2: starvation — if short jobs keep arriving, long ones may never start.
Round Robin (RR)
Cycle through the ready queue and give each thread a time quantum of CPU. When the quantum expires, send it to the back of the queue and move on.
The quantum size is the key parameter.
- Too large: behaves like FCFS, responsiveness drops
- Too small: context-switch overhead eats up actual work time
Typical values are 10~100ms. Linux sizes slices dynamically (CFS/EEVDF); Windows uses a quantum of about 2 clock intervals on desktops and 12 on servers (details in Part 4).
The diagram below compares FCFS and RR for the same workload (A=8ms, B=4ms, C=2ms) arriving simultaneously.
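The same comparison can be reproduced with a tiny round-robin simulator (a sketch, assuming all jobs arrive at t=0 and a 2ms quantum):

```python
from collections import deque

def rr_completion_times(bursts, quantum):
    """Round Robin: run each job for up to `quantum`, requeue if unfinished."""
    queue = deque(range(len(bursts)))
    remaining = list(bursts)
    done = [0] * len(bursts)
    t = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        t += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)   # quantum expired → back of the queue
        else:
            done[i] = t       # job finished at time t
    return done

# A=8ms, B=4ms, C=2ms
print(rr_completion_times([8, 4, 2], quantum=2))    # [14, 10, 6]
# A huge quantum degenerates into FCFS:
print(rr_completion_times([8, 4, 2], quantum=100))  # [8, 12, 14]
# RR avg turnaround (14+10+6)/3 = 10ms vs FCFS (8+12+14)/3 ≈ 11.3ms
```

RR finishes B and C far earlier at the cost of delaying A — exactly the responsiveness-for-throughput trade the quantum size controls.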
Priority Scheduling
Assign each thread a priority and run the highest-priority one first. Within the same priority, use RR.
The problem is again starvation. Low-priority threads may never run. One remedy is aging, gradually raising the priority of threads that have waited a long time.
Another deeper problem is priority inversion. If a low-priority thread holds a lock that a high-priority thread is waiting on, while a medium-priority thread keeps preempting the low one, the result is a paradox — the high one is blocked by the medium one. The next post (synchronization) takes this on directly.
MLFQ (Multi-Level Feedback Queue)
Several queues are arranged by priority, and threads move between queues based on observed behavior.
The basic rules are:
- New threads enter the highest queue
- Using up the time quantum drops them down one level
- Yielding to I/O before quantum expiry keeps them at the same level or moves them up
The interesting consequence:
- I/O-bound work (interactive GUI, game input): short bursts then yield → stays in higher queue → fast response
- CPU-bound work (compilers, encoders): long bursts → drops to lower queues → does not interfere with responsive work
The core idea is classifying jobs by behavior even when the algorithm has no idea what kind of work they are. Windows’ dynamic boost, macOS’ QoS adjustments, and Linux’s sleeper bonus are all variants of the same idea.
MLFQ was used directly by 4.3BSD, Solaris, and Windows NT. Modern OSes ostensibly use different algorithms (e.g., CFS), but their internal heuristics resemble MLFQ.
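The demotion rule alone already separates the two behavior classes. A minimal sketch (assumptions: two levels, a 10ms quantum, and a thread described only by its CPU-burst lengths):

```python
def mlfq_final_level(cpu_bursts, quantum=10, levels=2):
    """Where does a thread end up? Burning the full quantum demotes it one
    level; yielding for I/O before the quantum expires keeps its level."""
    level = 0                         # 0 = highest-priority queue
    for burst in cpu_bursts:
        if burst >= quantum:          # ran until the timer fired → CPU-bound
            level = min(level + 1, levels - 1)
        # else: blocked for I/O early → interactive, stays up top
    return level

print(mlfq_final_level([2, 1, 3]))     # 0 — GUI-like thread keeps fast response
print(mlfq_final_level([10, 10, 10]))  # 1 — compiler-like thread sinks down
```

No job ever declares what it is; the queue level is inferred purely from how it behaved.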
Part 3: The Linux Scheduler — O(n) → O(1) → CFS → EEVDF
The evolution of the Linux scheduler is unbeatable as study material. The same problem has been solved four times over, and what was wrong each time, and how it was fixed, is fully public.
O(n) Scheduler — Pre-2.4
The early Linux scheduler walked the entire ready queue for each decision. In an era of few cores and few processes, this was fine. But once servers began running thousands of processes, the scheduler itself became the bottleneck. Adding cores didn’t help due to lock contention.
O(1) Scheduler — 2.6 (Ingo Molnár, 2003)
Introduced by Ingo Molnár in late 2002.
Core ideas:
- 140 priority queues (real-time 0–99, normal 100–139)
- A pair of active and expired queues per priority
- The next thread is chosen in constant time — just find the highest set bit in a bitmap
It also added an interactive task bonus as a heuristic. Longer sleep raised priority slightly, improving desktop responsiveness. But the heuristic grew increasingly complex, and workloads were found that gamed the bonus calculation, leaving the code a patchwork.
CFS — Completely Fair Scheduler (2.6.23, 2007)
Built again by Ingo Molnár. He credited inspiration to Con Kolivas’s RSDL (Rotating Staircase Deadline) patch.
The core idea was to define “fairness” not as simple rotation but as balancing accumulated execution time. Every thread has a virtual CPU time it should have received — vruntime — and the scheduler always picks the thread with the smallest vruntime.
Virtual runtime (vruntime) is the actual runtime corrected by weight:

\[\text{vruntime} \mathrel{+}= \Delta t \times \frac{w_0}{w}\]

Here $w$ is the thread’s weight (set by nice), and $w_0$ is the reference weight for nice 0 (1024). Lower nice (higher priority) means a larger weight, so vruntime grows more slowly and the thread gets picked more often.
The data structure is a Red-Black Tree keyed on vruntime. The leftmost node (smallest vruntime) is the next runner, and insert / remove / select are all $O(\log n)$. Asymptotically slower than O(1), but in practice n is small enough that it doesn’t matter, and removing the heuristics made the code far cleaner.
The diagram below shows CFS’s core cycle. Pick the thread with the smallest vruntime from the RB-tree, run it, then re-insert it with the updated vruntime.
(Diagram: the leftmost node T0 is picked → runs for Δ → vruntime += Δ × w₀ / w_T0 → re-inserted into the RB-tree at its new position)
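The weight correction is worth feeling with numbers. A sketch of the pick-smallest-vruntime loop (1024 and 335 are the kernel’s weights for nice 0 and nice 5; the dict `min` stands in for the RB-tree’s leftmost lookup):

```python
W0 = 1024  # reference weight for nice 0

def simulate_cfs(weights, ticks, slice_ms=1.0):
    """Each tick, run the thread with the smallest vruntime for one slice."""
    vruntime = {name: 0.0 for name in weights}
    runs = {name: 0 for name in weights}
    for _ in range(ticks):
        name = min(vruntime, key=vruntime.get)         # "leftmost node"
        vruntime[name] += slice_ms * W0 / weights[name]  # weighted growth
        runs[name] += 1
    return runs

# nice 0 (weight 1024) vs nice 5 (weight 335): CPU split ≈ 1024:335 ≈ 3:1
print(simulate_cfs({"nice0": 1024, "nice5": 335}, ticks=100))
```

The higher-weight thread accumulates vruntime three times more slowly, so it gets picked roughly three times as often — fairness measured by accumulated virtual time, not by turn-taking.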
CFS’s key parameters are:
- `sched_latency_ns` — the target period in which every ready thread runs once (default 6ms, scaled up with core count)
- `sched_min_granularity_ns` — the minimum slice for one thread (default 0.75ms)
- `sched_wakeup_granularity_ns` — the vruntime difference a waking thread needs before it may preempt the current one
You can inspect them via sysctl -a | grep sched, and they auto-adjust based on core count.
/* Simplified CFS pick logic */
struct task_struct *pick_next_task_fair(struct rq *rq) {
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se = __pick_first_entity(cfs_rq); /* RB-tree leftmost */
return container_of(se, struct task_struct, se);
}
/* Called every tick — update vruntime and rebalance */
void update_curr(struct cfs_rq *cfs_rq) {
struct sched_entity *curr = cfs_rq->curr;
u64 delta_exec = now - curr->exec_start;
curr->vruntime += calc_delta_fair(delta_exec, curr);
/* If preemption is warranted, resched_curr() */
}
EEVDF — Earliest Eligible Virtual Deadline First (6.6, 2023~2024)
CFS served well for 16 years, but one structural issue remained: expressing latency-sensitive work was hard.
CFS could control how often a thread runs through nice, but not how quickly it must respond. The two concepts were collapsed onto the same axis.
Peter Zijlstra brought EEVDF into mainline starting in 2023, and it became the default scheduler in the 6.6 LTS kernel. The academic background is the 1996 paper by Stoica, Abdel-Wahab, Jeffay, Baruah.
EEVDF’s two axes are:
- Eligibility — has this thread already run its share? If not, it’s eligible
- Virtual Deadline — among eligible threads, pick the one whose deadline comes soonest
Deadline is computed as:
\[\text{deadline} = \text{eligible time} + \frac{\text{request size}}{\text{weight}}\]

A smaller request size (a per-task slice request, discussed upstream under the name latency-nice) yields an earlier virtual deadline, so the thread preempts more often but runs in shorter bursts. In other words, work like a game’s main thread that “doesn’t run that often, but must respond instantly when it does” can finally be expressed directly.
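A minimal model of the two-axis pick (simplifying assumptions: eligibility means vruntime at or below the weight-averaged vruntime, and the deadline formula above):

```python
def pick_eevdf(threads):
    """threads: name -> {"vruntime": v, "request": r, "weight": w}.
    Eligible = not ahead of its fair share; among eligible threads,
    pick the earliest virtual deadline = vruntime + request / weight."""
    avg_v = (sum(t["vruntime"] * t["weight"] for t in threads.values())
             / sum(t["weight"] for t in threads.values()))
    eligible = {n: t for n, t in threads.items() if t["vruntime"] <= avg_v}
    return min(eligible,
               key=lambda n: eligible[n]["vruntime"]
                             + eligible[n]["request"] / eligible[n]["weight"])

threads = {
    "main":   {"vruntime": 10.0, "request": 1.0,  "weight": 1024},  # tiny slice request
    "worker": {"vruntime": 10.0, "request": 10.0, "weight": 1024},  # bulk work
}
print(pick_eevdf(threads))  # main — same fairness, but an earlier deadline
```

Equal vruntime means CFS would treat both threads identically; under EEVDF the smaller request alone wins the pick — exactly the latency axis CFS lacked.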
/* EEVDF: request a short slice independently of nice.
 * Under SCHED_NORMAL the sched_runtime field is reused as the slice request;
 * sched_deadline / sched_period are only meaningful for SCHED_DEADLINE. */
struct sched_attr attr = {
    .size          = sizeof(attr),
    .sched_policy  = SCHED_NORMAL,
    .sched_nice    = 0,
    .sched_runtime = 1 * 1000 * 1000,  /* 1ms slice request → earlier deadline */
};
sched_setattr(pid, &attr, 0);
Even with EEVDF, vruntime-based fairness remains. EEVDF is less a replacement for CFS than a refinement of the selection policy, and external interfaces (nice, cgroup cpu.weight) are mostly unchanged.
Linux’s Other Scheduling Classes
Linux uses not a single algorithm but a layered set of classes. Each class has a priority, and lower classes don’t run while higher ones have work.
| Class | Policy | Use |
|---|---|---|
| stop | (kernel only) | CPU hotplug, RCU, etc. |
| dl | SCHED_DEADLINE | Real-time (period + runtime + deadline guaranteed) |
| rt | SCHED_FIFO, SCHED_RR | Real-time priorities 1–99 |
| fair | SCHED_NORMAL/BATCH/IDLE | General, CFS/EEVDF |
| idle | (when all idle) | swapper |
Why you should not use SCHED_FIFO/RR carelessly in games: misuse can freeze the whole system — a single priority-99 infinite loop renders that core unresponsive from then on. Even genuinely real-time audio threads are safer created through the OS’s higher-level facilities (QoS classes, the platform audio APIs’ realtime thread support) than through direct SCHED_FIFO.
Part 4: The Windows Scheduler — Priority + Boost
The Windows NT scheduler is a 32-level priority system designed by Dave Cutler based on his VMS experience. The core skeleton has been essentially unchanged since NT 3.1 (1993); over time, only heuristics and hardware-adaptive code have been layered on top.
32 Priority Levels
| Level | Meaning |
|---|---|
| 0 | Zero page thread (for zeroing memory) |
| 1~15 | Variable priority (regular processes, subject to dynamic adjustment) |
| 16~31 | Real-time priority (admin needed, no dynamic adjustment) |
Each process has a Priority Class, and within it threads are fine-tuned via Thread Priority.
/* Windows priority = Process Class + Thread Priority offset */
SetPriorityClass(hProcess, NORMAL_PRIORITY_CLASS); /* base 8 */
SetThreadPriority(hThread, THREAD_PRIORITY_NORMAL); /* offset 0 */
/* HIGH_PRIORITY_CLASS = 13, THREAD_PRIORITY_HIGHEST = +2 → effective 15 */
/* REALTIME_PRIORITY_CLASS = 24, ... */
Quantum
Windows’ time quantum is measured as a multiple of the clock interval. The clock interval is typically about 15.6ms by default, or 1ms when an application raises the timer resolution (e.g. via multimedia timers).
- Workstation: 2 clock intervals (default ~30ms but typically shorter after post-boot adjustments)
- Server: 12 clock intervals (long quantum favors throughput)
Furthermore, the quantum is stretched for foreground processes. The thread of the window the user is currently looking at gets longer time, improving responsiveness (Control Panel → System → Advanced → Performance Options → Advanced has a “Programs” / “Background services” toggle that turns this on and off).
Priority Boost — Windows’ Core Heuristic
Threads in the variable priority range (1~15) get temporarily boosted by various events. The boost decreases by 1 each quantum until it returns to the base.
| Event | Boost amount |
|---|---|
| Disk I/O complete | +1 |
| Network / Mailslot | +2 |
| Mouse / Keyboard input | +6 |
| Sound card | +8 |
| GUI thread receives a message | +2 (foreground extra) |
| Semaphore wait release | +1 |
| Mutex/Event/Timer wait release | +1 |
This heuristic is the mechanism behind Windows’ responsiveness. Move the mouse and the GUI thread gets +6; key input gets +6 too. Even when CPU-bound work is running, input remains responsive.
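The decay side is mechanical and easy to sketch (assuming the simplest model: the boost drops by 1 per expired quantum, never below the base):

```python
def priority_over_quanta(base, boost, quanta):
    """Windows-style variable-priority decay after a boost event."""
    return [max(base, base + boost - q) for q in range(quanta)]

# Normal thread (base priority 8) receives keyboard input (+6 boost):
print(priority_over_quanta(base=8, boost=6, quanta=9))
# [14, 13, 12, 11, 10, 9, 8, 8, 8] — snappy right after input, fair soon after
```

The boost buys the GUI thread a few quanta of precedence, then fairness reasserts itself automatically.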
Hold on, let’s clarify this. If GUI thread boost is +6, how does priority get distributed across multiple windows?
Only the focused window’s thread receives the additional foreground boost. The moment Alt-Tab changes the active window, the boost distribution changes too. Turn on the priority column in Process Explorer and click another window — you’ll see the priority of that window’s thread briefly tick up.
The Realtime Priority Trap
Levels 16~31 have no dynamic boost — they always run at that priority. In theory this means “never yields.” Audio, video capture, and some game threads use 16~22.
But using REALTIME_PRIORITY_CLASS (24~31) in regular code is dangerous. A single infinite loop at 24 or above can freeze even the mouse cursor — mouse handling is also just a thread.
NUMA, SMT, Heterogeneous
The modern Windows scheduler considers NUMA nodes, SMT (hyperthreading), and Intel Thread Director (P/E core hints) together. Windows 11’s scheduler consumes Thread Director hints to steer threads between P- and E-cores — what Apple Silicon does purely at the OS level, Intel solves through OS and CPU collaboration.
/* Windows: thread priority adjustment + affinity */
HANDLE h = GetCurrentThread();
SetThreadPriority(h, THREAD_PRIORITY_TIME_CRITICAL); /* +15 */
/* Pin to cores 0 and 1 */
DWORD_PTR mask = 0x3;
SetThreadAffinityMask(h, mask);
/* Windows 10+ : E-core preference hint */
THREAD_POWER_THROTTLING_STATE state = {};
state.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
state.StateMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
SetThreadInformation(h, ThreadPowerThrottling, &state, sizeof(state));
The final THREAD_POWER_THROTTLING_EXECUTION_SPEED is an API that hints to the OS, “this thread is fine running slowly on an E-core.” Apply it to background work and the P-cores stay free for the game’s main thread.
Part 5: The macOS Scheduler — QoS + P/E Cores
macOS runs on a Mach-derived kernel, but what developers see is not raw priorities: the scheduler’s priority bands are topped with the higher-level QoS (Quality of Service) abstraction. Developers almost always express priority through QoS, and the kernel translates it into priority + core placement + power management.
5 QoS Classes
| QoS Class | Meaning | Examples | Mapped priority |
|---|---|---|---|
| User Interactive | Immediate response, what the user is directly seeing | Main thread, animation, input | 47 |
| User Initiated | User-started work the user is waiting for | File open, search | 37 |
| Default | Unspecified | General work | 31 |
| Utility | User doesn’t need the result immediately (progress shown) | Downloads, imports | 20 |
| Background | Invisible to the user | Indexing, backup | 5 |
This QoS value determines:
- CPU priority — the number above
- CPU scheduling latency — User Interactive wakes fast, Background batches work
- I/O priority — disk queue priority
- CPU core placement (Apple Silicon) — User Interactive/Initiated prefer P-cores, Utility/Background prefer E-cores
- Timer coalescing — Background batches timer firings
- GPU priority — affects some graphics workloads
A single line of QoS decides all six at once.
QoS API
/* C/Objective-C — set the current thread's QoS */
pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
/* When creating a GCD queue */
dispatch_queue_t q = dispatch_queue_create_with_target(
"com.example.render",
DISPATCH_QUEUE_SERIAL,
dispatch_get_global_queue(QOS_CLASS_USER_INTERACTIVE, 0));
/* Attach QoS to a dispatch_async */
dispatch_async(q, ^{
/* Runs at User Interactive */
});
// Swift
DispatchQueue.global(qos: .userInteractive).async {
// Quick work to relieve the main thread
}
// Operation API
let op = BlockOperation { /* ... */ }
op.qualityOfService = .userInitiated
queue.addOperation(op)
QoS Inheritance — Priority Inversion Prevention
QoS propagates automatically. If work dispatched on a User Interactive queue does dispatch_sync on another queue, the target queue’s QoS is temporarily boosted to User Interactive. This mechanism is at the heart of macOS’s priority inversion mitigation.
The same applies to locks. os_unfair_lock raises the lock holder’s QoS to the QoS of any waiter — POSIX’s PTHREAD_PRIO_INHERIT done automatically at the OS level.
QoS → priority → P/E core mapping
The diagram below shows how a single line of QoS translates into Mach priority and Apple Silicon’s P/E core placement.
Game Mode (macOS 14+)
Game Mode, added in macOS Sonoma, is a mode the OS auto-enables when a game is fullscreen. Effects:
- More aggressive QoS suppression for background work (Spotlight indexing, Time Machine, etc.)
- Stronger P-core priority for the game process
- Doubled audio/input polling rates for AirPods and PS5 controllers
The idea is similar to iOS’s Sustained Performance API — telling the OS, “this app cannot afford to miss 16.67ms right now,” so system-wide resource allocation adjusts.
Part 6: Game Frame Budget — 16.67ms
Now let’s transfer the OS theory above into a game context. For game developers, scheduling is ultimately about the frame budget.
The Math of the Frame Budget
\[\text{frame budget} = \frac{1000\,\text{ms}}{\text{target FPS}}\]

| Target FPS | Frame budget | Who uses it |
|---|---|---|
| 30 | 33.33ms | Console cinematics, some mobile |
| 60 | 16.67ms | Standard games |
| 90 | 11.11ms | VR minimum |
| 120 | 8.33ms | High-FPS PC, PS5 Performance |
| 144 | 6.94ms | High-refresh monitors |
| 240 | 4.17ms | Competitive FPS, e-sports |
All of the following must finish within this time:
- Input handling — keyboard, mouse, gamepad, touch
- Game Logic — AI, behavior, state updates
- Physics / Collision — one step of discrete simulation
- Animation — bone matrix computation, blending
- Particle / VFX — particle updates
- Render command build — draw call sorting, culling
- GPU submit — command buffer queueing
- Present — back buffer → screen (including VSync wait)
Game engines distribute this across multiple threads. Drawn on a timeline, what happens in one frame looks like this.
One-Frame Pipeline
Most engines stagger Main and Render by one frame. While Main builds the game state for frame N, Render is dispatching GPU work for the state of frame N-1. The two threads don’t touch the same data simultaneously, reducing locks, but input lag increases by one frame.
VR and e-sports titles are extremely sensitive to this trade-off. Features like NVIDIA Reflex and AMD Anti-Lag try to reduce this pipeline depth.
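The cost of the stagger can be put in numbers under a simplified model (assumption: input sampled at the start of a frame becomes visible one present later per pipeline stage):

```python
def input_to_present_ms(fps, pipeline_stages):
    """Simplified input latency: (1 + stages) × frame budget."""
    return (1 + pipeline_stages) * 1000.0 / fps

print(round(input_to_present_ms(60, 0), 2))   # 16.67 — no Main/Render stagger
print(round(input_to_present_ms(60, 1), 2))   # 33.33 — one-frame pipeline
print(round(input_to_present_ms(120, 1), 2))  # 16.67 — doubling fps buys it back
```

This is why high-refresh modes reduce perceived input lag even when the pipeline depth stays the same, and why latency-reduction features attack the `pipeline_stages` term directly.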
Frame Spike Causes — A Scheduling View
When average frametime is 11ms but occasionally spikes to 23ms (a “frame spike”), the cause is often OS scheduling itself, in addition to GC, disk I/O, and syscalls.
- Context switch storm: more threads than cores → frequent swaps, cache pollution slows the main thread
- Priority inversion: main thread waits for a worker’s lock while an unrelated thread keeps preempting the worker
- NUMA miss: a thread migrates to a different NUMA node, and cache/memory latency explodes
- P/E core demotion: when macOS Game Mode isn’t applied, the game’s main thread can briefly land on an E-core, doubling frametime
Mitigations:
- Pin the main and render threads to fixed cores (affinity)
- Cap workers at about `core count - 2` to keep cores free for main and render
- macOS: use `QOS_CLASS_USER_INTERACTIVE`; Windows: `THREAD_PRIORITY_HIGHEST` (avoid TIME_CRITICAL when possible)
- Mark background threads explicitly with BACKGROUND / lower priority — the OS will route them appropriately on P/E setups
A Priority Inversion Scenario
A common shape in games:
time Main (qos=USER_INTERACTIVE) Worker (qos=UTILITY) Other (qos=DEFAULT)
0ms enqueue logic idle running
1ms AI result needed → mutex_lock(M) wait holds M -
2ms (waiting) preempted by Other running ← problem
... 6ms (waiting) preempted by Other running
7ms (waiting) releases M -
7.1ms unblocked → resumes -
Main is stuck from 1ms to 7ms while Worker can’t run either; only Other runs. macOS automatically boosts the M holder (Worker)’s QoS to USER_INTERACTIVE in this case, so Other can no longer preempt Worker. POSIX’s PTHREAD_PRIO_INHERIT and Windows’ ALPC auto-boost are similar solutions. The next post (synchronization) covers this in depth.
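The timeline above can be reproduced with a toy scheduler (a sketch: 1ms ticks, highest effective priority wins, and `pi=True` models priority inheritance — the lock holder temporarily runs at the waiter’s priority):

```python
def simulate(pi, horizon=20):
    """Three threads: Main (prio 3) blocks on a mutex after 1 tick,
    Worker (prio 1) holds it and needs 2 ticks of CPU to release it,
    Other (prio 2) is pure CPU-bound. Returns the tick at which Main
    acquires the mutex, or None if it starves within the horizon."""
    worker_work = 2
    main_blocked = False
    for t in range(horizon):
        if not main_blocked:
            main_blocked = True          # t=0: Main runs, then hits mutex_lock(M)
            continue
        if worker_work == 0:
            return t                     # mutex released → Main (prio 3) runs
        worker_prio = 3 if pi else 1     # inheritance boosts the holder
        if worker_prio > 2:
            worker_work -= 1             # Worker outranks Other, makes progress
        # else: Other (prio 2) preempts Worker every tick — the inversion
    return None

print(simulate(pi=True))    # 3 — Worker inherits prio 3 and finishes quickly
print(simulate(pi=False))   # None — Other starves Worker; Main never unblocks
```

macOS’s QoS donation and POSIX’s PTHREAD_PRIO_INHERIT implement the `pi=True` branch at the OS level.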
Part 7: How Game Engines Use Priority and Affinity
Unity — Job System Priority
Unity’s C# Job System manages an internal worker thread pool (default worker count = ProcessorCount - 1), and is scheduled via JobHandle.
// Unity 2022+
using Unity.Jobs;
using Unity.Collections;
[Unity.Burst.BurstCompile]
struct PhysicsStepJob : IJobParallelFor {
public NativeArray<float3> positions;
public NativeArray<float3> velocities;
public float dt;
public void Execute(int i) {
positions[i] += velocities[i] * dt;
}
}
void Update() {
var job = new PhysicsStepJob {
positions = positionsArray,
velocities = velocitiesArray,
dt = Time.deltaTime
};
JobHandle h = job.Schedule(positionsArray.Length, 64);
    h.Complete(); // Sync with main thread — must finish in this frame
}
Worker count can be controlled via JobsUtility.JobWorkerCount (capped by JobWorkerMaximumCount), and JobHandle.ScheduleBatchedJobs() flushes batched jobs out to the workers. Player Settings → Other Settings → Use job worker count also lets you set this explicitly — on an 8 P-core + 4 E-core machine, lowering workers to 8 lets Main hold a P-core more reliably.
Unity — Application.targetFrameRate, vSyncCount
// Mobile, locked 60fps
QualitySettings.vSyncCount = 0;
Application.targetFrameRate = 60;
// Desktop, follow monitor refresh rate
QualitySettings.vSyncCount = 1;
Application.targetFrameRate = -1;
Unreal — TaskGraph and Named Threads
// Unreal: dispatch work onto a specific thread
ENamedThreads::Type Target = ENamedThreads::GameThread; /* or RenderThread, AnyThread */
FFunctionGraphTask::CreateAndDispatchWhenReady(
[](){ /* runs on GameThread */ },
TStatId(),
nullptr,
Target);
// Parallel worker pool
ParallelFor(NumElements, [&](int32 i) {
Process(i);
}, EParallelForFlags::None);
Unreal forces explicit serialization through named threads like GameThread, RenderThread, and RHIThread. Tasks queued to the worker pool go into a priority queue, and Unreal Insights visualizes where each task ran.
Calling OS APIs Directly
Sometimes you need to set priority directly via OS APIs, bypassing the engine.
// Cross-platform thread priority — common in game engine cores
void SetThreadHighPriority(std::thread& t) {
#if defined(_WIN32)
    SetThreadPriority(t.native_handle(), THREAD_PRIORITY_HIGHEST);
#elif defined(__APPLE__)
    /* Caveat: pthread_set_qos_class_self_np applies to the CALLING thread,
       so invoke this at the top of the target thread's entry function */
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
    /* or thread_policy_set + thread_extended_policy_data_t */
#elif defined(__linux__)
    struct sched_param p = {};
    p.sched_priority = 0;  /* SCHED_NORMAL ignores static priority */
    pthread_setschedparam(t.native_handle(), SCHED_NORMAL, &p);
    /* nice is per thread on Linux — call setpriority(PRIO_PROCESS, gettid(), -5)
       from the target thread itself */
#endif
}
Thread Affinity — Pinning to Cores
// Linux
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set); /* core 0 */
CPU_SET(1, &set);
pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
// Windows
SetThreadAffinityMask(GetCurrentThread(), 0x3);
// macOS — affinity API is deprecated, only hints possible
thread_affinity_policy_data_t policy = { 1 /* tag */ };
thread_policy_set(pthread_mach_thread_np(pthread_self()),
THREAD_AFFINITY_POLICY,
(thread_policy_t)&policy, 1);
Hold on, let’s clarify this. Why doesn’t macOS have hard affinity?
Apple’s stance is consistent — “developers don’t know better than the OS.” The OS factors in heterogeneous P/E cores, power states, thermal limits, and core parking, so an app forcibly pinning a core often loses more than it gains. Instead, THREAD_AFFINITY_POLICY lets you hint “keep these threads in the same cache/affinity group,” and QoS lets you express P/E preference.
Naughty Dog’s Fiber Approach (Revisited)
Part 8 (Processes and Threads) briefly introduced Naughty Dog’s Fiber model. From a scheduling viewpoint, Naughty Dog hardly uses the OS scheduler at all.
- One worker thread per core, pinned with affinity
- Every thread pulls the next fiber from a fiber pool and runs it (cooperative)
- Fiber switching takes tens of nanoseconds (1/100 of an OS context switch’s few microseconds)
- From the OS’s view, it’s effectively “7 threads pinned to 7 cores — don’t wake them”
This is the heart of Christian Gyrling’s GDC 2015 talk. For ordinary games, this is overkill, but for AAA console titles where razor-sharp frame consistency is required, the choice is to bypass the OS and control the schedule directly.
Part 8: Hands-On Observation — Which Threads Run Where
Linux — chrt, nice, perf sched
# Change current shell's nice (smaller value = higher priority)
$ nice -n -5 ./mygame
# Inspect a running process's policy/priority
$ chrt -p $(pidof mygame)
pid 12345's current scheduling policy: SCHED_OTHER
pid 12345's current scheduling priority: 0
# Switch to SCHED_RR (root needed)
$ sudo chrt -r -p 50 $(pidof mygame)
# Trace scheduling events
$ sudo perf sched record -a sleep 10
$ sudo perf sched latency
# Task | Runtime ms | Switches | Avg delay ms | Max delay ms |
# mygame:12345 | 2543.123 | 8421 | 0.045 | 2.103 |
If Max delay exceeds 16ms, that frame likely had a frame spike.
macOS — Activity Monitor, Instruments, sample
The Instruments System Trace template is the most accurate. What it measures:
- Which thread runs on each core (P0~P7, E0~E3)
- Color coded by QoS class
- Context switch events and their reason (preemption, voluntary block, etc.)
- Thread state transitions (run / runnable / waiting / stopped)
# CPU usage per thread
$ top -F -R -o cpu -stats pid,command,cpu,th,state
# Sample call stacks of all threads in a process for 5 seconds at 1ms intervals
$ sample <pid> 5 1 -mayDie
# powermetrics for per-core utilization (P/E split)
$ sudo powermetrics --samplers cpu_power -i 1000
Windows — Process Explorer, WPA, Xperf
What the Threads tab in Process Explorer shows:
- Each thread’s base/dynamic priority columns
- “Stack” button for call stack
- “I/O Priority”, “Memory Priority” columns (Win10+)
Xperf / Windows Performance Recorder:
# 1: Start profile
wpr -start GeneralProfile -filemode
# 2: Run the game, exercise the workload
# 3: Stop → collect ETL
wpr -stop trace.etl
# 4: Analyze with WPA (CPU usage by Thread, Generic Events, etc.)
wpa.exe trace.etl
The two graphs in WPA, “CPU Usage (Sampled)” and “CPU Usage (Precise)”, differ importantly. Sampled is averaged; Precise is based on context-switch events and accurate for frame spike analysis.
The Habit of Measuring
Forming the habit of measuring what the scheduler does instead of guessing is essential. Tracy Profiler embeds in your engine and visualizes every thread’s frame-by-frame activity at nanosecond granularity — both Unity and Unreal have integrated plugins.
// Tracy usage example
#include "Tracy.hpp"
void GameLoop() {
ZoneScoped; /* Auto-instrumented per function */
{
ZoneScopedN("AI Update");
UpdateAI();
}
{
ZoneScopedN("Physics Step");
StepPhysics();
}
}
Tracy also offers macros like LockableBase and FrameMark for synchronization and frame-boundary markers, which makes priority inversion easier to spot visually.
Wrap-Up
This post covered:
Scheduling basics:
- Two decisions: “whom to” + “for how long”
- Preemptive vs Cooperative
- Evaluation criteria: Throughput, Latency, Fairness, Response, Energy
Classic algorithms:
- FCFS — convoy effect
- SJF — optimal average wait time, starvation
- RR — quantum trade-off
- Priority — starvation, foreshadows priority inversion
- MLFQ — automatic priority adjustment by observed behavior
Linux:
- O(n) → O(1) (Ingo Molnár, 2003)
- CFS (2007) — vruntime, RB-tree, completely fair
- EEVDF (2024) — eligibility + virtual deadline, latency-nice added
- Class hierarchy: stop > dl > rt > fair > idle
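As a back-of-envelope illustration of the vruntime rule recapped above (a sketch, not kernel code; vruntime_delta is a made-up helper, while the weights 1024 and 3121 are the kernel’s sched_prio_to_weight values for nice 0 and nice -5):

```cpp
#include <cstdint>

// CFS advances a thread's vruntime more slowly the heavier its weight:
//   delta_vruntime = delta_exec * NICE_0_WEIGHT / weight
constexpr uint64_t NICE_0_WEIGHT = 1024;

uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight) {
    return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

Over the same 10 ms of real CPU time, a nice 0 thread accrues 10,000,000 ns of vruntime while a nice -5 thread (weight 3121) accrues only about 3.28 ms worth, so the higher-priority thread keeps the smallest vruntime longer and gets picked from the RB-tree more often.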
Windows:
- 32 priority levels (Variable 1–15, Realtime 16–31)
- Foreground quantum stretch
- Dynamic boost: I/O complete +1, Mouse/Keyboard +6, Sound +8
- Realtime has no dynamic boost — dangerous territory
macOS:
- 5 QoS classes (User Interactive ↔ Background)
- A single line decides priority + scheduling latency + I/O priority + P/E core + timer coalescing + GPU priority simultaneously
- QoS inheritance auto-mitigates priority inversion
- Game Mode (macOS 14+)
Game frame budget:
- 60fps = 16.67ms, 120fps = 8.33ms, VR 90fps = 11.11ms
- One frame: input → logic → physics → animation → render build → submit → present
- One-frame pipeline: parallelism via Main/Render lag, +1 frame input lag trade-off
- Scheduling causes of frame spike: context-switch storm, priority inversion, NUMA miss, P/E demotion
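The budget figures above are simply 1000 ms divided by the target frame rate; a trivial helper (my naming) makes the arithmetic explicit:

```cpp
// Frame budget in milliseconds for a target frame rate:
// everything from input sampling to present must fit inside it.
double frame_budget_ms(double fps) {
    return 1000.0 / fps;  // 60 -> ~16.67, 90 -> ~11.11, 120 -> ~8.33
}
```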
Game engine usage:
- Unity Job System, Unreal TaskGraph + Named Threads
- OS APIs: SetThreadPriority / pthread_setschedparam / pthread_set_qos_class_self_np
- Affinity: hard on Linux/Windows, hint-only on macOS
- Naughty Dog Fiber — bypasses the OS scheduler
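The “hard on Linux/Windows, hint-only on macOS” contrast can be sketched with the Linux side (pin_current_thread_to_core is a hypothetical helper; sched_setaffinity(2) with pid 0 applies to the calling thread):

```cpp
#include <sched.h>  // sched_setaffinity; glibc exposes it under _GNU_SOURCE (g++ defines it by default)

// Linux-only: restrict the calling thread to a single core. After this call
// the scheduler will never run the thread elsewhere — a hard guarantee,
// unlike macOS, where thread affinity is merely a hint to the kernel.
bool pin_current_thread_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // pid 0 = calling thread
}
```

On Windows the equivalent hard API is SetThreadAffinityMask; the closest macOS API, thread_policy_set with THREAD_AFFINITY_POLICY, only tags threads as a grouping hint.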
Observation tools:
- Linux: chrt, nice, perf sched
- macOS: Instruments System Trace, sample, powermetrics
- Windows: Process Explorer, WPA / Xperf
- Cross-platform: Tracy Profiler
The next post is Part 10: Synchronization Primitives. We brushed up against priority inversion this time; answering it properly means first looking at what a lock really is. What’s the difference between Mutex, Semaphore, and SpinLock, and why does each OS bring its own primitives — futex / SRWLock / os_unfair_lock — to the table? With those answered, we can finally face Stage 2’s core question head-on: “Why does a program with two threads writing to the same variable only crash sometimes?”
References
Textbooks
- Silberschatz, Galvin, Gagne — Operating System Concepts, 10th ed., Wiley, 2018 — Ch.5 (CPU Scheduling), Ch.6 (Synchronization)
- Tanenbaum, Bos — Modern Operating Systems, 4th ed., Pearson, 2014 — Ch.2.4 (Process Scheduling)
- Bovet, Cesati — Understanding the Linux Kernel, 3rd ed., O’Reilly, 2005 — Ch.7 (Process Scheduling, O(1) era)
- Mauerer — Professional Linux Kernel Architecture, Wrox, 2008 — Ch.2 (Process Management and Scheduling, post-CFS)
- Russinovich, Solomon, Ionescu — Windows Internals, 7th ed., Microsoft Press, 2017 — Ch.4 (Thread Scheduling)
- Singh — Mac OS X Internals: A Systems Approach, Addison-Wesley, 2006 — Ch.7 (Processes), Mach scheduler
- Gregory — Game Engine Architecture, 3rd ed., CRC Press, 2018 — Ch.8 (Multiprocessor Game Loops)
Papers
- Stoica, Abdel-Wahab, Jeffay, Baruah, Plaxton, Tan — “A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems”, RTSS 1996 — theoretical origin of EEVDF — DOI
- Pabla — “Completely Fair Scheduler”, Linux Journal, 2009 — CFS introduction — linuxjournal.com
- Molnár, Ingo — “Modular Scheduler Core and Completely Fair Scheduler [CFS]”, LKML patch series, 2007 — CFS introduction announcement
- Zijlstra, Peter — “EEVDF Scheduler”, LWN articles, 2023 — lwn.net/Articles/925371
- Anderson, Bershad, Lazowska, Levy — “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism”, SOSP 1991 — M:N model (revisited from a scheduling angle)
- Mogul, Borg — “The Effect of Context Switches on Cache Performance”, ASPLOS 1991 — frame spike fundamentals
Official Documentation
- Linux man pages — sched(7), chrt(1), sched_setattr(2), nice(1) — man7.org
- Linux Kernel Documentation — Documentation/scheduler/sched-design-CFS.rst, sched-eevdf.rst
- Microsoft Docs — Scheduling Priorities — learn.microsoft.com
- Microsoft Docs — Priority Boosts — learn.microsoft.com
- Apple Developer — Energy Efficiency Guide for Mac Apps — Prioritize Work with Quality of Service Classes — developer.apple.com
- Apple Developer — Tuning Your Code’s Performance for Apple Silicon — developer.apple.com
- Apple Developer — Game Mode (macOS 14+) — WWDC23 “Bring your game to Mac” session
Game Development / GDC
- Gyrling, C. — Parallelizing the Naughty Dog Engine Using Fibers, GDC 2015 — gdcvault.com
- Acton, M. — Data-Oriented Design and C++, CppCon 2014 — Insomniac Games’ cache·schedule mindset
- Schreiber, B. — Multithreading the Entire Destiny Engine, GDC 2015 — Bungie’s threading model
- Boulton, M. — Threading the Frostbite Engine, GDC 2009 — DICE’s Job system
- Unity Technologies — C# Job System, Burst Compiler manual — docs.unity3d.com
- Epic Games — Task Graph System, Async Tasks in Unreal — dev.epicgames.com
- Tracy Profiler — github.com/wolfpld/tracy
Blogs / Articles
- Brendan Gregg — Linux Performance, perf sched — brendangregg.com
- Howard Oakley — The Eclectic Light Company — macOS QoS / P-E core observation series
- Fabian Giesen — Reading List on Multithreading and Synchronization — fgiesen.wordpress.com
- Raymond Chen — The Old New Thing — Windows priority boost reminiscences
- LWN.net — EEVDF, CFS group scheduling, sched_ext series
- Dmitry Vyukov — 1024cores.net — Go scheduler internals
Tools
- Linux: chrt, nice, taskset, perf sched, ftrace, bpftrace
- macOS: Instruments (System Trace, Time Profiler, CPU Counters), sample, powermetrics, dispatch_introspection
- Windows: Process Explorer, Windows Performance Recorder + Analyzer, ETW, PerfView
- Cross-platform: Tracy Profiler, Optick, Superluminal
