
Render Pipeline - Every Step of How the GPU Draws the Screen

TL;DR — Key Takeaways
  • The render pipeline is the overall process that transforms 3D data into 2D screen pixels, and it proceeds through Application (CPU) → Geometry → Rasterization → Pixel Processing → Output Merger
  • The mathematical essence of coordinate transformation (MVP) lies in homogeneous coordinates and affine transforms, and dividing by w in perspective projection creates depth perspective
  • Rendering architectures such as Forward, Deferred, Forward+, and TBDR each have different trade-offs in light count, memory, and bandwidth
  • Modern GPU rendering (Nanite, Lumen, Mesh Shader) reconstructs the traditional pipeline in a GPU-driven manner

Introduction

If the post Shader Programming treated shaders as “programs running on the GPU,” this post examines the stage in which those shaders actually execute. The rendering pipeline is the entire flow that transforms a 3D world into 2D pixels on a monitor.

Knowing how to write shaders and understanding why they run in certain ways at specific points within the pipeline are two different things. Understanding the rendering pipeline enables you to do the following:

| Level of Understanding | What You Can Do |
|---|---|
| Pipeline Flow | Diagnose CPU or GPU draw-call bottlenecks |
| Coordinate Transformations | Resolve flipped normal maps and broken shadows caused by coordinate-system bugs |
| Rasterization | Solve overdraw, Z-fighting, and anti-aliasing problems |
| Rendering Architecture | Choose between Forward/Deferred rendering, optimize lighting performance |
| GPU Hardware | Optimize state changes and texture cache usage |

This post is structured based on Professor Sungwan Kim’s lecture “Core Analysis of Game Engines,” with additional practical explanations added.

Part 1: GPU Hardware and Memory

To truly understand the rendering pipeline, you need to know about the characteristics of the hardware it runs on. GPUs are processors designed fundamentally differently from CPUs.

1. GPU Internal Structure

The most direct factor influencing rendering within a GPU chip is the memory hierarchy structure.

┌─────────────────────────────────────────────────────┐
│                     GPU Chip                         │
│                                                     │
│  ┌─────────┐ ┌─────────┐ ┌─────────┐              │
│  │ SM/CU 0 │ │ SM/CU 1 │ │ SM/CU N │   ...        │
│  │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │              │
│  │ │Reg  │ │ │ │Reg  │ │ │ │Reg  │ │  ← Registers │
│  │ │File │ │ │ │File │ │ │ │File │ │    (Fastest) │
│  │ └─────┘ │ │ └─────┘ │ │ └─────┘ │              │
│  │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │              │
│  │ │ L1$ │ │ │ │ L1$ │ │ │ │ L1$ │ │  ← L1 Cache  │
│  │ └─────┘ │ │ └─────┘ │ │ └─────┘ │              │
│  └─────────┘ └─────────┘ └─────────┘              │
│         │            │           │                  │
│  ┌──────────────────────────────────┐              │
│  │           L2 Cache              │  ← L2 Cache    │
│  └──────────────────────────────────┘              │
│                    │                                │
│  ┌──────────────────────────────────┐              │
│  │      Fixed-Function Units       │              │
│  │  (Rasterizer, ROP, TMU etc.)     │              │
│  └──────────────────────────────────┘              │
└────────────────────│────────────────────────────────┘
                     │
          ┌──────────────────────┐
          │    VRAM (Video Memory)│  ← Slowest
          │  Textures, Buffers, G-Buffer│
          └──────────────────────┘
| Memory Hierarchy | Capacity | Access Time | Purpose |
|---|---|---|---|
| Registers | ~256KB/SM | 1 cycle | Shader variables, immediate operations |
| Shared Memory (LDS) | 32~128KB/SM | ~5 cycles | Intra-workgroup data sharing |
| L1 Cache | 16~128KB/SM | ~20 cycles | Texture cache, instruction cache |
| L2 Cache | 2~6MB | ~200 cycles | Caching global memory accesses |
| VRAM | 4~24GB | ~400+ cycles | Textures, vertex buffers, render targets |

1-1. Cost of State Changes

Reconfiguring the GPU’s internal caches takes significant time. A state change means altering the GPU’s rendering settings, which forces those internal caches to be flushed and refilled.

flowchart LR
    A["Draw Call A<br/>Material: Wood<br/>Texture: wood.png"]
    B["State Change!<br/>Cache Flush → Refill<br/>⚡ Cost Incurred"]
    C["Draw Call B<br/>Material: Stone<br/>Texture: stone.png"]

    A --> B --> C
| State Change Type | Cost | Description |
|---|---|---|
| Shader program change | Very high | Full flush of the GPU pipeline |
| Render target change | High | Framebuffer swap |
| Texture binding change | Medium | Invalidates the TMU (Texture Mapping Unit) cache |
| Uniform/constant buffer change | Low | Small data transfer |
| Vertex buffer change | Medium | Input assembler reconfiguration |

The reason for atlas texturing, where textures are collected into a single large texture, is to minimize state changes. Frequently changing textures means continuously flushing and refilling the GPU cache memory. The GPU stores texture data in the cache using a Z-order curve (Morton code) pattern, which arranges adjacent texels close together in memory addresses to improve cache hit rates.

Texture Texel Layout (Memory Order)

Linear Layout (Inefficient):          Z-Order Layout (GPU Actual Method):
0  1  2  3                 0  1  4  5
4  5  6  7                 2  3  6  7
8  9  10 11                8  9  12 13
12 13 14 15                10 11 14 15

→ Accessing adjacent texels causes   → Adjacent texels are close in memory
  large memory-address jumps         → High cache hit rate
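As a sketch of how the Z-order index is derived, a 2D Morton index interleaves the bits of the texel coordinates. The helper below is illustrative (not any GPU API) and reproduces the 4x4 layout shown above:

```python
def morton_2d(x: int, y: int) -> int:
    """Interleave the bits of (x, y) into a single Z-order (Morton) index."""
    index = 0
    for bit in range(16):                            # enough for 16-bit coords
        index |= ((x >> bit) & 1) << (2 * bit)       # x bits -> even positions
        index |= ((y >> bit) & 1) << (2 * bit + 1)   # y bits -> odd positions
    return index

# Memory index of each texel in a 4x4 tile, printed row by row
for y in range(4):
    print([morton_2d(x, y) for x in range(4)])
# -> [0, 1, 4, 5] / [2, 3, 6, 7] / [8, 9, 12, 13] / [10, 11, 14, 15]
```

Texels that are neighbors in 2D end up with nearby Morton indices, which is exactly why a 2x2 bilinear fetch tends to hit the same cache line.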

Q. Why do Draw Calls become a performance bottleneck?

While a draw call is just a simple GPU command, the CPU must set up the rendering state (shaders, textures, buffers, etc.) for each draw call to the GPU. This CPU → GPU Command Transfer process is where the bottleneck occurs. To reduce draw calls, techniques like Batching, Instancing, and Indirect Draw are used. Low-level APIs such as DirectX 12, Vulkan, and Metal are designed to minimize this overhead.

Part 2: The Overall Structure of the Rendering Pipeline

2. Pipeline Overview

For a 3D object to become pixels on the screen, it goes through three major stages.

flowchart TB
    subgraph APP["1st Stage: Application (CPU)"]
        direction TB
        A1["Game Logic / Transform Update"]
        A2["Culling / Occlusion Culling"]
        A3["Draw Call Sorting / Batching"]
        A4["GPU Command Buffer Creation"]
        A1 --> A2 --> A3 --> A4
    end

    subgraph GEO["2nd Stage: Geometry Processing (GPU)"]
        direction TB
        B1["Input Assembly<br/>Vertex/Index Buffer Reading"]
        B2["Vertex Shader<br/>Coordinate Transformation (MVP)"]
        B3["Tessellation (Optional)<br/>Mesh Subdivision"]
        B4["Geometry Shader (Optional)<br/>Primitive Generation/Removal"]
        B5["Clipping<br/>Removing Triangles Outside Frustum"]
        B6["Screen Mapping<br/>NDC → Screen Coordinates"]
        B1 --> B2 --> B3 --> B4 --> B5 --> B6
    end

    subgraph RAS["3rd Stage: Rasterization + Pixel Processing (GPU)"]
        direction TB
        C1["Triangle Setup<br/>Edge Function Preparation"]
        C2["Triangle Traversal<br/>Generating Fragments Inside Triangle"]
        C3["Early-Z Test (Optional)<br/>Pre-Depth Testing"]
        C4["Fragment Shader<br/>Pixel Color Calculation"]
        C5["Late-Z / Depth Test<br/>Depth Comparison"]
        C6["Stencil Test<br/>Stencil Masking"]
        C7["Blending<br/>Color Mixing"]
        C8["Frame Buffer<br/>Final Output"]
        C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> C8
    end

    APP --> GEO --> RAS
| Stage | Execution Location | Programmer Control | Core Function |
|---|---|---|---|
| Application | CPU | Full control | Scene preparation, culling, draw calls |
| Geometry Processing | GPU | Vertex/Tessellation/Geometry shader | Coordinate transformation, clipping |
| Rasterization | GPU | Fixed function (no control) | Triangle → fragment conversion |
| Pixel Processing | GPU | Fragment shader | Final color determination |
| Output Merger | GPU | Controlled via settings (blend mode etc.) | Depth/stencil testing, blending |

2-1. Application Stage (CPU)

This is a preparatory step performed by the CPU before sending draw calls to the GPU.

Culling

It involves filtering out objects that are not visible on the screen in advance to reduce GPU load.

Camera Frustum
                    Far Plane
              ┌───────────────────┐
             ╱│                   │╲
            ╱ │     Visible Area   │ ╲
           ╱  │                   │  ╲
Near Plane╱   │    ● Object A     │   ╲
 ┌───────┐    │    (Rendering!)   │    ╲
 │Camera │    │                   │     ╲
 │ ◉────→│    │                   │      ╲
 └───────┘    └───────────────────┘
           ╲                           ╱
            ╲   ○ Object B             ╱
             ╲  (Culled! Not sent to GPU) ╱
              ╲                     ╱
| Culling Type | Execution Location | Method |
|---|---|---|
| Frustum Culling | CPU | Check whether the object’s bounding volume is inside the frustum |
| Occlusion Culling | CPU/GPU | Exclude objects fully occluded by other objects |
| Backface Culling | GPU | Remove triangles facing away from the camera |
| Small Triangle Culling | GPU | Remove triangles smaller than one pixel on screen |

Draw Call Optimization

To minimize state changes, group objects using the same material together for rendering.

Before Sorting (4 state changes):
  Draw(ShaderA, Texture1) → Draw(ShaderB, Texture2) → Draw(ShaderA, Texture1)
  → Draw(ShaderB, Texture3) → Draw(ShaderA, Texture1)

After Sorting (2 state changes):
  Draw(ShaderA, Texture1) × 3 → Draw(ShaderB, Texture2) → Draw(ShaderB, Texture3)
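The sorting idea can be sketched in a few lines of Python. The (shader, texture) pairs are the hypothetical state keys from the diagram, and only transitions between consecutive draws are counted (the first bind is excluded):

```python
# Hypothetical draw queue: (shader, texture) pairs act as the state key
draws = [("ShaderA", "Texture1"), ("ShaderB", "Texture2"),
         ("ShaderA", "Texture1"), ("ShaderB", "Texture3"),
         ("ShaderA", "Texture1")]

def count_state_changes(queue):
    """Count draws whose state differs from the previous draw."""
    return sum(1 for prev, cur in zip(queue, queue[1:]) if prev != cur)

print(count_state_changes(draws))           # 4 changes in submission order
print(count_state_changes(sorted(draws)))   # 2 changes after sorting by state
```

Real engines sort on a packed key (shader, material, depth, etc.), but the principle is the same: identical states end up adjacent, so fewer rebinds occur.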

Part 3: Coordinate Systems and Transformation Pipeline

This part delves into the most mathematical and crucial aspect of the rendering pipeline. It explores how 3D object vertices are transformed into screen pixels mathematically.

3. Types of Coordinate Systems

Left-Handed/Right-Handed Coordinate Systems

Left-Handed Coordinate System (DirectX, Unity)        Right-Handed Coordinate System (OpenGL)

      Y ↑                              Y ↑
      │                                │
      │                                │
      │                                │
      └──────→ X                       └──────→ X
     ╱                                ╱
    ╱                                ╱
   Z (into the screen)                Z (out of the screen)

* Point the thumb (X), index finger (Y), and middle finger (Z) so they are mutually perpendicular:
  the left hand gives a left-handed coordinate system, the right hand a right-handed one.
| Property | Left-Handed Coordinate System | Right-Handed Coordinate System |
|---|---|---|
| Engine/World | Unity (Y-up), Unreal (Z-up) | Godot (Y-up) |
| Z-axis direction | +Z into the screen | +Z out of the screen |
| Positive rotation direction | Clockwise (viewed from the axis tip) | Counter-clockwise (viewed from the axis tip) |
| Cross product | Left-hand rule | Right-hand rule |
| Camera forward | +Z (Unity), +X (Unreal) | -Z |

API conventions and an engine’s world coordinate system are separate things. OpenGL traditionally uses a right-handed convention up to view space, while Vulkan’s NDC has the Y-axis pointing down (+Y = bottom) and a Z range of [0, 1], unlike OpenGL. The API defines the clip-space/NDC conventions; the handedness of world space is decided by the engine. DirectX, for example, does not enforce a handedness at the API level — the DirectXMath library provides both left-handed (LH) and right-handed (RH) variants of its functions, with the left-handed ones used by convention.

Key Point: The cross product formula remains the same regardless of the coordinate system. However, the direction of the resulting vector can be opposite depending on the coordinate system. This means that normal vectors may point in the opposite direction based on the coordinate system, which is the root cause of texture maps appearing flipped when shaders are ported between engines.

Other Coordinate Systems

| Coordinate System | Purpose |
|---|---|
| Spherical | BRDF (Bidirectional Reflectance Distribution Function), environment mapping |
| Spherical Harmonics | Indirect lighting approximation, light probes |
| Cylindrical | Panorama projection, special UV mapping |
| Texture Coordinates (UV) | Texture mapping (the origin position varies by API!) |

Differences in the Origin of Texture Coordinates often cause issues in practical applications.

OpenGL / Unity:              DirectX (when loading a texture):
Origin = bottom left         Origin = top left

(0,1) ──── (1,1)            (0,0) ──── (1,0)
  │          │                │          │
  │          │     ↕ Flipped    │          │
  │          │                │          │
(0,0) ──── (1,0)            (0,1) ──── (1,1)

A texture appearing upside down usually stems from this coordinate-system difference. Unity, however, often corrects platform differences automatically: when _MainTex_TexelSize.y is negative, the texture is flipped, and for post-processing or render textures you may need to flip the UV coordinates manually.

3-1. Homogeneous Coordinates and Affine Transformations

In 3D graphics, homogeneous coordinates are used so that every transformation can be expressed as a single matrix multiplication.

Problem: Inconsistency in Transformation Operations

Scaling (Scale): Expressed as multiplication → x' = s · x
Rotation (Rotation): Expressed as multiplication → x' = R · x
Translation (Translation): Expressed as addition → x' = x + t  ← Problem!

Only translation is expressed as an addition, which makes it impossible to combine multiple transformations into a single matrix chain.

Solution: Increase the Dimension by One

Extend 2D coordinates (x, y) to 3D homogeneous coordinates (x, y, 1), and 3D coordinates (x, y, z) to 4D homogeneous coordinates (x, y, z, 1).

\[\text{2D Translation:} \quad \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]

Now translation, too, is a matrix multiplication. This is known as an affine transformation.

\[x' = ax + by + c \quad \text{(Linear Transformation + Constant Term)}\] \[y' = dx + ey + f\]
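A minimal numeric check of the idea, using plain Python lists as matrices (the helper name and the example point are illustrative):

```python
def mat_vec(M, v):
    """3x3 matrix times a 3-component homogeneous vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

# Translation by (5, -2) expressed as a single matrix multiplication
T = [[1, 0, 5],
     [0, 1, -2],
     [0, 0, 1]]
p = [3, 4, 1]           # the point (3, 4) in homogeneous coordinates
print(mat_vec(T, p))    # [8, 2, 1] -> the point (8, 2)
```

Because translation is now a multiplication, any chain of scale, rotation, and translation collapses into one matrix that can be applied to every vertex.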

Classification of Transformations

flowchart TB
    A["General Transformation<br/>Lines are not preserved<br/>(Fish Eye, Barrel Distortion)"]
    B["Projective Transformation (Projective)<br/>Lines are preserved, parallel lines are not<br/>(Perspective Projection)"]
    C["Affine Transformation (Affine)<br/>Parallel lines are preserved<br/>(Orthographic Projection)"]
    D["Linear Transformation (Linear)<br/>Origin is preserved<br/>(Scale, Rotation, Shear)"]
    E["Similarity Transformation (Similarity)<br/>Angles are preserved<br/>(Uniform Scale + Rotation)"]
    F["Rigid/Euclidean Transformation (Rigid/Euclidean)<br/>Lengths and angles are preserved<br/>(Translation + Rotation)"]

    A --> B --> C --> D --> E --> F
| Transformation | Preserved | 4x4 Matrix Characteristics | Example in Games |
|---|---|---|---|
| Rigid | Length, angle | Orthogonal rotation + translation | Transform’s Position/Rotation |
| Similarity | Angle | Adds uniform scaling | Uniform Scale |
| Affine | Parallel lines | First-degree polynomial (linear + constant) | Non-uniform scale, shear, orthographic projection |
| Projective | Lines | Bottom row is not (0, 0, 0, 1) | Perspective projection |
| General | None | Non-linear | VR barrel distortion, fisheye lens |

Special Properties of Rotation Matrices

A rotation matrix $R$ is an orthogonal matrix, meaning:

\[R^{-1} = R^{T}\]

This implies that the inverse and transpose are equivalent, so to find the inverse rotation, you can simply take the transpose without performing a costly inversion. Additionally, rotating in the opposite direction is achieved by negating the angle.

\[R(-\theta) = R(\theta)^{-1} = R(\theta)^{T}\]

All three operations yield the same result.
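These identities are easy to verify numerically. The sketch below builds a Z-axis rotation with plain Python lists and checks both properties:

```python
import math

def rot_z(theta):
    """3x3 rotation about the Z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def mat_mul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

theta = math.radians(30)

# R * R^T is the identity, so the transpose really is the inverse
RRt = mat_mul(rot_z(theta), transpose(rot_z(theta)))
assert all(abs(RRt[i][j] - (1 if i == j else 0)) < 1e-12
           for i in range(3) for j in range(3))

# R(-theta) equals R(theta)^T element by element
assert all(abs(rot_z(-theta)[i][j] - transpose(rot_z(theta))[i][j]) < 1e-12
           for i in range(3) for j in range(3))
```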

3-2. Transformation Pipeline

This is the sequence of coordinate transformations for a 3D vertex to become a screen pixel.

flowchart LR
    A["<b>Model Space</b><br/>Local coordinates of the model<br/>(Origin: Model center)"]
    B["<b>World Space</b><br/>Absolute coordinates in the scene<br/>(Origin: World center)"]
    C["<b>View Space</b><br/>Coordinates based on the camera<br/>(Origin: Camera position)"]
    D["<b>Clip Space</b><br/>Homogeneous coordinates<br/>(-w ~ +w range)"]
    E["<b>NDC</b><br/>Normalized Device Coordinates<br/>(-1 ~ +1 range)"]
    F["<b>Screen Space</b><br/>Pixel coordinates<br/>(0,0 ~ width,height)"]

    A -->|"Model (World)<br/>Matrix"| B
    B -->|"View<br/>Matrix"| C
    C -->|"Projection<br/>Matrix"| D
    D -->|"Perspective<br/>Division (÷w)"| E
    E -->|"Viewport<br/>Transform"| F

World Transformation (Model Matrix)

The standard order of SRT is: Scale → Rotation → Translation

\[M_{world} = T \cdot R \cdot S\]
Note the order of matrix multiplication!

DirectX (Row-major, row vectors):
  v' = v × S × R × T    ← The transformation and multiplication orders are the same.

OpenGL (Column-major, column vectors):
  v' = T × R × S × v    ← The transformation and multiplication orders are reversed!

The core reason for this difference is not related to memory layout (row/column-major) but rather how vectors are multiplied with matrices (row vector convention vs. column vector convention). In DirectX/HLSL, vectors are placed on the left (v × M), while in OpenGL/GLSL, they are placed on the right (M × v). Under row vector convention, the transformation and multiplication orders are the same; under column vector convention, they are reversed. Memory layout (row/column-major) refers to how a matrix is stored in memory and is independent of the order of multiplication.
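As a sketch of the convention difference (pure Python, matrices as nested lists; not engine code), the same scale-then-translate transform is written both ways and lands on the same point:

```python
def mat_mul(A, B):
    """Multiply matrices stored as nested lists (rows of A times columns of B)."""
    return [[sum(A[r][i] * B[i][c] for i in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def transpose(M):
    return [list(row) for row in zip(*M)]

# 2D homogeneous scale-by-2 then translate-by-(5, 0), written in column-vector form
S = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
T = [[1, 0, 5], [0, 1, 0], [0, 0, 1]]

# Column-vector convention (OpenGL style): p' = T * S * p (order reversed)
p_col = [[1], [1], [1]]
col_result = mat_mul(mat_mul(T, S), p_col)

# Row-vector convention (DirectX style): p' = p * S * T (order as applied),
# using the transposed matrices
p_row = [[1, 1, 1]]
row_result = mat_mul(mat_mul(p_row, transpose(S)), transpose(T))

print(col_result, row_result)   # the same point (7, 2) either way
```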

Importance of Transformation Order

Matrix multiplication does not obey the commutative law:

1. Rotate 90°, then move along X       2. Move along X, then rotate 90°

    ●                                      ●
    │ Rotation                             │ Movement
    ▼                                      ▼
    ● →→→→ ●                               ● →→→→ ●
       Movement                                   │ Rotation
                                                  ▼
                                                  ●

→ Different results!

View Transformation (View Matrix)

Camera transformation is the inverse of the world transformation of the camera. Moving the camera to the origin is equivalent to moving the entire world in the opposite direction of the camera.

\[V = (R \cdot T)^{-1} = T^{-1} \cdot R^{-1} = T^{-1} \cdot R^{T}\]

Using the property that the inverse and transpose of a rotation matrix are the same, we can compute this without expensive inversion operations.
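A minimal 2D homogeneous sketch (helper names are illustrative) showing that the view matrix built from the transposed rotation and the negated translation exactly undoes the camera’s world transform:

```python
import math

def mat_mul(A, B):
    return [[sum(A[r][i] * B[i][c] for i in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

theta = math.radians(40)
c, s = math.cos(theta), math.sin(theta)

# Camera world transform in 2D homogeneous form: rotate R, then translate T
R = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
T = [[1, 0, 3], [0, 1, 7], [0, 0, 1]]
W = mat_mul(T, R)

# View matrix without a general inverse: V = (T*R)^-1 = R^-1 * T^-1 = R^T * T^-1
R_T   = [[c, s, 0], [-s, c, 0], [0, 0, 1]]   # transpose of R
T_inv = [[1, 0, -3], [0, 1, -7], [0, 0, 1]]  # negate the translation
V = mat_mul(R_T, T_inv)

# V must undo W exactly: V * W == identity
VW = mat_mul(V, W)
assert all(abs(VW[i][j] - (1 if i == j else 0)) < 1e-12
           for i in range(3) for j in range(3))
```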

Euler Angles and Gimbal Lock

When expressing rotations using three axis angles, it is called Euler Angles.

| Axis | Name | Description |
|---|---|---|
| X-axis | Pitch | Nodding the head up and down |
| Y-axis | Yaw | Shaking the head left and right |
| Z-axis | Roll | Tilting the head side to side |

Unity’s rotation order: Z → X → Y (chosen to make gimbal lock less likely in typical use, not to eliminate it)

Gimbal Lock Problem: When rotating using Euler angles, if two axes align, one degree of freedom is lost.

Normal state (3 degrees of freedom):      Gimbal lock (2 degrees of freedom):

  ┌─── Y-axis gimbal ───┐                 ┌─── Y-axis gimbal ───┐
  │ ┌─ X-axis gimbal ─┐ │                 │                     │
  │ │ ┌── Z-axis ──┐  │ │   X rotated     │  ┌── X+Z axis ──┐   │
  │ │ │     ●      │  │ │   by 90° →      │  │      ●       │   │
  │ │ └────────────┘  │ │                 │  └──────────────┘   │
  │ └─────────────────┘ │                 │                     │
  └─────────────────────┘                 └─────────────────────┘
                                  X-axis and Z-axis lie in the same plane!
                                  → One degree of freedom is lost.

This is why quaternions are needed. Just as complex numbers represent rotation in 2D, quaternions represent rotation in 3D without gimbal lock and support smooth interpolation (slerp).

3-3. Projection Transformation

The projection transformation is the most mathematically complex part of the pipeline.

Orthographic vs Perspective Projection

Orthographic Projection                     Perspective Projection

  ┌─────────┐  Far              ╲           ╱  Far
  │         │                    ╲         ╱
  │         │                     ╲       ╱
  │  Visible│                     ╲ Visible╱
  │  Area   │                      ╲ Area ╱
  │         │                       ╲   ╱
  │         │                        ╲ ╱
  └─────────┘  Near             ◉ Near  ← Camera

  → Size Constant                    → Nearby objects are larger
  → Parallel Lines Preserved        → Convergence of parallel lines at vanishing point
| Characteristic | Orthographic Projection | Perspective Projection |
|---|---|---|
| Transformation type | Affine transformation | Projective transformation |
| Parallel lines | Preserved | Not preserved (converge) |
| Size | Distance-independent | Inversely proportional to distance |
| Use case | 2D games, UI, architectural views | 3D games, realistic perspective |

The Core of Perspective Projection: w Division

To convert the view frustum into an NDC (Normalized Device Coordinates) cube.

View Frustum                  NDC Cube

     Far Plane                 ┌────────┐ (+1,+1,+1)
  ┌─────────────┐              │        │
  │             │              │        │
  │             │     →→→      │        │
  │             │   Projection Transform   │        │
  ╲           ╱               │        │
   ╲         ╱                └────────┘ (-1,-1,-1)
    ╲  Near ╱                   or (0 ~ 1) for DirectX
     └─────┘
      Camera

Applying the perspective projection matrix results in Clip Space coordinates. The key here is that the w component contains the original z value (depth).

\[\begin{bmatrix} x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{bmatrix} = P \cdot \begin{bmatrix} x_{view} \\ y_{view} \\ z_{view} \\ 1 \end{bmatrix} \quad \text{where } w_{clip} = \pm z_{view} \text{ (sign depends on coordinate system convention)}\]

Sign of $w_{clip}$: In a left-handed coordinate system (DirectX, Unity), if the camera looks along +Z, then $w_{clip} = z_{view}$; in a right-handed coordinate system (OpenGL), if the camera looks along -Z, then $w_{clip} = -z_{view}$. The important point is that $w_{clip}$ is proportional to positive view distance.

During Perspective Division, all components are divided by w.

\[\begin{bmatrix} x_{ndc} \\ y_{ndc} \\ z_{ndc} \end{bmatrix} = \begin{bmatrix} x_{clip} / w_{clip} \\ y_{clip} / w_{clip} \\ z_{clip} / w_{clip} \end{bmatrix}\]

Since $w_{clip}$ is proportional to the view distance, objects farther away are divided by a larger value, making their screen coordinates smaller. This is the mathematical principle behind perspective.
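A toy projection makes this concrete. It assumes a left-handed view space looking down +Z, so $w_{clip} = z_{view}$; the function and the FOV value are illustrative, not any API’s:

```python
import math

def perspective_project(x, y, z, fov_deg=60.0):
    """Toy projection: scale x, y by the focal length, then divide by w = z."""
    f = 1.0 / math.tan(math.radians(fov_deg) / 2)   # focal length from the FOV
    x_clip, y_clip, w_clip = f * x, f * y, z        # w carries the view depth
    return x_clip / w_clip, y_clip / w_clip         # perspective division

near_pt = perspective_project(1.0, 1.0, 2.0)    # same lateral offset, depth 2
far_pt  = perspective_project(1.0, 1.0, 10.0)   # same lateral offset, depth 10
print(near_pt, far_pt)   # the farther point lands much closer to screen center
```

Two points with the same lateral offset end up at different screen positions purely because of the divide by depth — that division is the perspective.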

API Differences in NDC Range

| API | X, Y Range | Z Range |
|---|---|---|
| OpenGL | -1 ~ +1 | -1 ~ +1 |
| DirectX | -1 ~ +1 | 0 ~ +1 |
| Vulkan | -1 ~ +1 | 0 ~ +1 |
| Metal | -1 ~ +1 | 0 ~ +1 |

Depth Precision Issue with Z Buffer

After perspective projection, z values are distributed non-linearly: there is far more precision near the camera and much less at distance. With Near = 0.1 and Far = 1000, for example, roughly 90% of the z-buffer’s precision is consumed within 10 units of the camera. This causes Z-fighting, where distant overlapping surfaces flicker. A common solution is Reversed-Z (Near = 1, Far = 0): the floating-point exponent distribution, which is densest near 0, then counteracts the projective non-linearity, which is densest near the camera, greatly improving precision at distance. The result is not perfectly uniform, but Z-fighting is dramatically reduced compared to the traditional mapping, and Unity HDRP and UE5 use Reversed-Z by default.
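The non-linearity can be checked numerically. The formula below is the standard (non-reversed) DirectX-style [0, 1] depth mapping; Near = 0.1 and Far = 1000 follow the example above:

```python
def ndc_depth(z, near=0.1, far=1000.0):
    """Standard (non-reversed) [0, 1] depth after perspective projection."""
    return (far * (z - near)) / (z * (far - near))

for z in (0.1, 1.0, 10.0, 100.0, 1000.0):
    print(f"view z = {z:7.1f} -> depth value = {ndc_depth(z):.6f}")
```

The first ten view units already climb past a depth value of 0.99, leaving almost none of the [0, 1] range for the remaining 990 units — which is exactly where Z-fighting appears.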

3-4. Normal Vector Transformation

Coordinate points (Point) and direction vectors (Direction) require different transformations. Especially for normal vectors, extra caution is needed.

Uniform scaling does not pose a problem, but applying the model matrix directly to a normal vector in non-uniform scaling will make it no longer perpendicular to the surface.

Uniform Scaling (S=2,2):          Non-Uniform Scaling (Sx=2, Sy=1):

      N                           N (original)    N' (incorrect)
      ↑                           ↑              ↗
  ┌───┼───┐                   ┌───┼───────────┐
  │   │   │   → Scale ×2 →   │   │            │
  └───┴───┘                   └───┴────────────┘

  N direction OK!             N is no longer perpendicular to the surface!

Solution: Normal vectors should be transformed using the inverse transpose of the model matrix.

\[N' = (M^{-1})^{T} \cdot N\]

Proof: For two vectors to be perpendicular, their dot product must be zero. To maintain the relationship $T^{T} \cdot N = 0$ after transformation (with $G$ the sought normal transform), we need:

\[T'^{T} \cdot N' = (M \cdot T)^{T} \cdot (G \cdot N) = T^{T} \cdot M^{T} \cdot G \cdot N = 0\]

Since $T^{T} \cdot N = 0$, this holds true when $M^{T} \cdot G = I$. Therefore, $G = (M^{T})^{-1} = (M^{-1})^{T}$.
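The proof can be verified numerically. The 2D sketch below uses a diagonal scale matrix for simplicity, so its inverse transpose is just the reciprocal of each scale factor:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Non-uniform scale M = diag(2, 1); tangent and normal of a 45-degree slope
scale   = [2.0, 1.0]        # diagonal entries of M
tangent = [1.0, 1.0]
normal  = [1.0, -1.0]       # perpendicular to the tangent: dot == 0
assert dot(tangent, normal) == 0.0

t_prime = [scale[0] * tangent[0], scale[1] * tangent[1]]   # M * T = (2, 1)

# Wrong: transform the normal with M itself
n_wrong = [scale[0] * normal[0], scale[1] * normal[1]]     # (2, -1)
print(dot(t_prime, n_wrong))    # 3.0 -> no longer perpendicular!

# Right: use the inverse transpose (for a diagonal M, simply 1/scale)
inv_t   = [1.0 / scale[0], 1.0 / scale[1]]
n_right = [inv_t[0] * normal[0], inv_t[1] * normal[1]]     # (0.5, -1)
print(dot(t_prime, n_right))    # 0.0 -> still perpendicular
```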

Part 4: Advanced Rasterization

Rasterization is handled by the GPU’s fixed-function hardware. While it cannot be programmed directly, understanding its operation is crucial for optimization.

4. What the Rasterizer Does

flowchart LR
    A["Three Triangle Vertices<br/>(Screen Coordinates)"] --> B["Triangle Setup<br/>Edge Function Calculation"]
    B --> C["Triangle Traversal<br/>Internal Pixel Iteration"]
    C --> D["Attribute Interpolation<br/>Barycentric Coordinate Interpolation"]
    D --> E["Fragment Generation<br/>(Interpolated Data from VS)"]

Edge Functions and Triangle Interior Determination

The three sides of a triangle are defined as edge functions (Edge Function). A point is inside the triangle if its signs for all three edge functions are the same.

\[E_{01}(P) = (P_x - V_0.x)(V_1.y - V_0.y) - (P_y - V_0.y)(V_1.x - V_0.x)\]
      V2
     ╱  ╲
    ╱ +  ╲        E01(P) > 0  ✓
   ╱  +   ╲       E12(P) > 0  ✓
  ╱  + P +  ╲     E20(P) > 0  ✓
 ╱ +   +   + ╲    → P is inside the triangle!
V0 ──────────── V1

Barycentric Coordinate Interpolation

At each pixel within a triangle, data from vertex shader outputs (such as UV, Normal, Color) are interpolated based on the values at the three vertices.

\[\text{Attr}(P) = \alpha \cdot \text{Attr}(V_0) + \beta \cdot \text{Attr}(V_1) + \gamma \cdot \text{Attr}(V_2)\] \[\alpha + \beta + \gamma = 1\]

Here, $\alpha$, $\beta$, and $\gamma$ represent the weights indicating how close a point is to each vertex within the triangle.
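Both ideas fit in a few lines. The edge function matches the formula above, and the barycentric weights fall out for free: each is an edge value divided by their sum (twice the signed triangle area). A sketch, not production rasterizer code:

```python
def edge(v0, v1, p):
    """Edge function from the text; same sign on all three edges -> inside."""
    return (p[0] - v0[0]) * (v1[1] - v0[1]) - (p[1] - v0[1]) * (v1[0] - v0[0])

V0, V1, V2 = (0.0, 0.0), (4.0, 0.0), (2.0, 3.0)
P = (2.0, 1.0)                  # this happens to be the centroid

e01, e12, e20 = edge(V0, V1, P), edge(V1, V2, P), edge(V2, V0, P)
inside = (min(e01, e12, e20) >= 0) or (max(e01, e12, e20) <= 0)
print(inside)                   # True

# Barycentric weights: each edge value over the sum (twice the signed area)
area2 = e01 + e12 + e20
alpha, beta, gamma = e12 / area2, e20 / area2, e01 / area2  # weights of V0, V1, V2
print(alpha, beta, gamma)       # 1/3 each at the centroid

# Interpolate a per-vertex attribute, e.g. the red channel of vertex colors
red = alpha * 1.0 + beta * 0.0 + gamma * 0.5
```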

Perspective-Correct Interpolation

For interpolation after perspective projection, perspective correction is essential. Simple linear interpolation in screen space distorts textures.

No Correction (Affine):              With Correction (Perspective-Correct):

  ┌──────────────┐              ┌──────────────┐
  │ ╲  ╲  ╲  ╲  │              │╲  ╲   ╲    ╲ │
  │  ╲  ╲  ╲  ╲ │              │ ╲  ╲   ╲    ╲│
  │   ╲  ╲  ╲  ╲│              │  ╲  ╲    ╲   │
  │    ╲  ╲  ╲  │              │   ╲   ╲    ╲ │
  └──────────────┘              └──────────────┘
  Textures are uniformly spaced          Spacing varies with perspective
  (PS1 Era Graphics)                 (Modern Hardware Baseline)

The corrected interpolation formula is:

\[\frac{\text{Attr}}{w} = \alpha \cdot \frac{\text{Attr}(V_0)}{w_0} + \beta \cdot \frac{\text{Attr}(V_1)}{w_1} + \gamma \cdot \frac{\text{Attr}(V_2)}{w_2}\] \[\frac{1}{w} = \alpha \cdot \frac{1}{w_0} + \beta \cdot \frac{1}{w_1} + \gamma \cdot \frac{1}{w_2}\]

The final attribute value is obtained as $\text{Attr} = \frac{\text{Attr}/w}{1/w}$.
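A one-dimensional sketch of the two schemes along a single triangle edge, with hypothetical values u = 0 at the near vertex (w = 1) and u = 1 at the far vertex (w = 3):

```python
def affine_interp(a0, a1, t):
    """Plain screen-space lerp (what uncorrected hardware effectively did)."""
    return (1 - t) * a0 + t * a1

def perspective_interp(a0, w0, a1, w1, t):
    """Interpolate attr/w and 1/w separately, then divide back."""
    attr_over_w = (1 - t) * (a0 / w0) + t * (a1 / w1)
    one_over_w  = (1 - t) * (1 / w0) + t * (1 / w1)
    return attr_over_w / one_over_w

# Sample the screen-space midpoint t = 0.5 of the edge
print(affine_interp(0.0, 1.0, 0.5))                  # 0.5  -> distorted
print(perspective_interp(0.0, 1.0, 1.0, 3.0, 0.5))   # 0.25 -> correct
```

The corrected value 0.25 reflects that the far half of the edge covers fewer screen pixels, so the UVs must advance faster there — exactly what the $\text{Attr}/w$ and $1/w$ interpolation achieves.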

The texture flickering and warping on the original PlayStation (PS1) was caused by the absence of this correction: the PS1’s GTE (Geometry Transform Engine) could not perform perspective correction due to hardware limitations.

4-1. Z-Buffering and Depth Testing

Z-Buffer (Depth Buffer) stores the depth value of the closest surface rendered so far at each pixel location on the screen.

Z-Buffer Operation:

Fragment A (z=0.3) arrives → Z-Buffer[x,y] = 1.0 (initial value)
  0.3 < 1.0 → Pass! Z-Buffer[x,y] is updated to 0.3, and color of A is recorded in Color Buffer

Fragment B (z=0.5) arrives → Z-Buffer[x,y] = 0.3
  0.5 > 0.3 → Fail! (B is behind A, so discarded)

Fragment C (z=0.1) arrives → Z-Buffer[x,y] = 0.3
  0.1 < 0.3 → Pass! Z-Buffer[x,y] is updated to 0.1, and color of C is recorded in Color Buffer
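The walkthrough above can be simulated in a few lines — a 1x1 “buffer” and a LESS depth comparison, for illustration only:

```python
def depth_test(zbuffer, colorbuffer, x, y, frag_z, frag_color):
    """LESS-style depth test: keep the fragment only if it is closer."""
    if frag_z < zbuffer[y][x]:
        zbuffer[y][x] = frag_z
        colorbuffer[y][x] = frag_color
        return True
    return False

zbuf = [[1.0]]      # 1x1 depth buffer, cleared to the far plane
cbuf = [[None]]     # matching color buffer
print(depth_test(zbuf, cbuf, 0, 0, 0.3, "A"))   # True  -> A is recorded
print(depth_test(zbuf, cbuf, 0, 0, 0.5, "B"))   # False -> B is behind A
print(depth_test(zbuf, cbuf, 0, 0, 0.1, "C"))   # True  -> C overwrites A
print(cbuf[0][0], zbuf[0][0])                   # C 0.1
```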
| Z-Buffer Bit Size | Precision | Typical Use |
|---|---|---|
| 16-bit | 65,536 steps | Mobile (memory saving) |
| 24-bit | 16,777,216 steps | Most games (standard) |
| 32-bit float | Very high precision | Large worlds, Reversed-Z |

Z-Fighting Phenomenon

When two surfaces lie closer together than the Z-buffer can resolve, the depth-test winner can change from frame to frame, causing a flickering effect.

Z-fighting Occurrence:
  Surface A: z = 0.500001
  Surface B: z = 0.500002
  → Indistinguishable within Z-Buffer precision → A and B are alternately visible each frame

Solutions:
1. Reduce Near/Far ratio (increase the Near value)
2. Use Polygon Offset (glPolygonOffset / Depth Bias)
3. Use Reversed-Z (improve precision for distant objects)
4. Use a logarithmic depth buffer

4-2. Anti-Aliasing

Rasterization maps continuous triangles onto a discrete pixel grid, inevitably leading to aliasing.

| Technique | Principle | Advantages | Disadvantages |
|---|---|---|---|
| MSAA (Multi-Sample Anti-Aliasing) | Multiple coverage/depth samples per pixel; the shader runs once per pixel | High quality, effective at triangle edges | High memory/bandwidth cost, impractical with deferred rendering |
| FXAA (Fast Approximate Anti-Aliasing) | Post-process edge detection followed by blurring | Fast and lightweight | Slightly blurs textures |
| TAA (Temporal Anti-Aliasing) | Per-frame subpixel jitter + history blending | Temporally stable, also fixes shader aliasing | Ghosting (trailing), motion blur |
| DLSS/FSR (Deep Learning Super Sampling / FidelityFX Super Resolution) | Upscaling (DLSS uses an AI model) | High quality from a lower render resolution | DLSS requires dedicated hardware; FSR trades some quality |

MSAA (Multi-Sample Anti-Aliasing) Operation:

Without MSAA (1x):              With MSAA 4x:

  ┌─┬─┬─┬─┐                  ┌─┬─┬─┬─┐
  │○│○│○│ │                  │◉│◑│◔│ │  ← Edge pixels are partially covered
  ├─┼─┼─┼─┤                  ├─┼─┼─┼─┤    → Intermediate color blending
  │○│○│ │ │                  │◉│◉│◔│ │
  ├─┼─┼─┤ │                  ├─┼─┼─┤ │
  │○│ │ │ │                  │◑│◔│ │ │
  └─┴─┴─┴─┘                  └─┴─┴─┴─┘
  Staircase is clear         Edges are smooth

  ○ = Fully covered           ◉ = 4/4 coverage (100%)
                             ◑ = 2/4 coverage (50% blend)
                             ◔ = 1/4 coverage (25% blend)

Part 5: Rendering Architecture

Even within the same pipeline, the rendering architecture can vary significantly based on when and how lighting calculations are performed. This is a crucial decision that defines the performance characteristics of a game project.

5. Forward Rendering

This is the oldest and most basic approach. It performs lighting calculations simultaneously while rendering each object.

flowchart LR
    subgraph Per_Object["Object-by-Object"]
        A["Vertex<br/>Shader"] --> B["Rasterize"] --> C["Fragment Shader<br/>Texture + Lighting Calculation<br/>(All Lights Once)"]
    end
    C --> D["Frame Buffer<br/>(Immediate Recording)"]
Forward Rendering's Drawbacks — Overdraw:

Number of Fragment Shader Executions = Number of Objects × Number of Lights × Overlapping Pixels

Example: 3 objects overlapping in the center of the screen (4 lights)
  → The Fragment Shader runs 3 × 4 = 12 times at the same pixel!
  → Only the front object is visible, but all back objects are fully calculated — a waste.
| Advantages | Disadvantages |
|---|---|
| Simple, intuitive implementation | Performance degrades quickly with many lights (O(objects × lights)) |
| Handles transparent objects naturally | Wasted shading under overdraw |
| Supports MSAA | Realistically requires limiting the light count |
| Low memory usage | |
| Essential for VR (stereo rendering) | |

5-1. Deferred Rendering

Core Idea: Shading (lighting calculations) is deferred until later. First, only geometric information is stored in the G-Buffer, and then lighting calculations are performed for only the pixels that are visible on the screen.

flowchart LR
    subgraph Pass1["Pass 1: Geometry Pass"]
        A["All objects<br/>rendered"] --> B["Geometric info saved to<br/>G-Buffer"]
    end

    subgraph GBuf["G-Buffer (multiple render targets)"]
        G1["Albedo (RGB)"]
        G2["Normal (RGB)"]
        G3["Depth (R)"]
        G4["Roughness/Metallic"]
    end

    subgraph Pass2["Pass 2: Lighting Pass"]
        C["Full-screen Quad<br/>Read from G-Buffer per pixel"] --> D["Lighting calculations<br/>(only visible pixels!)"]
    end

    Pass1 --> GBuf --> Pass2
    D --> E["Frame Buffer"]
Example of G-Buffer setup (4 render targets):

RT0 (Albedo + Alpha):       RT1 (Normal):
┌────────────────────┐      ┌────────────────────┐
│ R │ G │ B │ A      │      │ Nx │ Ny │ Nz │ -   │
│ Surface base color │      │ World normal vector│
└────────────────────┘      └────────────────────┘

RT2 (Motion + Specular):    Depth Buffer:
┌────────────────────┐      ┌────────────────────┐
│ Mx │ My │ Spec │ - │      │ Depth (24 bit)     │
│ Motion vec │ Spec  │      │ Depth value        │
└────────────────────┘      └────────────────────┘

Why is it faster?

Forward: Lighting calculations for every fragment (including those not visible)
  Pixel X with three objects overlapping → 3 lighting calculations

Deferred: Only perform lighting calculations for fragments that are visible on the screen
  Pixel X with three objects overlapping → only the frontmost object remains in G-Buffer
  → 1 lighting calculation!
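The comparison above can be written as a tiny cost model (counting shading work per screen pixel; an illustration, not a profiler):

```python
def forward_shading_ops(objects_per_pixel, lights):
    """Forward: every overlapping fragment is lit by every light."""
    return objects_per_pixel * lights

def deferred_shading_ops(objects_per_pixel, lights):
    """Deferred: one G-Buffer write per overlapping fragment,
    then lighting only for the single surviving (frontmost) fragment."""
    gbuffer_writes = objects_per_pixel
    lighting = 1 * lights
    return gbuffer_writes + lighting

# The example from the text: 3 overlapping objects, 4 lights
print(forward_shading_ops(3, 4))   # 12
print(deferred_shading_ops(3, 4))  # 7
```

The gap widens quickly: with 100 lights the same pixel costs 300 forward evaluations versus 103 deferred operations.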
| Advantages | Disadvantages |
| --- | --- |
| Robust to light count (O(pixels × lights)) | High memory usage due to the G-Buffer |
| Shades only visible pixels → efficient | No support for transparent objects (requires a separate forward pass) |
| Light culling possible via light volumes | MSAA incompatible (typically replaced by TAA) |
| | Bandwidth bottleneck in the G-Buffer at high resolutions |
| | Difficult to use on mobile due to memory/bandwidth constraints |

Why don’t transparent objects work?: The G-Buffer stores only one geometric piece of information per pixel. Transparent objects require information from the object behind them, but this is lost when only the frontmost object remains in the G-Buffer. Therefore, deferred engines handle transparency using a hybrid approach where opaque objects are processed with deferred rendering and transparent objects are handled with forward rendering.

5-2. Forward+ (Tiled Forward) Rendering

This is a newer approach that combines the advantages of Forward and Deferred rendering.

flowchart TB
    subgraph Phase1["Phase 1: Depth Pre-pass"]
        A["Render all objects' depth only first"]
    end

    subgraph Phase2["Phase 2: Light Culling (Compute Shader)"]
        B["Divide the screen into tiles (e.g., 16×16)"]
        C["Generate a list of lights affecting each tile"]
        B --> C
    end

    subgraph Phase3["Phase 3: Forward Shading"]
        D["When rendering each object, refer to<br>only the light list for that tile"]
    end

    Phase1 --> Phase2 --> Phase3
Light Heatmap (Light Density Visualization):

  ┌────┬────┬────┬────┐
  │ 2  │ 3  │ 5  │ 2  │  ← Number of lights affecting each tile
  ├────┼────┼────┼────┤
  │ 1  │ 8  │ 12 │ 4  │  ← Lights concentrated in the center tile
  ├────┼────┼────┼────┤
  │ 1  │ 6  │ 9  │ 3  │
  ├────┼────┼────┼────┤
  │ 0  │ 2  │ 3  │ 1  │
  └────┴────┴────┴────┘

→ Each tile calculates only the lights affecting its area
→ A tile with 12 lights processes only 12, while a tile with no lights processes none
| Forward+ | Forward | Deferred |
| --- | --- | --- |
| Robust to light count | Sensitive to light count | Robust to light count |
| Supports transparency | Supports transparency | Does not support transparency |
| Supports MSAA | Supports MSAA | Does not support MSAA |
| No G-Buffer needed | No G-Buffer needed | Requires G-Buffer |
| Needs Compute Shader | No additional passes needed | Requires additional Geometry Pass |
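To make Phase 2 concrete, here is a toy CPU version of tile light culling that treats each light as a screen-space circle and intersects it with each tile's bounding box. Real implementations run this in a compute shader and also use each tile's min/max depth; all names here are illustrative:

```python
def cull_lights(screen_w, screen_h, tile, lights):
    """lights: list of (cx, cy, radius) in pixels.
    Returns {(tile_x, tile_y): [light indices affecting that tile]}."""
    tiles = {}
    for ty in range(0, screen_h, tile):
        for tx in range(0, screen_w, tile):
            hits = []
            for i, (cx, cy, r) in enumerate(lights):
                # Closest point on the tile's AABB to the light center
                nx = min(max(cx, tx), tx + tile)
                ny = min(max(cy, ty), ty + tile)
                if (cx - nx) ** 2 + (cy - ny) ** 2 <= r * r:
                    hits.append(i)
            tiles[(tx // tile, ty // tile)] = hits
    return tiles

# 64x64 screen, 16px tiles, two small lights in opposite corners
tiles = cull_lights(64, 64, 16, [(8, 8, 10), (60, 60, 5)])
print(tiles[(0, 0)], tiles[(3, 3)])  # [0] [1]
```

The culling cost is tiles × lights once per frame, after which each fragment loops only over its tile's short list instead of every light in the scene.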

5-3. TBDR: Tile-Based Deferred Rendering

This is a hardware-level rendering method used in mobile GPUs (such as ARM Mali, Qualcomm Adreno, and Apple GPUs). It differs from software-based deferred rendering.

Existing GPU (Immediate Mode Rendering):
  Process one triangle at a time → Immediately record to VRAM
  → High memory bandwidth usage

Mobile GPU (Tile-Based Deferred Rendering - TBDR):
  ┌────┬────┬────┬────┐
  │ T0 │ T1 │ T2 │ T3 │  ← Divide the screen into tiles (usually 16×16 ~ 32×32)
  ├────┼────┼────┼────┤
  │ T4 │ T5 │ T6 │ T7 │
  ├────┼────┼────┼────┤
  │ T8 │ T9 │T10 │T11 │
  └────┴────┴────┴────┘

  Process each tile within the GPU chip's on-chip memory
  → Only record to VRAM after processing a tile is complete
  → Significantly reduce memory bandwidth usage → Reduce power consumption
| Feature | Immediate Mode Rendering (IMR, desktop GPUs) | Tile-Based Deferred Rendering (TBDR, mobile GPUs) |
| --- | --- | --- |
| Rendering unit | Per triangle | Per tile |
| Memory access | VRAM per fragment | Only when a tile is complete |
| Bandwidth | High | Low |
| Power consumption | High | Low |
| Hidden surface removal | Early-Z | Hardware HSR (more efficient) |

When optimizing for mobile, TBDR must be considered. Switching render targets in TBDR requires flushing the current tile to VRAM and reloading data for the new render target. This is why reducing render passes is crucial in mobile optimization. Unity URP’s “Single Pass” rendering design takes this into account.
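As a rough illustration of why tiling saves bandwidth, a back-of-the-envelope model (illustrative numbers only; real traffic depends on framebuffer compression, formats, and blending):

```python
def vram_traffic_mb(width, height, bytes_per_pixel, overdraw, tiled):
    """Rough model: IMR writes every overdrawn fragment to VRAM;
    TBDR resolves overdraw in on-chip memory and flushes each pixel once."""
    writes_per_pixel = 1 if tiled else overdraw
    return width * height * bytes_per_pixel * writes_per_pixel / (1024 * 1024)

# 1080p, RGBA8 color, average overdraw of 3
print(vram_traffic_mb(1920, 1080, 4, 3, tiled=False))  # ~23.7 MB per frame
print(vram_traffic_mb(1920, 1080, 4, 3, tiled=True))   # ~7.9 MB per frame
```

At 60 FPS that 3x difference in color traffic alone is over a gigabyte per second, which is why TBDR matters so much for mobile power budgets.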

Part 6: The Physics of Light and Materials

The final output of the rendering pipeline calculates “how much light reaches a pixel, and what color it is.” Understanding the physical properties of light helps in comprehending why parameters are designed as they are in PBR shaders.

6. Properties of Light

Light is both a wave and a particle (wave-particle duality). Both properties are utilized in rendering.

| Property | Physical Phenomenon | Rendering Application |
| --- | --- | --- |
| Reflection (反射) | Incident angle = reflection angle | Specular, environment map reflection |
| Refraction (折射) | Light changes direction at material boundaries | Glass, water, diamond |
| Absorption (吸收) | Energy → heat | Surface color, transparency fading |
| Scattering (散射) | Light spreads upon hitting particles | SSS (skin), blue sky, fog |
| Interference (干涉) | Waves reinforce or cancel | Soap-film rainbow, anti-reflective coating |
| Diffraction (衍射) | Bending around obstacles | Almost ignored in rendering |

Law of Energy Conservation

When light hits a surface, the sum of reflected, absorbed, and transmitted energies cannot exceed the incident energy.

\[E_{reflected} + E_{absorbed} + E_{transmitted} = E_{incident}\]

The classical Phong model ignored this law, simply adding Diffuse + Specular + Ambient. PBR (Physically Based Rendering), in contrast, adheres to it strictly. This is why Diffuse drops to 0 when Metallic is 1: metals either absorb incoming light or return it as specular reflection, leaving no energy for a diffuse term.
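One common way engines encode this (a convention popularized by metallic/roughness workflows, not the only possible parameterization) is to fade the diffuse term out with Metallic and blend $F_0$ from a dielectric base of 0.04 toward the albedo:

```python
def split_energy(albedo, metallic):
    """Common PBR convention: metals lose their diffuse term entirely,
    and their albedo becomes the (colored) specular F0 instead."""
    diffuse = tuple(a * (1.0 - metallic) for a in albedo)
    f0 = tuple(0.04 * (1.0 - metallic) + a * metallic for a in albedo)
    return diffuse, f0

gold = (1.0, 0.71, 0.29)
# Fully metallic: no diffuse energy left; albedo drives the specular color
print(split_energy(gold, 1.0))  # ((0.0, 0.0, 0.0), (1.0, 0.71, 0.29))
```

At Metallic = 0 the same function returns the albedo as diffuse and a flat 0.04 $F_0$, matching the non-metal values in the Fresnel table below.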

6-1. Fresnel Effect

This is an effect where the reflectance changes depending on the viewing angle, and it can be easily observed in everyday life.

Looking at a lake:

Looking straight down (vertical):        Looking from an angle:
     Eye                           Eye
     ↓                            ╲
     ↓                             ╲
  ~~~~~~~~~~~                 ~~~~~~~~~~~~~~
  Water is clearly visible       Sky is reflected
  (Low reflectance, ~2%)         (High reflectance, ~100%)

Fresnel Equation and Schlick Approximation

Although the exact Fresnel equation can be complex, in real-time rendering, Schlick’s approximation is used.

\[F(\theta) = F_0 + (1 - F_0)(1 - \cos\theta)^5\]

Here, $F_0$ represents the reflectance at normal incidence (0 degrees), and $\theta$ is the angle between the view direction and the surface normal.

| Material | $F_0$ Value | Characteristics |
| --- | --- | --- |
| Water | 0.02 | Nearly transparent; reflections visible at grazing angles |
| Glass | 0.04 | Baseline value for non-metallic materials |
| Plastic | 0.04 | Similar to most non-metals |
| Gold | (1.0, 0.71, 0.29) | Colored reflection! |
| Silver | (0.95, 0.93, 0.88) | Nearly white reflection |
| Copper | (0.95, 0.64, 0.54) | Orange reflection |

The Specular Reflection of Metals Has Color Because: Metals selectively absorb light of certain wavelengths and reflect the rest. Gold appears yellow because it absorbs blue light.
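As a sanity check on the approximation, here is a minimal Python sketch of Schlick's formula (scalar $F_0$; for metals the same formula is applied per RGB channel):

```python
def schlick(f0, cos_theta):
    """Schlick's approximation: F = F0 + (1 - F0) * (1 - cos θ)^5."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

water = 0.02
print(schlick(water, 1.0))            # looking straight down: 0.02
print(round(schlick(water, 0.0), 3))  # grazing angle: 1.0 (near-total reflection)
```

These two extremes reproduce the lake example above: clear water from directly overhead, mirror-like sky reflections at a shallow angle.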

Snell’s Law (Refraction)

\[n_1 \sin\theta_1 = n_2 \sin\theta_2\]
Incident Light         Reflected Light
  ╲    θ₁  ╱
   ╲   │  ╱
    ╲  │ ╱
─────╲─│╱─────── Interface (Air → Water)
      ╲│╱
       │╲
       │ ╲  θ₂
       │  ╲
       Refracted Light
| Medium | Refractive Index (n) |
| --- | --- |
| Vacuum | 1.0 |
| Air | 1.0003 |
| Water | 1.33 |
| Glass | 1.5 |
| Diamond | 2.42 |

When light refracts, the refractive index varies with wavelength (dispersion). This is the principle behind rainbows appearing in prisms.
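A quick numeric check of Snell's law (angles measured from the surface normal); the helper also reports total internal reflection, which occurs when sin θ₂ would have to exceed 1:

```python
import math

def refract_angle(theta1_deg, n1, n2):
    """Snell's law: n1 sin θ1 = n2 sin θ2. Returns θ2 in degrees, or None
    for total internal reflection."""
    s = n1 / n2 * math.sin(math.radians(theta1_deg))
    if s > 1.0:
        return None  # total internal reflection
    return math.degrees(math.asin(s))

# Air -> water at 45° incidence: the ray bends toward the normal
print(round(refract_angle(45.0, 1.0, 1.33), 1))  # 32.1
```

Going the other way, glass to air at 60° already exceeds the critical angle, so the function returns None: the light cannot exit and reflects internally instead.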

6-2. BRDF (Bidirectional Reflectance Distribution Function)

BRDF (Bidirectional Reflectance Distribution Function) is a function that defines “how much light coming from a specific direction is reflected in another specific direction.” It forms the mathematical foundation of PBR.

\[f_r(\omega_i, \omega_o) = \frac{dL_o(\omega_o)}{dE_i(\omega_i)}\]
  • $\omega_i$: Incident Light Direction
  • $\omega_o$: Reflected Light (View) Direction
  • $L_o$: Reflected (outgoing) radiance
  • $E_i$: Incident Irradiance

Microfacet Theory

In PBR, surfaces are assumed to be composed of countless microfacets on a microscopic scale. Each microfacet acts like a perfect mirror, with its normal direction randomly distributed.

Rough Surface (High Roughness):        Smooth Surface (Low Roughness):

    ↗  ↑  ↖  ↗  ↑                    ↑  ↑  ↑  ↑  ↑
  ╱╲╱╲╱╲╱╲╱╲╱╲╱╲╱╲               ───────────────────
  Microfacet normals vary widely    Microfacets share nearly one direction
  → Light scatters in many          → Light reflects concentrated in
     directions                        one direction
  → Broad, soft highlights          → Sharp, bright highlights

Cook-Torrance BRDF (PBR Specular):

\[f_{spec} = \frac{D(\vec{h}) \cdot F(\vec{v}, \vec{h}) \cdot G(\vec{l}, \vec{v}, \vec{h})}{4 \cdot (\vec{n} \cdot \vec{l}) \cdot (\vec{n} \cdot \vec{v})}\]
| Term | Name | Physical Meaning |
| --- | --- | --- |
| D (NDF) | Normal Distribution Function | Fraction of microfacet normals aligned with the half vector; higher roughness widens the distribution |
| F (Fresnel) | Fresnel Term | Change in reflectance with viewing angle |
| G (Geometry) | Geometry / Shadow-Masking Term | Fraction of microfacets occluding or shadowing one another |

The reason GGX (Trowbridge-Reitz) is the industry standard for the D term: compared with the older Beckmann or Blinn-Phong NDFs, it has a longer-tailed distribution, producing a natural glow that falls off gradually around the highlight core. This matches highlight patterns observed in reality more closely.
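For illustration, a small Python version of the GGX NDF using the common Disney remap α = roughness² (a sketch of the standard formula, not any particular engine's code):

```python
import math

def d_ggx(n_dot_h, roughness):
    """GGX / Trowbridge-Reitz NDF: D = α² / (π ((N·H)²(α² - 1) + 1)²)."""
    a2 = (roughness ** 2) ** 2          # α = roughness², so a2 = α²
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)

# The peak at N·H = 1 drops as roughness rises: the highlight widens and dims
print(d_ggx(1.0, 0.1) > d_ggx(1.0, 0.5) > d_ggx(1.0, 0.9))  # True
```

Evaluating at off-peak N·H values shows the long tail: GGX keeps noticeably more energy far from the peak than a Gaussian-like Beckmann lobe would.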

6-3. Scattering and Atmospheric Effects

The phenomenon where light spreads out when it hits particles in the air is called scattering.

| Type of Scattering | Particle Size vs Wavelength | Characteristics | Phenomenon |
| --- | --- | --- | --- |
| Rayleigh scattering | Particle ≪ wavelength | Shorter wavelengths (blue) scatter more | Blue sky, red sunset |
| Mie scattering | Particle ≈ wavelength | All wavelengths scatter similarly | White clouds, fog |
Principle of a blue sky (Rayleigh scattering):

  Sun ────→ [Air molecules in the atmosphere] ────→ Eye
                   │
                   │ Blue light is more
                   │ scattered
                   ↓
               Light spreads out in all directions
               → Sky appears blue

  Sunset/Sunrise:
  Sun ─────────────────────→ Long path ──→ Eye
                                        Blue light has already been scattered
                                        Only red light reaches
                                        → Sky appears red
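Rayleigh scattering intensity is proportional to 1/λ⁴, which a one-liner makes concrete:

```python
def rayleigh_ratio(lambda_short_nm, lambda_long_nm):
    """Relative Rayleigh scattering strength of the shorter wavelength (∝ 1/λ⁴)."""
    return (lambda_long_nm / lambda_short_nm) ** 4

# Blue (~450 nm) vs red (~700 nm): blue scatters roughly 6x more strongly
print(round(rayleigh_ratio(450, 700), 1))  # 5.9
```

That ~6x factor is why scattered skylight looks blue overhead, and why the long sunset path strips the blue out before the light reaches your eye.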

SSS (Subsurface Scattering)

This is the phenomenon where light enters below a surface, scatters internally, and then exits. It can be observed in phenomena like skin (appearing red when backlit), marble, and wax. For realistic human skin rendering, SSS is essential, and UE5’s Subsurface Profile and Unity HDRP’s Diffusion Profile support it.

Part 7: Global Illumination (GI)

Direct light alone cannot create a realistic scene. In reality, indirect light, where light reflects off one surface to illuminate another, is abundant. Simulating this indirect light is what GI (Global Illumination) is about.

7. Direct Light vs Indirect Light

Direct light only (GI none):          Direct light + indirect light (GI present):

    ☀️ Sun                           ☀️ Sun
     ╲                               ╲
      ╲                               ╲
  ┌────╲────┐                    ┌────╲────┐
  │  Bright │                    │  Bright │
  │         │                    │    ↘    │
  │ Full    │                    │ Slightly│
  │ Black   │ ← Shadow area      │ Bright  │ ← Light reflected from the wall
  └─────────┘                    └─────────┘
  Unrealistically dark            Natural brightness variation
### 7-1. Global Illumination Techniques

| Technique | Principle | Real-time? | Quality | Usage |
| --- | --- | --- | --- | --- |
| **Lightmap Baking** | Pre-calculated and stored in a texture | Pre-rendered | High (Static) | Static environments (Unity Lightmaps) |
| **Light Probe** | SH coefficients saved at multiple points in space | Pre-rendered | Medium | Indirect lighting of dynamic objects |
| **SSAO ⚠️** | Occlusion approximation in screen space | Real-time | Low-Medium | Narrow shadows (Ambient Occlusion) |
| **SSR ⚠️** | Screen-space reflection tracing | Real-time | Medium | Floor reflections, water surfaces |
| **Ray Tracing** | Ray tracing | Real-time (HW) | High | RTX, DXR supported GPUs |
| **Radiosity** | Simulation of diffuse reflection only | Pre-rendered | High (Diffuse) | Architectural visualization |
| **Path Tracing** | Tracing all light paths | Offline/Real-time | Best | Film, UE5 Path Tracer |
| **Lumen (UE5)** | SDF + Screen Trace + HW/SW RT hybrid | Real-time | High | UE5 default GI |

> ⚠️ **SSAO and SSR are technically not true Global Illumination (GI) techniques but rather screen space auxiliary techniques.** SSAO is an Ambient Occlusion technique, while SSR provides reflection effects. True GI simulates indirect lighting through multiple reflections of light, which includes techniques like Lightmaps, Light Probes, Ray Tracing, and Lumen. However, since SSAO/SSR are often used alongside the GI pipeline for additional effects, they have been included in this table.
{: .prompt-info}

### 7-2. Ray Tracing

This is a method of **tracing rays (light rays) from the camera to track where these rays intersect with surfaces**.

Camera ──→ Ray ──→ Surface A (intersection point)
                     │
                     ├──→ Shadow Ray → light source (shadow determination)
                     │
                     ├──→ Reflection Ray → Surface B (reflection)
                     │
                     └──→ Refraction Ray → Surface C (refraction)

It utilizes the reversibility of light (angle of incidence equals angle of reflection). Instead of shooting an infinite number of rays from the light source, **tracing back from the camera** allows us to calculate only the rays that appear on the screen.

### Why Spheres Are Easiest to Represent in Ray Tracing

The intersection point between a ray and a sphere can be easily calculated using the **quadratic formula**:

$$
\|\vec{O} + t\vec{D} - \vec{C}\|^2 = r^2
$$

This expands into a quadratic equation for $t$:

$$
at^2 + bt + c = 0 \quad \text{where}
$$

$$
a = \vec{D} \cdot \vec{D}, \quad b = 2\vec{D} \cdot (\vec{O} - \vec{C}), \quad c = (\vec{O} - \vec{C}) \cdot (\vec{O} - \vec{C}) - r^2
$$

Using the discriminant $\Delta = b^2 - 4ac$:
- $\Delta < 0$: No intersection
- $\Delta = 0$: Tangent (1 intersection point)
- $\Delta > 0$: Penetration (2 intersection points, use the closer one)

In contrast, checking for intersections with a triangle mesh requires performing Ray-Triangle Intersection checks for each triangle. Therefore, **acceleration structures like BVH (Bounding Volume Hierarchy)** are essential. Hardware ray tracing (RTX) processes this BVH traversal in a specialized unit called the RT Core at high speed.
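The quadratic above translates directly into code. A minimal Python version returning the nearest positive hit distance t:

```python
import math

def ray_sphere(origin, direction, center, radius):
    """Solve |O + tD - C|² = r² for t; return the nearest positive root or None."""
    oc = tuple(o - c for o, c in zip(origin, center))
    a = sum(d * d for d in direction)
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * a * c              # the discriminant Δ
    if disc < 0.0:
        return None                          # Δ < 0: no intersection
    t = (-b - math.sqrt(disc)) / (2.0 * a)   # nearer of the two roots
    return t if t > 0.0 else None

# Ray from the origin along +z toward a unit sphere centred at z = 5
print(ray_sphere((0, 0, 0), (0, 0, 1), (0, 0, 5), 1.0))  # 4.0
```

As a simplification, a nearer root behind the origin returns None; a full tracer would also test the far root to handle a camera placed inside the sphere.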

### 7-3. Hybrid Rendering (The Standard of Today)

The current standard for real-time rendering is a **rasterization + ray tracing hybrid**.

Hybrid Rendering Pipeline:

┌─────────────────────────────────────────┐
│ Rasterization (basic scene rendering)   │
│  → Opaque objects, basic shading        │
└──────────────────┬──────────────────────┘
                   │
  ┌────────────────┼────────────────┐
  │                │                │
  ▼                ▼                ▼
Ray Tracing      Ray Tracing      Ray Tracing
Shadows          Reflections      Global Illumination
(Shadow Rays)    (Reflection Rays)(Diffuse Bounces)
  │                │                │
  └────────────────┼────────────────┘
                   │
                   ▼
             Compositing
                   │
                   ▼
             Post-Processing

This approach **rasterizes basic geometry quickly** and **uses ray tracing to enhance accuracy for shadows, reflections, and global illumination**.

## Part 8: Evolution of Modern Rendering Pipelines

Traditional rendering pipelines have remained largely unchanged for decades. However, in recent years, there has been a paradigm shift towards **GPU-Driven Rendering**.

### 8. GPU-Driven Rendering

In the traditional pipeline, **the CPU decides "what to render" and issues commands to the GPU**. In the GPU-Driven approach, **the GPU itself decides "what to render"**.

Traditional Pipeline:
  CPU: Culling → Sorting → Batching → Draw call generation → Send to GPU
  GPU: Executes the received commands

GPU-Driven Pipeline:
  CPU: Uploads all scene data to the GPU once
  GPU: Culls and sorts clusters/meshes in Compute Shaders
       → Issues Indirect Draws on its own
       → Minimal CPU intervention!

| Traditional | GPU-Driven |
| --- | --- |
| CPU handles object-level culling | GPU uses Compute Shaders for cluster/mesh-level culling |
| Object-by-object draw calls | Uses Indirect Draw to minimize draw calls |
| CPU bottleneck is common | Distributes bottlenecks through GPU operations |
| Limited to thousands of objects | Capable of handling millions of objects |

### 8-1. Nanite (UE5 Virtual Geometry)

Unreal Engine 5's Nanite is a representative implementation of GPU-Driven Rendering.

Traditional LOD:                        Nanite:

  Pre-made models selected by distance    Mesh divided into clusters (~128 triangles)
    LOD 0: ████ (10K tris)                GPU determines cluster visibility in real time
    LOD 1: ██   (1K tris)                 Only visible clusters are rasterized
    LOD 2: █    (100 tris)                  ┌──┬──┬──┬──┐
  → Pop-in effect                           │CL│CL│  │CL│
  → LOD authoring cost                      ├──┼──┼──┼──┤
  → Mesh quality limitations                │  │CL│CL│  │
                                            └──┴──┴──┴──┘
                                          → No pop-in, continuous LOD
                                          → Billions of polygons in real time

Nanite's core pipeline:

1. **Cluster-based subdivision**: Divide meshes into clusters of approximately 128 triangles.
2. **GPU culling**: Perform occlusion culling using Compute Shaders.
3. **Software rasterization**: Directly rasterize small triangles in Compute Shaders.
4. **Hardware rasterization**: Use the existing pipeline for larger triangles.
5. **Visibility Buffer**: Store only triangle IDs instead of G-Buffer → Material shading is done at the end.

### 8-2. Mesh Shader (DirectX 12 Ultimate / Vulkan)

This is a new pipeline stage that **completely replaces** the traditional Vertex Shader → Geometry Shader flow.

Traditional Pipeline:
  Input Assembly → Vertex Shader → [Tessellation] → [Geometry Shader] → Rasterizer

Mesh Shader Pipeline:
  [Amplification Shader (Vulkan: Task Shader)] → Mesh Shader → Rasterizer

| Traditional | Mesh Shader |
| --- | --- |
| Vertex Unit Processing | Meshlet Unit Processing |
| Fixed Input Format | Free Data Access |
| Poor Geometry Shader Performance | Workgroup-Based, Efficient |
| LOD/Culling on CPU | LOD/Culling on GPU for Meshlets |

Mesh Shader can be seen as a hardware standard for implementing systems like Nanite.

> **Nanite's Limitations**: Nanite only works in the Deferred Rendering Path and is not compatible with Forward Rendering or MSAA. Support for VR (stereo rendering) is limited, and there are restrictions on transparent materials and certain material features such as world position offset.
{: .prompt-warning}

### 8-3. Lumen (UE5 Real-Time GI)

Lumen is a real-time global illumination system that combines **Software Ray Tracing (SDF Tracing), Screen Space Tracing, and Hardware Ray Tracing (HW RT)** based on the situation. When HW RT is enabled, it can provide more accurate results in the final gather.

Lumen GI Pipeline:

  1. Scene SDF (Signed Distance Field) generation → store distance information around each mesh in a 3D texture
  2. Screen Space Trace (close range) → fast ray marching within the visible area of the screen
  3. SDF Trace, software ray tracing (medium/long range) → march rays through the SDF, much faster than polygon intersection tests
  4. (Optional) Hardware Ray Tracing → enabled on GPUs with RT cores; provides more accurate reflections/GI in the final gather
  5. Radiance Cache → probes placed throughout the scene cache indirect lighting
  6. Final Compositing → direct light + indirect light (diffuse GI + specular reflection)

The core of Lumen is to automatically adjust the trade-off between accuracy and performance. It calculates closer areas with higher accuracy and farther areas with lower accuracy.

Part 9: Comparison of Engine Rendering Pipelines

9. Main Engine Comparison

The following comparison is based on Unity 6 (6000.x) / Unreal Engine 5.4+ / Godot 4.3+. The supported range may vary depending on the engine version.

| Item | Unity URP | Unity HDRP | Unreal Engine 5 | Godot 4 |
| --- | --- | --- | --- | --- |
| Rendering Path | Forward / Forward+ / Deferred | Deferred (default), Forward | Deferred (default), Forward | Forward+ (Vulkan), Forward (GLES3) |
| GI | Light Probe, Lightmap, APV | SSGI, APV, Lightmap, Screen Space Reflection | Lumen (SDF + HW RT hybrid) | Lightmap, SDFGI, VoxelGI |
| LOD | Manual LOD | Manual LOD | Nanite (automatic) | Manual LOD |
| Transparency | Forward | Forward pass | Forward (separate) | Forward |
| AA | FXAA, MSAA, TAA, STP | TAA, MSAA | TSR (TAA-based), FXAA, MSAA (Forward only) | MSAA, TAA, FXAA |
| Mobile | Optimized | Not supported | Limited | GLES3 support |
| VR | Forward recommended | Forward option | Forward option | Forward |
| Shader Language | HLSL (ShaderLab) | HLSL (ShaderLab) | HLSL (UE macros) | GLSL-based |
| Graphics API | DX11/12, Vulkan, Metal | DX12, Vulkan | DX11/12, Vulkan | Vulkan, GLES3 |

Reason for Recommending Forward in VR

VR requires stereo rendering (a different image for each eye). Deferred rendering must build a G-Buffer per eye, which doubles memory and bandwidth costs; Forward avoids this extra cost because it uses no G-Buffer. Aliasing is also more noticeable in VR, making MSAA important, and traditional deferred rendering cannot easily apply MSAA at the G-Buffer stage. This is not an absolute constraint, however: temporal anti-aliasing such as UE5's TSR or Unity HDRP's TAA can compensate. Deferred rendering in VR is therefore possible depending on the engine and platform, so "Forward recommended" is more accurate than "Forward required."

Part 10: Rendering Pipeline Profiling

10. Bottleneck Diagnosis

To diagnose rendering performance issues, it is first necessary to identify where the bottleneck lies.

Rendering Pipeline Bottleneck Diagnosis Flow:

                  ┌──────────────────┐
                  │      Low FPS!      │
                  └────────┬─────────┘
                           │
                  ┌────────▼─────────┐
                  │ CPU Time > GPU Time? │
                  └────────┬─────────┘
                     Yes ╱     ╲ No
                   ╱                ╲
          ┌────────▼────┐    ┌──────▼──────────┐
          │  CPU Bottleneck   │    │    GPU Bottleneck       │
          │ • Excessive Draw Calls│    │ • Vertex Bottleneck?   │
          │ • Poor Culling     │    │ • Fragment Bottleneck?  │
          │ • Script GC       │    │ • Bandwidth Bottleneck?    │
          └─────────────┘    └──────────────────┘
| Bottleneck Type | Symptoms | Solutions |
| --- | --- | --- |
| CPU (draw calls) | Low GPU utilization yet low FPS | Batching, instancing, SRP Batcher |
| Vertex processing | Slower on high-polygon meshes | LOD, mesh optimization, culling |
| Rasterization | Slow with large screen-filling triangles | Occlusion culling, small-triangle culling |
| Fragment (texture) | Heavy use of high-resolution textures | Mipmaps, texture compression, streaming |
| Fragment (ALU) | Overly complex shaders | Shader simplification, LOD shaders |
| Fill rate | Excessive particles/transparency | Reduce overdraw, optimize particles |
| Bandwidth | Heavy G-Buffer use, high resolution | Reduce resolution, data packing |

Profiling Tools

| Tool | Engine | Key Features |
| --- | --- | --- |
| Frame Debugger | Unity | Draw call order, render target visualization |
| RenderDoc | Universal | Frame capture, shader debugging |
| GPU Visualizer | Unreal | Per-pass GPU timing |
| NVIDIA Nsight | Universal | GPU profiling, warp analysis |
| Xcode GPU Profiler | Metal/Apple | Metal shader profiling |
| PIX | DirectX/Xbox | Shader debugging, timing |

Conclusion

The rendering pipeline is ultimately the “systematic process of converting the mathematical representation of a 3D world into pixel colors on a 2D screen.”

Summarizing the core flow:

  1. Application Stage (CPU): Deciding what to render and issuing commands to the GPU
  2. Coordinate Transformations: Object space → World → View → Clip → NDC → Screen
  3. Rasterization: Mapping triangles onto the pixel grid, interpolating attributes
  4. Fragment Shader: Calculating pixel colors from interpolated data
  5. Output Merger: Finalizing with depth/stencil testing and blending

Selection criteria for rendering architectures:

| Scenario | Recommended Architecture |
| --- | --- |
| Mobile, few lights, VR | Forward |
| PC/Console, many lights | Deferred |
| Many lights, transparency needed | Forward+ |
| UE5 next-gen | Nanite + Lumen |

And the flow of modern rendering is clear: it is evolving towards GPU-driven rendering, reducing CPU dependency. Nanite, Mesh Shader, and Lumen are all products of this direction.

Understanding light physics connects each step of the rendering pipeline to why it was designed that way.

| Resource | Type | Description |
| --- | --- | --- |
| Real-Time Rendering, 4th Edition | Book | The bible of rendering pipelines |
| GPU Gems series (NVIDIA) | Book/Online | Collection of practical techniques |
| A Trip Through the Graphics Pipeline (Fabian Giesen) | Blog | Inside the GPU pipeline's internal operation |
| Advances in Real-Time Rendering (SIGGRAPH) | Presentation | Annual survey of the latest rendering techniques |
| Physics and Math of Shading (Naty Hoffman) | Presentation | The definitive guide to PBR mathematics |
| g-Matrix3d neo (Professor Sungwan Kim) | GitHub | Open-source rendering engine implementation |
This post is licensed under CC BY 4.0 by the author.