Post

VRAM Deep Dive: GPU Memory Hierarchy and LLM Model Loading

VRAM Deep Dive: GPU Memory Hierarchy and LLM Model Loading
Visitors

This document is a supplement to Section 7, “Hardware Configuration,” of How LLMs Work: A Guide for Game Developers.


1. GPU Memory Hierarchy

Analogy with the Game Rendering Pipeline

For game developers, the GPU memory hierarchy is familiar. Just as a shader accesses VRAM through a texture cache when sampling a texture, LLM inference goes through the same memory hierarchy. The only difference is that the data being accessed are float values of weight matrices instead of texture pixels.

Hierarchy Structure

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Registers
│  Capacity: ~Few KB/SM  |  Bandwidth: Max  |  Latency: Min (~1 cycle)
│  Role: Holds values currently being computed
▼
Shared Memory / L1 Cache (SRAM)
│  Capacity: 128-256 KB/SM  |  Bandwidth: ~Tens of TB/s  |  Latency: ~Tens of cycles
│  Role: Data sharing between threads within an SM (Streaming Multiprocessor)
│  Game Analogy: groupshared memory in a shader
▼
L2 Cache
│  Capacity: 50-100 MB (Entire chip)  |  Bandwidth: ~Few TB/s  |  Latency: ~Hundreds of cycles
│  Role: Caching shared data between SMs
│  Game Analogy: Texture cache
▼
VRAM (GDDR / HBM)
   Capacity: 24-80 GB  |  Bandwidth: ~1-3 TB/s (GDDR7 - HBM3e)  |  Latency: ~Hundreds of cycles
   Role: Storage for model weights, KV cache, activation data
   Game Analogy: GPU memory where textures/meshes/framebuffers reside

Mermaid Diagram

flowchart TB
    subgraph GPU_Chip["Inside GPU Chip"]
        direction TB
        REG["Registers<br/>~Few KB/SM<br/>Fastest"]
        SRAM["Shared Memory / L1 Cache (SRAM)<br/>128-256 KB/SM<br/>Shared by threads in SM"]
        L2["L2 Cache<br/>50-100 MB<br/>Shared between SMs"]
    end

    HBM["VRAM (GDDR/HBM)<br/>24-80 GB<br/>Model weights reside here"]

    REG -->|"Fast"| SRAM
    SRAM -->|"Medium"| L2
    L2 -->|"Relatively Slow"| HBM

    style REG fill:#FF6B6B,color:#fff
    style SRAM fill:#FFA07A
    style L2 fill:#FFFACD
    style HBM fill:#90EE90

Role of Each Layer (LLM Inference Focus)

LayerCapacityBandwidthRole in LLMsGame Rendering Analogy
Registers~Few KB/SMMaxOperands for current matrix multiplicationShader variables
SRAM (L1/Shared)128-256 KB/SM~Tens of TB/sIntermediate results of Attention tiling (Core of FlashAttention)groupshared memory
L2 Cache50-100 MB~Few TB/sCaching frequently accessed weights/KV cacheTexture cache
VRAM (GDDR/HBM)24-80 GB~1-3 TB/sStores full model weights, KV cache, activationsTextures/Framebuffers
GPU Memory Hierarchy — Bandwidth Comparison by Layer (Approximate, Log Scale)

Why the Memory Hierarchy Matters: The Memory-bound Problem

The core bottleneck of LLM inference is not computation speed, but memory bandwidth. The GPU’s computation units (Tensor Cores) can process data extremely quickly once it arrives, but the speed of reading data from VRAM cannot keep up with the computation. This is called a Memory-bound state.

In game terms, this is like a situation where the shader code itself is simple, but texture sampling becomes the bottleneck, resulting in low GPU occupancy.

1
2
3
Compute Power:  ████████████████████ (1,000+ TFLOPS)
VRAM Bandwidth: ████████░░░░░░░░░░░░ (~3 TB/s)
                             ↑ Bottleneck here!

The FlashAttention series (v1-v4) solves this exact problem. It minimizes VRAM access and completes computations within SRAM (shared memory) as much as possible to bypass the VRAM bandwidth bottleneck.


2. VRAM Capacity and Model Loading in Practice

What Goes into VRAM

Three main elements occupy VRAM during LLM inference:

flowchart LR
    subgraph VRAM["VRAM Space Allocation"]
        direction TB
        W["Model Weights<br/>(Full parameters)"]
        KV["KV Cache<br/>(Conversation context)"]
        ACT["Activation Memory<br/>(Intermediate results)"]
        OS["OS/Driver Reserved<br/>(~0.5-1 GB)"]
    end

    style W fill:#FFB6C1
    style KV fill:#FFFACD
    style ACT fill:#90EE90
    style OS fill:#D3D3D3

(1) Model Weights

This is the entirety of the parameters after training. They are loaded into VRAM once and remain there until the server is shut down.

Calculation Formula:

1
2
3
4
5
6
Weight Memory = Parameter Count × Precision Bytes

Example: Llama 3.1 70B (FP16)
= 70,000,000,000 × 2 bytes
= 140,000,000,000 bytes
= 140 GB
ModelParametersFP16INT8 (Quantized)INT4 (Quantized)
Llama 3.2 3B3B6 GB3 GB1.5 GB
Llama 3.1 8B8B16 GB8 GB4 GB
Llama 3.1 70B70B140 GB70 GB35 GB
DeepSeek-V3671B (MoE, 37B Active)~1,342 GB*~671 GB*~336 GB*
FP16 Full Precision Load
# Llama 3.1 8B = Requires 16 GB VRAM
# Barely possible on RTX 4090 (24GB)
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.float16,
    device_map="auto"
)
# VRAM: ~16 GB
# Quality: Best (Original precision)
INT4 Quantized Load (bitsandbytes)
# Llama 3.1 8B = Reduced to 4 GB VRAM!
# Can run even on RTX 3060 (12GB)
from transformers import (
    AutoModelForCausalLM, BitsAndBytesConfig
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
# VRAM: ~4 GB (75% savings)
# Quality: Slight loss compared to FP16

*Since MoE models require all Expert weights to reside in VRAM, memory requirements are calculated based on the total parameters (671B). However, since only active parameters (37B) are used for actual per-token computation, the inference speed (compute cost) is comparable to a Dense model of similar scale (~37B).

(2) KV Cache (Key-Value Cache)

This is the cache where the Transformer stores information from previous tokens. It grows as the conversation gets longer.

Calculation Formula:

1
2
3
4
5
6
7
KV Cache Memory = 2 × Layer Count × KV Head Count × Head Dim × Sequence Length × Batch Size × Precision Bytes

* Models using GQA/MQA have fewer KV heads than Query heads, significantly reducing KV cache size.

Example: Llama 3.1 70B (GQA: 8 KV heads, Head Dim 128), 4096 tokens, Batch size 1 (FP16)
= 2 × 80 × 8 × 128 × 4096 × 1 × 2 bytes
≈ 1.25 GB

KV cache is dynamic. It continues to grow as the conversation with the user lengthens and can reach tens of GBs if the entire context window (e.g., 128K tokens) is used.

Context LengthKV Cache (70B, GQA, FP16)Notes
2,048 tokens~0.6 GBShort conversation
8,192 tokens~2.5 GBGeneral conversation
32,768 tokens~10 GBLong document analysis
128,000 tokens~39 GBFull context window

This is why techniques like GQA (Grouped Query Attention) and MQA (Multi-Query Attention) are important. They reduce memory by decreasing the number of Key/Value heads in the KV cache.

(3) Activation Memory

This consists of the intermediate computation results of each layer during the forward pass. While smaller than training, it increases proportionally with the batch size.

Generally, activation memory during inference is relatively small (on the order of a few GBs) compared to weights or the KV cache, making up a low percentage of the total VRAM budget.

Example VRAM Requirement Calculation

1
2
3
4
5
6
7
8
9
10
11
12
Llama 3.1 70B (INT4 Quantized, 8K Context, Batch 1):

Weights:       35 GB (INT4)
KV Cache:      ~1.3 GB (INT8 KV Quantization + GQA applied)
Activations:   ~2 GB
OS/Driver:     ~1 GB
─────────────────────
Total Needed:  ~39.3 GB

→ NVIDIA A100 80GB: Plenty of room
→ NVIDIA RTX 4090 24GB: Insufficient → requires offloading
→ Mac M4 Max 64GB (UMA): Possible (but speed is slower)

When VRAM is Insufficient: Offloading

What happens if the entire model doesn’t fit in VRAM?

flowchart LR
    subgraph Normal["Sufficient VRAM"]
        direction TB
        V1["All layers reside in VRAM"]
        V2["Computation solely on GPU"]
        V3["Fast"]
    end

    subgraph Offload["Insufficient VRAM → Offloading"]
        direction TB
        O1["Keep some layers in System RAM"]
        O2["Transfer via PCIe when needed"]
        O3["Severe speed drop (10-100x)"]
    end

    subgraph OOM["Complete VRAM Shortage → OOM"]
        direction TB
        M1["Model loading fails"]
        M2["Out of Memory error"]
    end

    Normal -->|"Model > VRAM"| Offload
    Offload -->|"Model >> VRAM+RAM"| OOM

Why performance drops so drastically during offloading:

1
2
VRAM (HBM3e) Bandwidth:  ~3,000 GB/s
PCIe 5.0 Bandwidth:       ~64 GB/s   ← Approximately 47x slower

In game terms, this is like a situation in texture streaming where VRAM is so insufficient that textures must be uploaded from system RAM every frame. Just as frame drops would be severe, LLM inference speed falls off a cliff.

Practical GPU Selection Guide

Model ScaleRecommended Min VRAMRecommended GPUNotes
~3B (INT4)4 GBIntegrated GPU, Mobile NPUSimple local execution possible
~8B (INT4)6 GBRTX 3060 12GB, M4 16GBGeneral-purpose assistant level
~13B (INT4)10 GBRTX 4070 12GB, M4 Pro 24GBPractical local model
~34B (INT4)20 GBRTX 4090 24GB, M4 Max 36GBSemi-professional level
~70B (INT4)40 GBA100 80GB, M4 Max 64GBProfessional analysis level
~200B+ (INT4)128 GB+Multi-GPU, Mac Studio UltraLarge models, research

Summary

1
2
3
4
5
VRAM = LLM's Work Desk
├── The wider the desk (more VRAM) → The larger the models you can load
├── Distance from the desk to hands (Bandwidth) → Determines computation speed
├── If the desk is narrow (VRAM shortage) → Putting items on the floor and walking back and forth (offloading) → Slows down
└── Quantization = Swapping books for summarized versions to save desk space

Key Points:

  1. LLM inference is Memory-bound — memory bandwidth is the bottleneck rather than compute speed.
  2. VRAM stores weights + KV cache + activations — the KV cache grows as the conversation lengthens.
  3. If VRAM is insufficient, speed drops 10-100x due to offloading — reducing weight size through quantization is a key strategy.
  4. FlashAttention is a technique that bypasses Memory-bound bottlenecks by minimizing VRAM access.

This document is a supplement to Section 7, “Hardware Configuration,” of How LLMs Work: A Guide for Game Developers.

This post is licensed under CC BY 4.0 by the author.