VRAM Deep Dive: GPU Memory Hierarchy and LLM Model Loading
- GPU Compute Units Deep Dive: CUDA Core, Tensor Core, and NPU
- How LLMs Work: A Guide for Game Developers
- VRAM Deep Dive: GPU Memory Hierarchy and LLM Model Loading
- Playing DOOM with Brain Cells — In-Depth Analysis of Biological Neural Computing: Present and Future
- The Complete LLM Guide for Non-Developers — How Planners, QA, and Marketers Can Use AI at Work
This document is a supplement to Section 7, “Hardware Configuration,” of How LLMs Work: A Guide for Game Developers.
1. GPU Memory Hierarchy
Analogy with the Game Rendering Pipeline
For game developers, the GPU memory hierarchy is familiar. Just as a shader accesses VRAM through a texture cache when sampling a texture, LLM inference goes through the same memory hierarchy. The only difference is that the data being accessed are float values of weight matrices instead of texture pixels.
Hierarchy Structure
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Registers
│ Capacity: ~Few KB/SM | Bandwidth: Max | Latency: Min (~1 cycle)
│ Role: Holds values currently being computed
▼
Shared Memory / L1 Cache (SRAM)
│ Capacity: 128-256 KB/SM | Bandwidth: ~Tens of TB/s | Latency: ~Tens of cycles
│ Role: Data sharing between threads within an SM (Streaming Multiprocessor)
│ Game Analogy: groupshared memory in a shader
▼
L2 Cache
│ Capacity: 50-100 MB (Entire chip) | Bandwidth: ~Few TB/s | Latency: ~Hundreds of cycles
│ Role: Caching shared data between SMs
│ Game Analogy: Texture cache
▼
VRAM (GDDR / HBM)
Capacity: 24-80 GB | Bandwidth: ~1-3 TB/s (GDDR7 - HBM3e) | Latency: ~Hundreds of cycles
Role: Storage for model weights, KV cache, activation data
Game Analogy: GPU memory where textures/meshes/framebuffers reside
Mermaid Diagram
flowchart TB
subgraph GPU_Chip["Inside GPU Chip"]
direction TB
REG["Registers<br/>~Few KB/SM<br/>Fastest"]
SRAM["Shared Memory / L1 Cache (SRAM)<br/>128-256 KB/SM<br/>Shared by threads in SM"]
L2["L2 Cache<br/>50-100 MB<br/>Shared between SMs"]
end
HBM["VRAM (GDDR/HBM)<br/>24-80 GB<br/>Model weights reside here"]
REG -->|"Fast"| SRAM
SRAM -->|"Medium"| L2
L2 -->|"Relatively Slow"| HBM
style REG fill:#FF6B6B,color:#fff
style SRAM fill:#FFA07A
style L2 fill:#FFFACD
style HBM fill:#90EE90
Role of Each Layer (LLM Inference Focus)
| Layer | Capacity | Bandwidth | Role in LLMs | Game Rendering Analogy |
|---|---|---|---|---|
| Registers | ~Few KB/SM | Max | Operands for current matrix multiplication | Shader variables |
| SRAM (L1/Shared) | 128-256 KB/SM | ~Tens of TB/s | Intermediate results of Attention tiling (Core of FlashAttention) | groupshared memory |
| L2 Cache | 50-100 MB | ~Few TB/s | Caching frequently accessed weights/KV cache | Texture cache |
| VRAM (GDDR/HBM) | 24-80 GB | ~1-3 TB/s | Stores full model weights, KV cache, activations | Textures/Framebuffers |
Why the Memory Hierarchy Matters: The Memory-bound Problem
The core bottleneck of LLM inference is not computation speed, but memory bandwidth. The GPU’s computation units (Tensor Cores) can process data extremely quickly once it arrives, but the speed of reading data from VRAM cannot keep up with the computation. This is called a Memory-bound state.
In game terms, this is like a situation where the shader code itself is simple, but texture sampling becomes the bottleneck, resulting in low GPU occupancy.
1
2
3
Compute Power: ████████████████████ (1,000+ TFLOPS)
VRAM Bandwidth: ████████░░░░░░░░░░░░ (~3 TB/s)
↑ Bottleneck here!
The FlashAttention series (v1-v4) solves this exact problem. It minimizes VRAM access and completes computations within SRAM (shared memory) as much as possible to bypass the VRAM bandwidth bottleneck.
2. VRAM Capacity and Model Loading in Practice
What Goes into VRAM
Three main elements occupy VRAM during LLM inference:
flowchart LR
subgraph VRAM["VRAM Space Allocation"]
direction TB
W["Model Weights<br/>(Full parameters)"]
KV["KV Cache<br/>(Conversation context)"]
ACT["Activation Memory<br/>(Intermediate results)"]
OS["OS/Driver Reserved<br/>(~0.5-1 GB)"]
end
style W fill:#FFB6C1
style KV fill:#FFFACD
style ACT fill:#90EE90
style OS fill:#D3D3D3
(1) Model Weights
This is the entirety of the parameters after training. They are loaded into VRAM once and remain there until the server is shut down.
Calculation Formula:
1
2
3
4
5
6
Weight Memory = Parameter Count × Precision Bytes
Example: Llama 3.1 70B (FP16)
= 70,000,000,000 × 2 bytes
= 140,000,000,000 bytes
= 140 GB
| Model | Parameters | FP16 | INT8 (Quantized) | INT4 (Quantized) |
|---|---|---|---|---|
| Llama 3.2 3B | 3B | 6 GB | 3 GB | 1.5 GB |
| Llama 3.1 8B | 8B | 16 GB | 8 GB | 4 GB |
| Llama 3.1 70B | 70B | 140 GB | 70 GB | 35 GB |
| DeepSeek-V3 | 671B (MoE, 37B Active) | ~1,342 GB* | ~671 GB* | ~336 GB* |
# Llama 3.1 8B = Requires 16 GB VRAM
# Barely possible on RTX 4090 (24GB)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
torch_dtype=torch.float16,
device_map="auto"
)
# VRAM: ~16 GB
# Quality: Best (Original precision)# Llama 3.1 8B = Reduced to 4 GB VRAM!
# Can run even on RTX 3060 (12GB)
from transformers import (
AutoModelForCausalLM, BitsAndBytesConfig
)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B",
quantization_config=bnb_config,
device_map="auto"
)
# VRAM: ~4 GB (75% savings)
# Quality: Slight loss compared to FP16*Since MoE models require all Expert weights to reside in VRAM, memory requirements are calculated based on the total parameters (671B). However, since only active parameters (37B) are used for actual per-token computation, the inference speed (compute cost) is comparable to a Dense model of similar scale (~37B).
(2) KV Cache (Key-Value Cache)
This is the cache where the Transformer stores information from previous tokens. It grows as the conversation gets longer.
Calculation Formula:
1
2
3
4
5
6
7
KV Cache Memory = 2 × Layer Count × KV Head Count × Head Dim × Sequence Length × Batch Size × Precision Bytes
* Models using GQA/MQA have fewer KV heads than Query heads, significantly reducing KV cache size.
Example: Llama 3.1 70B (GQA: 8 KV heads, Head Dim 128), 4096 tokens, Batch size 1 (FP16)
= 2 × 80 × 8 × 128 × 4096 × 1 × 2 bytes
≈ 1.25 GB
KV cache is dynamic. It continues to grow as the conversation with the user lengthens and can reach tens of GBs if the entire context window (e.g., 128K tokens) is used.
| Context Length | KV Cache (70B, GQA, FP16) | Notes |
|---|---|---|
| 2,048 tokens | ~0.6 GB | Short conversation |
| 8,192 tokens | ~2.5 GB | General conversation |
| 32,768 tokens | ~10 GB | Long document analysis |
| 128,000 tokens | ~39 GB | Full context window |
This is why techniques like GQA (Grouped Query Attention) and MQA (Multi-Query Attention) are important. They reduce memory by decreasing the number of Key/Value heads in the KV cache.
(3) Activation Memory
This consists of the intermediate computation results of each layer during the forward pass. While smaller than training, it increases proportionally with the batch size.
Generally, activation memory during inference is relatively small (on the order of a few GBs) compared to weights or the KV cache, making up a low percentage of the total VRAM budget.
Example VRAM Requirement Calculation
1
2
3
4
5
6
7
8
9
10
11
12
Llama 3.1 70B (INT4 Quantized, 8K Context, Batch 1):
Weights: 35 GB (INT4)
KV Cache: ~1.3 GB (INT8 KV Quantization + GQA applied)
Activations: ~2 GB
OS/Driver: ~1 GB
─────────────────────
Total Needed: ~39.3 GB
→ NVIDIA A100 80GB: Plenty of room
→ NVIDIA RTX 4090 24GB: Insufficient → requires offloading
→ Mac M4 Max 64GB (UMA): Possible (but speed is slower)
When VRAM is Insufficient: Offloading
What happens if the entire model doesn’t fit in VRAM?
flowchart LR
subgraph Normal["Sufficient VRAM"]
direction TB
V1["All layers reside in VRAM"]
V2["Computation solely on GPU"]
V3["Fast"]
end
subgraph Offload["Insufficient VRAM → Offloading"]
direction TB
O1["Keep some layers in System RAM"]
O2["Transfer via PCIe when needed"]
O3["Severe speed drop (10-100x)"]
end
subgraph OOM["Complete VRAM Shortage → OOM"]
direction TB
M1["Model loading fails"]
M2["Out of Memory error"]
end
Normal -->|"Model > VRAM"| Offload
Offload -->|"Model >> VRAM+RAM"| OOM
Why performance drops so drastically during offloading:
1
2
VRAM (HBM3e) Bandwidth: ~3,000 GB/s
PCIe 5.0 Bandwidth: ~64 GB/s ← Approximately 47x slower
In game terms, this is like a situation in texture streaming where VRAM is so insufficient that textures must be uploaded from system RAM every frame. Just as frame drops would be severe, LLM inference speed falls off a cliff.
Practical GPU Selection Guide
| Model Scale | Recommended Min VRAM | Recommended GPU | Notes |
|---|---|---|---|
| ~3B (INT4) | 4 GB | Integrated GPU, Mobile NPU | Simple local execution possible |
| ~8B (INT4) | 6 GB | RTX 3060 12GB, M4 16GB | General-purpose assistant level |
| ~13B (INT4) | 10 GB | RTX 4070 12GB, M4 Pro 24GB | Practical local model |
| ~34B (INT4) | 20 GB | RTX 4090 24GB, M4 Max 36GB | Semi-professional level |
| ~70B (INT4) | 40 GB | A100 80GB, M4 Max 64GB | Professional analysis level |
| ~200B+ (INT4) | 128 GB+ | Multi-GPU, Mac Studio Ultra | Large models, research |
Summary
1
2
3
4
5
VRAM = LLM's Work Desk
├── The wider the desk (more VRAM) → The larger the models you can load
├── Distance from the desk to hands (Bandwidth) → Determines computation speed
├── If the desk is narrow (VRAM shortage) → Putting items on the floor and walking back and forth (offloading) → Slows down
└── Quantization = Swapping books for summarized versions to save desk space
Key Points:
- LLM inference is Memory-bound — memory bandwidth is the bottleneck rather than compute speed.
- VRAM stores weights + KV cache + activations — the KV cache grows as the conversation lengthens.
- If VRAM is insufficient, speed drops 10-100x due to offloading — reducing weight size through quantization is a key strategy.
- FlashAttention is a technique that bypasses Memory-bound bottlenecks by minimizing VRAM access.
This document is a supplement to Section 7, “Hardware Configuration,” of How LLMs Work: A Guide for Game Developers.
