GPU Compute Units Deep Dive: CUDA Core, Tensor Core, and NPU

This document is a supplement to Section 7, “Hardware Configuration,” of How LLMs Work: A Guide for Game Developers. For an in-depth look at memory, please refer to the VRAM Deep Dive Guide.


Overview: Who Performs the Calculations?

The core of LLM inference is matrix multiplication: repeatedly multiplying and adding billions of numbers. Depending on “who” performs these calculations, speed can differ by a factor of hundreds to thousands.

flowchart LR
    subgraph CPU_Area["CPU"]
        CPU["General-purpose cores (8-24)<br/>Strong in complex branching/control<br/>Unsuitable for matrix multiplication"]
    end

    subgraph GPU_Area["NVIDIA GPU"]
        CUDA["CUDA Cores (Thousands)<br/>Massive parallel simple arithmetic"]
        TC["Tensor Cores (Hundreds)<br/>Dedicated acceleration for matrix mult"]
    end

    subgraph NPU_Area["NPU"]
        NPU["Dedicated AI chips<br/>Specialized for low-power inference"]
    end

    CPU_Area -->|"Slow"| GPU_Area
    GPU_Area -->|"Fast"| Result["LLM Inference"]
    NPU_Area -->|"Efficient"| Result

    style CPU_Area fill:#FFB6C1
    style GPU_Area fill:#90EE90
    style NPU_Area fill:#87CEEB

Analogy for game developers:

  • CPU = Complex branching in game logic, AI decision-making, or physics simulation.
  • CUDA Core = Vertex shaders transforming thousands of vertices simultaneously.
  • Tensor Core = Dedicated units, like RT Cores for Ray Tracing, that perform specific operations extremely fast.
  • NPU = A low-power image processing DSP in a mobile device.

1. CUDA Core: The Basic Unit of General Parallel Computing

What is a CUDA Core?

A CUDA (Compute Unified Device Architecture) Core is the most basic computing unit of an NVIDIA GPU. Each CUDA Core can perform one floating-point (float) or integer (int) operation. GPUs are powerful because they have thousands of these cores, all performing calculations simultaneously.

Role in Game Rendering

For game developers, CUDA Cores are very familiar. The shaders we write execute directly on these cores:

Vertex Shader:    Transforms each vertex → Handled by one CUDA Core
Fragment Shader:  Calculates each pixel color → Handled by one CUDA Core
Compute Shader:   Each thread → Handled by one CUDA Core

When rendering a frame at 1080p resolution, colors for about 2 million pixels must be calculated. Sequential processing by a CPU would be extremely slow, but with thousands of CUDA Cores working in parallel, it’s completed within 16ms (60 FPS).
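To make the parallelism concrete, here is a quick Python back-of-envelope. The 100 ns per-pixel cost and the 5,000-core count are illustrative assumptions, not measurements:

```python
# Back-of-envelope: why per-pixel work must run in parallel.
width, height = 1920, 1080
pixels = width * height            # ~2.07 million fragments per frame
frame_budget_ms = 1000 / 60        # ~16.67 ms for 60 FPS

# If one core spent just 100 ns per pixel, a frame would take:
ns_per_pixel = 100
serial_ms = pixels * ns_per_pixel / 1e6   # ~207 ms: far over budget

# Spread across ~5,000 CUDA Cores, the same work fits easily:
cores = 5000
parallel_ms = serial_ms / cores           # ~0.04 ms

print(round(serial_ms, 1), round(parallel_ms, 3))
```

Even with generous per-pixel costs, thousands of cores bring the frame back well under the 16.67 ms budget.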

Role in LLMs

In LLM inference, CUDA Cores perform the following tasks:

| Operation | Description | Role of CUDA Cores |
| --- | --- | --- |
| Activation Functions | ReLU, GELU, SiLU, etc. | Applying non-linear functions to each element |
| LayerNorm | Normalization operations | Mean/variance calculation, scaling |
| Softmax | Calculating probability distribution | Exponential functions, summation, division |
| Element-wise Ops | Addition, residual connections | Parallel processing of vector elements |
| Token Sampling | Top-K, Top-P filtering | Probability sorting and sampling |

However, matrix multiplication, the heaviest operation in LLMs, is handled by Tensor Cores instead of CUDA Cores. While CUDA Cores can perform matrix multiplication, they are 10-20 times slower than Tensor Cores.

CUDA Core Count by GPU

| GPU | CUDA Core Count | Released | Primary Use |
| --- | --- | --- | --- |
| RTX 3060 | 3,584 | 2021 | Consumer Gaming |
| RTX 4090 | 16,384 | 2022 | Consumer Flagship |
| RTX 5090 | 21,760 | 2025 | Consumer Flagship |
| A100 | 6,912 | 2020 | Data Center AI |
| H100 | 14,592 | 2023 | Data Center AI |
| B200 | 18,432 | 2025 | Data Center AI (Blackwell) |

CUDA Core Count Comparison (Consumer vs. Data Center)

Note: Data center GPUs (A100, H100, B200) may have fewer CUDA Cores than some consumer models, but they are overwhelming in Tensor Core count and memory bandwidth. What matters for LLM inference is the combination of Tensor Cores + VRAM Bandwidth, not just the CUDA Core count.

CUDA Core Execution Structure: SMs and Warps

CUDA Cores do not operate individually but are grouped into units called SMs (Streaming Multiprocessors):

flowchart TB
    subgraph GPU["NVIDIA GPU"]
        subgraph SM1["SM #0"]
            direction LR
            CC1["CUDA Cores ×128"]
            TC1["Tensor Cores ×4"]
            SRAM1["Shared Memory (SRAM)<br/>128 KB"]
        end
        subgraph SM2["SM #1"]
            direction LR
            CC2["CUDA Cores ×128"]
            TC2["Tensor Cores ×4"]
            SRAM2["Shared Memory (SRAM)<br/>128 KB"]
        end
        SMN["... SM #N"]
    end

    L2["L2 Cache (50-100 MB)"]
    HBM["VRAM / HBM (24-80 GB)"]

    SM1 & SM2 & SMN --> L2 --> HBM

Analogy in game development:

  • SM = Compute Unit (Shader execution unit)
  • Warp (32 threads) = SIMD lane (32 threads executing the same instruction simultaneously)
  • Shared Memory = groupshared memory in a Unity Compute Shader
CUDA Programming Hierarchy:
Grid (Total Work)
└── Block (Assigned to an SM)
    └── Warp (32 threads, simultaneous execution)
        └── Thread (Individual CUDA Core)

Game Shader Analogy:
Dispatch
└── Thread Group (Assigned to a Compute Unit)
    └── Wavefront/Warp (SIMD execution)
        └── Thread (Individual shader instance)
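The hierarchy above can be sketched as a small sizing helper. The 128-thread block matches the 128-CUDA-core SMs in the diagram and the 32-thread warp is fixed by the hardware; real kernels tune block size per workload:

```python
import math

def launch_shape(total_threads: int, block_size: int = 128, warp_size: int = 32):
    """How a CUDA launch decomposes: Grid -> Blocks -> Warps -> Threads."""
    blocks = math.ceil(total_threads / block_size)        # blocks in the grid
    warps_per_block = math.ceil(block_size / warp_size)   # warps scheduled per SM
    return blocks, warps_per_block

# Example: one thread per pixel of a 1080p frame
blocks, warps = launch_shape(1920 * 1080)
print(blocks, warps)  # 16200 4
```

This mirrors how you size a `Dispatch` call against `numthreads` in a compute shader.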

2. Tensor Core: Dedicated AI Matrix Multiplication Accelerator

What is a Tensor Core?

A Tensor Core is a hardware unit dedicated to matrix multiplication built into NVIDIA GPUs. While CUDA Cores perform scalar (single number) operations, Tensor Cores operate on whole matrices at a time.

Key Difference: Scalar vs. Matrix Operations

CUDA Core (Scalar):
  a × b + c = d
  → 1 operation results in 1 value

Tensor Core (Matrix):
  A(4×4) × B(4×4) + C(4×4) = D(4×4)
  → 1 operation results in 64 values (FMA: Fused Multiply-Add)

One Tensor Core performs a multiply-add on a 4x4 matrix in a single clock cycle. Performing the same operation with CUDA Cores would require 64 multiplications and 48 additions. This is why Tensor Cores are 10-20 times faster for matrix multiplication.
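The 64-multiply / 48-add figure is easy to verify in a few lines of Python (these counts cover A×B; the +C accumulate folds into the hardware FMA):

```python
# Scalar operations needed for an n x n matrix multiply.
def matmul_op_count(n=4):
    mults = n * n * n          # n multiplies per output element, n*n elements
    adds = n * n * (n - 1)     # n-1 additions to sum each dot product
    return mults, adds

print(matmul_op_count())  # (64, 48)
```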

Analogy in game development:

  • CUDA Core = General-purpose shader ALU (handles all types of math)
  • Tensor Core = RT Core (accelerates Ray Tracing only) or Texture Unit (accelerates texture filtering only)

Just as RT Cores accelerate BVH traversal in hardware, Tensor Cores accelerate matrix multiplication in hardware. While it’s possible in software, dedicated hardware is overwhelmingly faster.

Evolution of Tensor Cores

| Generation | GPU | Supported Precision | Key Improvements | Impact on LLMs |
| --- | --- | --- | --- | --- |
| 1st Gen | V100 (2017) | FP16 | First introduction of Tensor Cores | Beginning of AI training acceleration |
| 2nd Gen | A100 (2020) | FP16, BF16, TF32, INT8 | Support for Structured Sparsity | 2x inference speed, mixed precision |
| 3rd Gen | H100 (2023) | FP16, BF16, FP8, INT8 | FP8 support, Transformer Engine | Training/inference possible in FP8 |
| 4th Gen | B200 (2025) | FP16, BF16, FP8, FP4, INT8 | FP4 support, TMEM | Ultra-low precision inference, FlashAttention-4 |

Why Precision Matters

Tensor Cores support various precision modes. Lower precision is faster and uses less memory, but can sacrifice accuracy:

FP32 (32-bit): ████████████████████████████████  Highest accuracy, default speed
FP16 (16-bit): ████████████████                  Good accuracy, 2x faster
BF16 (16-bit): ████████████████                  Best for training, 2x faster
FP8  (8-bit):  ████████                          Sufficient accuracy, 4x faster
FP4  (4-bit):  ████                              Inference only, 8x faster
INT8 (8-bit):  ████████                          Quantized inference, 4x faster

BF16 vs. FP16:

  • FP16: Wide mantissa, narrow exponent → Precise but limited range for large numbers.
  • BF16: Narrow mantissa, wide exponent → Less precise but supports larger range → Ideal for gradients during training.

Game Analogy: Similar to the trade-off between FP16 textures and R11G11B10 textures in HDR rendering. It’s a balance between precision and memory/performance.
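The range difference falls straight out of the bit layouts (FP16: 5 exponent / 10 mantissa bits; BF16: 8 exponent / 7 mantissa bits):

```python
# Largest finite value of a binary float format, from its bit layout.
def max_finite(exp_bits: int, mant_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = bias                          # largest non-infinity exponent
    return (2 - 2 ** -mant_bits) * 2.0 ** max_exp

fp16_max = max_finite(5, 10)   # 65504.0  -> overflows easily during training
bf16_max = max_finite(8, 7)    # ~3.39e38 -> same range as FP32
print(fp16_max, bf16_max)
```

This is why BF16 gradients rarely overflow: it trades 3 mantissa bits for the full FP32 exponent range.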

Transformer Engine: Automatic Precision Management

Introduced with the H100, the Transformer Engine is an intelligent precision management system running on top of Tensor Cores:

flowchart LR
    Input["Input Tensor"] --> TE["Transformer Engine"]
    TE -->|"Layer-by-layer Analysis"| Decision["Determine Precision"]
    Decision -->|"FP8 is sufficient"| FP8["FP8 Tensor Core Op<br/>(Fast)"]
    Decision -->|"FP16 is required"| FP16["FP16 Tensor Core Op<br/>(Precise)"]
    FP8 & FP16 --> Output["Output Tensor"]

By analyzing data distribution for each Transformer layer in real-time, it automatically selects the lowest possible precision without loss of accuracy. This allows for optimal performance without manual quantization tuning.
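A minimal sketch of the underlying mechanism, amax-based per-tensor scaling, is below. This is a simplified illustration, not NVIDIA's actual implementation; real FP8 quantization also rounds the mantissa, which is omitted here. FP8 E4M3's largest finite value is 448:

```python
# Sketch of amax-based per-tensor scaling for FP8. The tensor's absolute
# maximum is mapped onto E4M3's representable range; small amax -> large
# scale -> more of the format's resolution is actually used.
FP8_E4M3_MAX = 448.0

def fp8_scale(tensor):
    amax = max(abs(x) for x in tensor)       # a running amax in the real engine
    scale = FP8_E4M3_MAX / amax if amax else 1.0
    # "Quantize": scale into FP8 range and clamp (mantissa rounding omitted)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale)) for x in tensor]
    # Dequantize for the next higher-precision operation
    return [x / scale for x in q], scale

deq, scale = fp8_scale([0.5, -2.0, 3.5])
print(scale)  # 128.0 (448 / 3.5)
```

If a layer's value distribution is too wide for any single scale to preserve accuracy, the engine falls back to FP16 for that layer, as in the diagram above.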

CUDA Core vs. Tensor Core: Division of Labor

In LLM inference, the two units work together:

flowchart TB
    subgraph Layer["One Transformer Layer"]
        direction TB
        LN1["LayerNorm<br/>→ CUDA Core"]
        ATT["Attention Q·K^T, Softmax·V<br/>→ Tensor Core (Matrix Mult)<br/>→ CUDA Core (Softmax)"]
        LN2["LayerNorm<br/>→ CUDA Core"]
        FFN["Feed Forward Network<br/>→ Tensor Core (Matrix Mult)<br/>→ CUDA Core (Activation)"]
    end

    LN1 --> ATT --> LN2 --> FFN

    style LN1 fill:#87CEEB
    style ATT fill:#90EE90
    style LN2 fill:#87CEEB
    style FFN fill:#90EE90

| Operation Phase | Primary Execution Unit | Computation Load (FLOPs) |
| --- | --- | --- |
| Matrix Multiplication (QKV, FFN) | Tensor Core | ~95% |
| Softmax, LayerNorm, Activations | CUDA Core | ~5% |

Since about 95% of total computation consists of matrix multiplication, the performance of Tensor Cores effectively determines LLM inference speed.
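A rough per-token FLOP count for one layer makes the imbalance visible. The dimensions assume a Llama-7B-like layer and the per-element costs are rough guesses; with these numbers the matmul share comes out even higher than the ~95% in the table, which is the point: matrix multiplication dominates.

```python
# Per-token FLOPs for one Transformer layer (order-of-magnitude sketch).
# Assumed shapes: d_model=4096, FFN dim=11008, 32 heads, 2048-token context.
d, ffn, heads, seq = 4096, 11008, 32, 2048

# Tensor Core work: 2 FLOPs per multiply-accumulate
qkv_proj    = 2 * d * (3 * d)        # Q, K, V projections
attn_matmul = 2 * 2 * seq * d        # Q·K^T and scores·V for one query
out_proj    = 2 * d * d
ffn_matmul  = 3 * (2 * d * ffn)      # gate/up/down projections (SwiGLU)
matmul_flops = qkv_proj + attn_matmul + out_proj + ffn_matmul

# CUDA Core work: a handful of ops per element (rough upper-bound guess)
elementwise_flops = 20 * d + 10 * ffn + 5 * heads * seq  # norms, SiLU, softmax

share = matmul_flops / (matmul_flops + elementwise_flops)
print(f"matmul share: {share:.2%}")
```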


3. Tensor Memory (TMEM): Dedicated On-chip Memory for Blackwell

What is TMEM?

Tensor Memory (TMEM) is dedicated on-chip memory for Tensor Cores newly introduced in the NVIDIA Blackwell (B200) architecture. With 256KB of dedicated SRAM per SM, it allows Tensor Cores to maintain results inside the chip instead of sending them out to VRAM (HBM).

Why is it Necessary?

In previous architectures (Hopper/H100 and earlier), intermediate results from Tensor Cores had to be stored in Shared Memory (SRAM) or VRAM (HBM). The issue is that Shared Memory is shared with CUDA Cores, leading to contention, and HBM access is slow.

Before (Hopper and earlier):
Tensor Core → Shared Memory (Contention with CUDA Cores) → HBM (Slow)
                    ↑ Bottleneck

Blackwell (With TMEM):
Tensor Core → TMEM (Dedicated, No Contention) → HBM (Only when needed)
                    ↑ Bottleneck Eliminated

Game Development Analogy

In a Unity rendering pipeline:

  • Previous: The shader writes intermediate results to a RenderTexture (VRAM) every time and reads them back.
  • TMEM: The shader maintains intermediate results in registers/LDS → Eliminates VRAM round-trips.

It’s similar to the optimization in a Compute Shader where you use groupshared memory to reduce global memory access.

Relationship with FlashAttention-4

One of the key factors that allowed FlashAttention-4 to achieve 1,600+ TFLOPS (based on early reports) on Blackwell is TMEM:

flowchart LR
    subgraph FA4["FlashAttention-4 Pipeline"]
        direction TB
        Load["Load Q, K, V from HBM"]
        MatMul["Matrix Mult<br/>(Tensor Core)"]
        TMEM_Store["Keep intermediate results in TMEM<br/>(Eliminate HBM Round-trips)"]
        Softmax["Calculate Softmax<br/>(CUDA Core)"]
        Final["Write final result only to HBM"]
    end

    Load --> MatMul --> TMEM_Store --> Softmax --> Final

Previously, intermediate results of Attention (the Q·K^T matrix) had to be written to HBM, but keeping them in TMEM bypasses the HBM bandwidth bottleneck.
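The trick that makes this possible is the online softmax: a running maximum and running denominator let each tile of scores be processed as it arrives, so the full Q·K^T row never has to exist at once. A single-row, pure-Python sketch (not the actual kernel):

```python
import math

def softmax(xs):
    """Reference softmax over a full row of attention scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def online_softmax(scores, tile=4):
    """Process scores tile by tile, tracking only a running max and sum."""
    m, s = float("-inf"), 0.0
    for i in range(0, len(scores), tile):
        block = scores[i:i + tile]
        new_m = max(m, max(block))
        # Rescale the old sum to the new max, then fold in the tile
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in block)
        m = new_m
    # The real kernel fuses this normalization into the output accumulator
    return [math.exp(x - m) / s for x in scores]

row = [0.1, 2.0, -1.0, 3.0, 0.5, -2.0, 1.5, 0.0, 2.5]
assert all(abs(a - b) < 1e-12
           for a, b in zip(online_softmax(row), softmax(row)))
```

Per tile, only the running `(m, s)` pair and the current block need to be on-chip, which is exactly what TMEM-resident intermediates exploit.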

On-chip Memory Comparison by GPU Generation

| Architecture | GPU | Shared Memory / SM | Tensor Core Dedicated Memory | Notes |
| --- | --- | --- | --- | --- |
| Ampere | A100 | 164 KB | None | Shared by CUDA/Tensor Cores |
| Hopper | H100 | 228 KB | None (improved Shared Memory) | Asynchronous access with TMA |
| Blackwell | B200 | 228 KB | 256 KB TMEM | Separated for Tensor Cores |

4. NPU (Neural Processing Unit): Dedicated AI Chips Outside the GPU

What is an NPU?

An NPU (Neural Processing Unit) is an independent processor specialized for AI inference. While a GPU’s Tensor Core is an “AI acceleration unit” inside the GPU, an NPU is a separate chip (or core block) from the CPU/GPU.

flowchart TB
    subgraph SoC["Apple M4 Chip (Single SoC)"]
        direction LR
        CPU_Block["CPU<br/>12-core<br/>General purpose"]
        GPU_Block["GPU<br/>40-core<br/>Graphics + Parallel"]
        NPU_Block["Neural Engine (NPU)<br/>16-core<br/>AI Inference Only"]
        Media["Media Engine<br/>Video Enc/Dec"]
    end

    UMA["Unified Memory (UMA)"]
    CPU_Block & GPU_Block & NPU_Block & Media --> UMA

    style NPU_Block fill:#87CEEB

NPU vs. GPU: Differences in Design Philosophy

| | GPU (NVIDIA) | NPU (Apple Neural Engine, etc.) |
| --- | --- | --- |
| Design Goal | Max throughput | Max power efficiency (perf/watt) |
| Precision | FP32 to FP4 (flexible) | Fixed INT8/FP16 (limited) |
| Memory | Dedicated VRAM (up to 80GB) | Shared system RAM |
| Power Consumption | 300-700W | 5-15W |
| Programming | CUDA (high flexibility) | Core ML, ONNX (limited) |
| Best For | Training + large model inference | Lightweight inference, on-device AI |

Major NPU Types

| NPU | Device | Performance (TOPS) | Features |
| --- | --- | --- | --- |
| Apple Neural Engine | iPhone, iPad, Mac | 38 TOPS (M4) | High bandwidth access via UMA |
| Qualcomm Hexagon | Android smartphones | 45 TOPS (SD 8 Gen 3) | Optimized for mobile AI |
| Intel NPU | Latest Intel laptops | 10-11 TOPS | Windows AI PC |
| Google TPU | Google data centers | Hundreds of TOPS | Server-grade AI chip |

TOPS: Tera Operations Per Second. Usually measured based on INT8.
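A back-of-envelope shows what 38 TOPS means for a small LLM. The 2-ops-per-parameter rule of thumb, the 3B model size, and the 30% utilization figure are all assumptions:

```python
# What 38 INT8 TOPS buys for a 3B-parameter on-device model.
params = 3e9
ops_per_token = 2 * params   # ~2 ops per parameter per generated token
npu_tops = 38e12             # Apple M4 Neural Engine, INT8 peak
utilization = 0.3            # real kernels rarely sustain peak TOPS

tokens_per_sec = npu_tops * utilization / ops_per_token
print(round(tokens_per_sec))  # 1900
```

The compute-side ceiling is huge; in practice, shared system RAM bandwidth caps real throughput far below this, which is why the next section focuses on what NPUs are and are not suited for.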

When NPUs are Suitable (and Not) for LLMs

Suitable For:

  • Running lightweight models (under 3B) on smartphones (On-device AI).
  • Small models like voice recognition or image classification.
  • Mobile environments where battery life is critical.
  • Always-on background AI features.

NOT Suitable For:

  • Large model inference (70B+ models require more memory/compute).
  • Model training (no training functionality).
  • Complex custom operations (lack of programming flexibility).
  • Long context processing (KV cache memory limitations).
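The KV cache limitation in the last bullet is easy to quantify. The shapes below assume a Llama-7B-like model (no grouped-query attention) storing the cache in FP16:

```python
# KV cache footprint: one Key and one Value vector per layer, per head,
# per token position.
def kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                   seq_len=4096, bytes_per_elem=2):   # FP16 = 2 bytes
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

gib = kv_cache_bytes() / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB for a 4K context
```

Two gibibytes for the cache alone, before any weights, is already more than most NPU-accessible memory budgets allow, and it doubles with every doubling of context length.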

Potential NPU Use in Game Development

| Use Case | Description | Feasibility |
| --- | --- | --- |
| NPC Dialogue | Generate dialogue locally with lightweight LLMs | Possible (3B model level) |
| Voice Recognition | Recognize player voice commands | Practical (e.g., Whisper) |
| Image Recognition | Camera/AR-based game features | Practical |
| Procedural Generation | AI-based content/level generation | Limited (model size limits) |
| Real-time Translation | Translate multiplayer chat | Possible (lightweight translation models) |

5. Overall Comparison: CPU vs. CUDA Core vs. Tensor Core vs. NPU

At a Glance

flowchart TB
    subgraph Comparison["Compute Unit Comparison"]
        direction LR
        CPU_C["CPU<br/>━━━━━━━━<br/>Cores: 8-24<br/>Pro: Complex branching<br/>Con: Low parallelism<br/>Analogy: One genius"]
        CUDA_C["CUDA Core<br/>━━━━━━━━<br/>Cores: Thousands<br/>Pro: Massive parallel<br/>Con: No complex logic<br/>Analogy: 5,000 workers"]
        TC_C["Tensor Core<br/>━━━━━━━━<br/>Cores: Hundreds<br/>Pro: Best for matrix mult<br/>Con: Matrix only<br/>Analogy: 500 matrix experts"]
        NPU_C["NPU<br/>━━━━━━━━<br/>Cores: 16-32<br/>Pro: Low-power AI<br/>Con: No versatility<br/>Analogy: Low-power AI chip"]
    end

    style CPU_C fill:#FFB6C1
    style CUDA_C fill:#FFFACD
    style TC_C fill:#90EE90
    style NPU_C fill:#87CEEB

Detailed Comparison Table

| Feature | CPU | CUDA Core | Tensor Core | NPU |
| --- | --- | --- | --- | --- |
| Core Count | 8-24 | Thousands to tens of thousands | Hundreds | 16-32 |
| Operation Unit | Scalar | Scalar | Matrix (4×4) | Diverse |
| FLOPs per Clock | Low | Medium | Extremely high | Medium |
| Power Draw | 65-150W | 300-700W (total GPU) | Included in GPU | 5-15W |
| Programming | C/C++ | CUDA C++ | CUDA + libraries | Core ML/ONNX |
| Flexibility | Highest | High | Low (matrix only) | Very low |
| Role in LLMs | Pre/post-processing | Softmax, LayerNorm | Matrix mult (~95%) | Lightweight inference |
| Game Analogy | Game logic | Vertex/fragment shaders | RT Cores | Mobile AI chip |

Data Flow during LLM Inference

flowchart TB
    Input["Input Tokens"] --> Embed["Embedding Lookup<br/>→ CUDA Core"]

    subgraph Layers["N × Transformer Layers"]
        direction TB
        LN1["LayerNorm → CUDA Core"]
        QKV["Q, K, V Matrix Mult → Tensor Core"]
        ATT["Attention Scores → Tensor Core + CUDA Core"]
        LN2["LayerNorm → CUDA Core"]
        FFN_MM["FFN Matrix Mult → Tensor Core"]
        ACT["Activation (SiLU) → CUDA Core"]
    end

    Embed --> LN1 --> QKV --> ATT --> LN2 --> FFN_MM --> ACT

    ACT --> Output["Output Token Probabilities → CUDA Core (Softmax)"]
    Output --> Sample["Token Sampling → CPU"]

Summary

Compute Pipeline for LLM Inference:

CPU:         Token pre/post-processing, sampling, I/O
              ↓
CUDA Core:   LayerNorm, Softmax, Activation functions (~5% FLOPs)
              ↓
Tensor Core: Matrix Multiplication - Attention, FFN (~95% FLOPs) ← The Key to Performance
              ↓
TMEM:        Holding Tensor Core intermediate results (Blackwell)
              ↓
VRAM (HBM):  Storage for weights, KV cache, activation data ← Bandwidth is the bottleneck

Key Points:

  1. CUDA Cores are general-purpose parallel units. In LLMs, they handle non-matrix operations (~5%).
  2. Tensor Cores are dedicated matrix multiplication accelerators. They determine ~95% of LLM inference performance.
  3. TMEM is Blackwell’s dedicated memory for Tensor Cores. It eliminates HBM round-trips, a key factor in FlashAttention-4 performance.
  4. NPUs are dedicated low-power AI chips. They are unsuitable for large LLMs but ideal for mobile/on-device lightweight AI.
  5. When choosing a GPU, the combination of Tensor Core Generation + VRAM Capacity + Memory Bandwidth determines LLM performance.
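Point 5 can be turned into a rough decode-speed ceiling: for single-stream generation, every weight must be read from VRAM once per token, so bandwidth divided by model size bounds tokens per second. The bandwidth figures below are published specs; the model sizes are illustrative:

```python
# Memory-bandwidth-bound ceiling on single-stream decode speed.
def decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound: all weights are read once per generated token."""
    return bandwidth_gb_s / model_gb

# RTX 4090 (~1008 GB/s) running a 7B model quantized to ~4 GB:
print(round(decode_tokens_per_sec(1008, 4)))   # 252 tokens/s ceiling

# H100 SXM (~3350 GB/s) running a 70B model at FP8 (~70 GB):
print(round(decode_tokens_per_sec(3350, 70)))  # 48 tokens/s ceiling
```

Note that neither Tensor Core count nor TOPS appears in this formula: once compute is fast enough, bandwidth is what you are actually buying.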


This post is licensed under CC BY 4.0 by the author.