Unity Profiler Optimization

Posted Nov 15, 2023

Intermediate

By Sehyup

1. Unity Profiler structure
2. Profiler threads
3. Sample stack vs call stack
4. About markers
5. Finding bottlenecks
6. Graphics batching

Unity Profiler

A tool for finding and fixing performance problems quickly, either in the Editor or in a development build.

Unity Profiler structure

Enable the Development Build option

Most of the additional build options are not required. They mainly enable automatic profiler connection and deep profiling.
Autoconnect Profiler bakes the build machine's IP address into the player, so automatic connection works only with the Editor on the machine that made the build.

Profiler - CPU module

You can inspect data per sample.
You can verify how much CPU time each process consumes.

Profiler - Chart view

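Check whether each frame finishes within the budget for your target FPS. At 60 FPS, a frame's work must finish within about 16.67 ms; at 30 FPS, within about 33.33 ms.
Check whether spikes (frames far over the budget) appear in the chart.
If VSync is on, frame time is clamped to the display refresh interval (about 16.67 ms on a 60 Hz display), which hides how fast the frame could actually run. For profiling, turn VSync off.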
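The frame budgets above are just 1000 ms divided by the target frame rate. As a minimal sketch of that arithmetic (plain Python, not a Unity API; the function names and the sample frame times are made up for illustration):

```python
def frame_budget_ms(target_fps: float) -> float:
    """Return the time available per frame, in milliseconds."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return 1000.0 / target_fps


def over_budget(frame_times_ms, target_fps):
    """Return the indices of frames that exceeded the budget (the spikes)."""
    budget = frame_budget_ms(target_fps)
    return [i for i, t in enumerate(frame_times_ms) if t > budget]


# Hypothetical frame times in milliseconds, as they might be read off the chart.
samples = [15.9, 16.2, 41.0, 16.4, 17.1]
spikes = over_budget(samples, 60)  # frames 2 and 4 miss the 60 FPS budget
```

The same check works for any target: pass 30 instead of 60 to test against the 33.33 ms budget.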

Profiler - Details window, Timeline view

CPU usage time is easy to understand visually.
You can inspect all threads at a glance.
You can track timing and execution order relationships linearly.

Profiler - Details window, Hierarchy view

Understand parent-child call relationships.
Sort by the metric you care about.

Start by fixing the longest-running samples first.

Profiler - Threads

Main Thread
1. Unity Player Loop runs (Awake, Start, etc.)
2. MonoBehaviour scripts primarily run here

Render Thread
1. Assembles the graphics commands that are sent to the GPU
  Draw calls are issued on the Main Thread; the Render Thread turns them into graphics commands and submits them to the GPU

Worker Threads (Job Threads)
1. Asynchronous parallel work from Job System, etc.
2. Compute-heavy tasks such as animation/physics run here
  Jobs are scheduled on Main Thread and processed on Worker Threads

There can be causality between methods across different threads even if they do not call each other directly.
ex1. Job scheduled > processed on worker thread
ex2. Main thread MeshRenderer.Draw() > graphics commands assembled on render thread
If main-thread work is delayed, the render thread can sit idle.

Enable Show Flow Events to inspect execution order and causality.
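The "scheduled on one thread, processed on another" relationship from ex1 can be illustrated outside Unity. The sketch below uses Python's standard thread pool as a stand-in for worker threads; it is an analogy, not the Unity Job System, and `heavy_job` is a hypothetical workload.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def heavy_job(values):
    """Stand-in for compute-heavy work; also reports which thread ran it."""
    return sum(v * v for v in values), threading.get_ident()


scheduling_thread = threading.get_ident()

with ThreadPoolExecutor(max_workers=2) as workers:
    # "Schedule": returns immediately with a handle to the pending work.
    handle = workers.submit(heavy_job, range(1000))
    # ... the scheduling thread is free to do other work here ...
    # "Complete": blocks the scheduling thread until the job has finished.
    result, worker_thread = handle.result()

# The job never appears in the scheduling thread's own call chain,
# yet the scheduling thread had to wait for it: causality without a direct call.
ran_elsewhere = worker_thread != scheduling_thread
```

The blocking `handle.result()` call plays the role of a forced synchronization point: the longer the job takes, the longer the scheduling thread stalls.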

Sample stack vs call stack

A sample stack is not a call stack. The sample stack contains only the C# methods and code blocks that have been instrumented with profiler markers, so it is coarser than the real call stack.
By default, Unity does not record every C# method call; it records only marked methods and blocks.
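To make the difference concrete, here is a small model of marker-based sampling in plain Python. It is only a sketch of the idea, not how the Unity Profiler is implemented; `SampleRecorder`, `marker`, and the function names are all hypothetical.

```python
import time
from contextlib import contextmanager


class SampleRecorder:
    """Records a sample only for code wrapped in an explicit marker."""

    def __init__(self):
        self.samples = []  # entries of (depth, name, elapsed_ms)
        self._depth = 0

    @contextmanager
    def marker(self, name):
        depth = self._depth
        index = len(self.samples)
        self.samples.append(None)  # reserve the slot so parents precede children
        self._depth += 1
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self._depth -= 1
            self.samples[index] = (depth, name, elapsed_ms)


recorder = SampleRecorder()


def unmarked_helper():
    # Called twice below, but never shows up in the sample stack.
    return sum(range(1000))


def update_enemies():
    with recorder.marker("UpdateEnemies"):
        unmarked_helper()
        with recorder.marker("Pathfinding"):
            unmarked_helper()


with recorder.marker("PlayerLoop"):
    update_enemies()

names = [(depth, name) for depth, name, _ in recorder.samples]
```

The real call stack here includes `unmarked_helper`, but the recorded sample stack holds only the three marked blocks; the helper's time is folded into whichever marked parent called it.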

Notes for deep profiling

In Deep Profiling, every C# call (including constructors/properties) is marked.
This introduces heavy profiling overhead and can make data less accurate.

So deep profiling should be used only in a very limited scope and short time window.

How to enable call stack

First enable the Call Stacks button (it becomes highlighted).
Then, in the Call Stacks dropdown, choose the marker you want.

For specific samples, you can record full call stacks.
1. GC.Alloc: managed allocation occurred
2. UnsafeUtility.Malloc: unmanaged allocation that must be freed manually
3. JobHandle.Complete: main thread force-synchronized job completion
Recording call stacks adds overhead, so it is not recommended for regular use; enable it only in limited cases.

About markers

1. Main loop markers

PlayerLoop: root of samples executed by player loop
BehaviourUpdate: holder for Update() samples
FixedBehaviourUpdate: holder for FixedUpdate() samples
EditorLoop: editor-only loop

2. Graphics markers (Main Thread and Render Thread)

WaitForTargetFPS (Main Thread): time spent waiting for VSync or the target frame rate
Gfx.WaitForPresentOnGfxThread (Main Thread): the main thread is waiting for the render thread, which is itself waiting on the GPU
Gfx.PresentFrame (Render Thread): waiting for the GPU to render and present the current frame; if this is long, GPU-side processing is slow
Gfx.WaitForCommands (Render Thread): the render thread is ready for new commands, but the main thread is not feeding them yet, so it waits

Finding bottlenecks

Graphics markers are useful for telling CPU-bound frames from GPU-bound ones.
If the Main Thread is waiting for the Render Thread, the bottleneck may lie in the handoff between threads: render commands are generated in the late stages of the player loop.
In other words, do not only ask “GPU or CPU”; also check for cross-thread bottlenecks.

CPU Main Thread bound
The main thread is slow, so the render thread waits for commands

Render Thread bound
The render thread is still submitting the previous frame's draw commands, so the main thread waits

Worker Thread bound
The main thread is blocked, synchronously waiting for jobs to complete

Xcode Frame Debugger and newer Unity profiler versions can show CPU/GPU bound hints.

There are 4 major bottleneck types

CPU Main Thread bound
CPU Worker Thread bound (physics, animation, job system)
CPU Render Thread bound (CPU-side command assembly/transfer to GPU, not GPU core bottleneck itself)
GPU bound

Typical bottleneck triage flow
Is the main thread the bottleneck? If so, optimize the player loop first
If not, look at the worker threads: physics, animation, job system
If that is not it either, inspect the render thread, then separate CPU-side causes from GPU-side ones
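The triage order can be written down as a small decision function. This is a deliberately simplified sketch: the four timing inputs are hypothetical per-frame numbers you would estimate from the Profiler yourself (main-thread time excluding waits, main-thread time blocked on jobs, render-thread time excluding GPU waits, and GPU time), and real frames rarely separate this cleanly.

```python
from enum import Enum


class Bottleneck(Enum):
    MAIN_THREAD = "CPU Main Thread bound"
    WORKER_THREAD = "CPU Worker Thread bound"
    RENDER_THREAD = "CPU Render Thread bound"
    GPU = "GPU bound"
    NONE = "within budget"


def triage(main_work_ms, job_wait_ms, render_work_ms, gpu_ms, budget_ms):
    """Apply the checks in the order described: main, workers, render, GPU."""
    if main_work_ms > budget_ms:
        return Bottleneck.MAIN_THREAD
    if main_work_ms + job_wait_ms > budget_ms:
        # The main thread's own work fits, but waiting on jobs pushes it over.
        return Bottleneck.WORKER_THREAD
    if render_work_ms > budget_ms:
        return Bottleneck.RENDER_THREAD
    if gpu_ms > budget_ms:
        return Bottleneck.GPU
    return Bottleneck.NONE


BUDGET_60_FPS = 1000.0 / 60.0
verdict = triage(main_work_ms=8.0, job_wait_ms=0.0,
                 render_work_ms=18.0, gpu_ms=5.0, budget_ms=BUDGET_60_FPS)
```

With the made-up numbers above, the main thread and jobs fit the budget but the render thread does not, so the verdict is a render-thread CPU bottleneck rather than a GPU one.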

If it is a render-thread CPU bottleneck
CPU graphics optimization
Camera/culling optimization
Reduce SetPass calls (batching)
Use graphics batching where possible: SRP batching, Dynamic batching, Static batching, GPU instancing

General facts

Before batching, one important point.
Graphics pipeline delays can come from:
CPU-side delay while assembling commands (more common than pure GPU weakness today)
CPU->GPU command/resource upload delay
GPU internal processing delay
A draw call is the CPU telling the GPU to execute a render.
In many cases, the CPU cost and upload delay of changing render state outweigh the draw call itself.
Often the expensive part is what happens just before the “draw”.
GPUs generally prefer a few large meshes over many tiny ones.
Many rendering problems come not from weak GPU compute but from using the GPU inefficiently.
Sending many tiny meshes leaves GPU execution units (wavefronts/warps) underfilled.
Example: if a unit can process 256 vertices at a time and you keep feeding it 128, half of its capacity goes unused.
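The 256-versus-128 example can be turned into a quick calculation. The model below is a rough illustration only: it assumes work is handed out in fixed-size chunks and that a partly filled chunk costs as much as a full one. The capacity of 256 is the illustrative figure from the example, not a real hardware constant.

```python
import math


def utilization(vertices_per_draw: int, unit_capacity: int) -> float:
    """Fraction of the occupied execution capacity that does useful work."""
    if vertices_per_draw <= 0 or unit_capacity <= 0:
        raise ValueError("both arguments must be positive")
    # A partly filled chunk still occupies a whole unit.
    chunks = math.ceil(vertices_per_draw / unit_capacity)
    return vertices_per_draw / (chunks * unit_capacity)


half_used = utilization(128, 256)   # 0.5: half of the unit is idle
fully_used = utilization(256, 256)  # 1.0: the unit is exactly filled
```

Under this model a 300-vertex mesh is also wasteful (it occupies two chunks but fills barely more than one), which is why larger meshes tend to amortize the waste better than tiny ones.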

Graphics batching

1. SRP batching (URP, HDRP)

Repeatedly switching render state (for example, to a different shader) before draw commands is often a bigger cost than the draws themselves.
Group draws that use the same shader variant; the materials may differ
Bundle multiple draw calls under one SetPass call (same shader variant)
Per-material data: uploaded once, up front, into a large list
Per-object data: uploaded every frame into a large list
Each draw selects its mesh data from the lists by index/offset, then calls Draw()
Reducing the number of shaders used in the project helps this optimization.
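Why fewer shaders means fewer SetPass calls can be seen with a toy model: count one SetPass each time the shader variant changes from one draw to the next. This is a simplification for illustration (the real batcher has further compatibility rules and does not reorder draws freely); the draw list and the `shader_variant` key are made up.

```python
from itertools import groupby


def count_setpass_calls(draws):
    """One SetPass per run of consecutive draws sharing a shader variant."""
    return sum(1 for _ in groupby(draws, key=lambda d: d["shader_variant"]))


def group_by_variant(draws):
    """Reorder draws so that identical shader variants are adjacent."""
    return sorted(draws, key=lambda d: d["shader_variant"])


# Hypothetical frame: two shader variants, interleaved.
draws = [
    {"mesh": "crate", "shader_variant": "Lit"},
    {"mesh": "glass", "shader_variant": "Transparent"},
    {"mesh": "barrel", "shader_variant": "Lit"},
    {"mesh": "window", "shader_variant": "Transparent"},
    {"mesh": "wall", "shader_variant": "Lit"},
]

interleaved = count_setpass_calls(draws)                  # 5 state changes
grouped = count_setpass_calls(group_by_variant(draws))    # 2 state changes
```

The number of draws is the same in both cases (five); only the number of state changes drops, which is the cost this kind of batching targets.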

2. Static batching (Static)

GPUs like drawing large meshes at once. Concept: reduce transfer overhead.
Pre-merge non-moving meshes and bake them -> upload to the GPU ahead of time -> call DrawIndexed() per renderer
Very fast on both CPU and GPU
The Unity Editor does the baking only when building the app
Downside: every instance is stored as unique geometry in the merged mesh, which increases memory usage
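The memory downside is easy to estimate: a shared mesh is stored once, while a merged mesh stores a copy of the geometry for every instance. A back-of-the-envelope sketch, with made-up counts and a hypothetical helper name:

```python
def vertices_stored(instance_count: int, vertices_per_mesh: int,
                    static_batched: bool) -> int:
    """Vertices held in memory for many copies of the same mesh."""
    if static_batched:
        # The merged mesh contains every instance's geometry separately.
        return instance_count * vertices_per_mesh
    # Without merging, all instances reference one shared mesh.
    return vertices_per_mesh


shared = vertices_stored(100, 500, static_batched=False)  # 500
merged = vertices_stored(100, 500, static_batched=True)   # 50000
```

For 100 copies of a 500-vertex mesh, merging multiplies the stored geometry by the instance count, so scenes with many repeats of one mesh pay the most.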

3. Dynamic batching (Dynamic)

GPUs like large meshes at once. Concept: reduce transfer overhead. Generally not highly recommended.
Merge meshes every frame > run one Draw()
Good from the GPU's perspective: it receives one mesh and one draw command, so processing is very fast
Costly from the CPU's perspective: the merge is re-baked every frame
In some cases, merging the meshes costs more than issuing the many draw calls would have

4. GPU instancing

Reduce command delivery from CPU to GPU.
For identical mesh + identical shader/material
Upload mesh data to GPU once
Per-instance unique data (object-to-world matrix) is sent as an array
Very fast CPU-side when drawing many identical objects
Instancing very low-vertex meshes can be inefficient, especially at low instance counts (<500 instances)
GPUs prefer large meshes; meshes with <=256 vertices often gain less.
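The grouping rule behind instancing (identical mesh plus identical material, with per-instance matrices sent as an array) can be sketched as follows. This is a plain-Python model, not a Unity API; the object dictionaries and the `build_instanced_batches` name are hypothetical, and matrices are abbreviated to short tuples.

```python
from collections import defaultdict


def build_instanced_batches(objects):
    """Group objects by (mesh, material); each group becomes one instanced draw."""
    batches = defaultdict(list)
    for obj in objects:
        # Only the per-instance data (here, the transform) goes into the array.
        batches[(obj["mesh"], obj["material"])].append(obj["matrix"])
    return dict(batches)


# Hypothetical scene: three trees and one rock.
scene = [
    {"mesh": "tree", "material": "bark", "matrix": (0.0, 0.0, 0.0)},
    {"mesh": "tree", "material": "bark", "matrix": (4.0, 0.0, 1.0)},
    {"mesh": "rock", "material": "stone", "matrix": (2.0, 0.0, 3.0)},
    {"mesh": "tree", "material": "bark", "matrix": (7.0, 0.0, 2.0)},
]

batches = build_instanced_batches(scene)
draw_commands = len(batches)  # 2 instead of 4
```

The tree mesh is uploaded once and drawn with an array of three transforms; the lone rock gains nothing from instancing, which matches the point that the benefit grows with the number of identical objects.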

Summary

Typical efficiency: SRP batching, Static batching > GPU instancing > Dynamic batching
CPU cost before the draw call, especially render state setup, is often larger than the draw call itself
Focus on reducing SetPass calls (SRP batching) before anything else, while still reducing draw calls as needed
Before draw-call reductions (instancing/dynamic batching), enabling SRP batching and reducing the number of distinct shaders usually has the highest impact

Reduce SetPass calls!!

Use the Frame Debugger to see why SetPass calls are not being merged.
As a target, aim for fewer than 300 SetPass calls.

If GPU rendering is the bottleneck

Xcode GPU Frame Capture
Commands are listed in sequence; inspect time cost at each render stage.
You can find draws with abnormal cost -> find the shader/mesh used by that draw and optimize them.

Reference
Notes from lecture by Je-min Lee (Retro Unity Partnership Engineer).
IJEMIN GitHub

This post is licensed under CC BY 4.0 by the author.
