
Unity Mobile Optimization Practical Guide - From Profiler to Memory Architecture

Optimization Series (3 / 4)
  1. Unity Addressables Optimization Guide
  2. Unity Profiler Optimization
  3. Unity Mobile Optimization Practical Guide - From Profiler to Memory Architecture
  4. Unity & iOS Memory Architecture

Introduction

Optimization is unavoidable in mobile game development. Unlike PC or console, mobile devices always run under three constraints: limited memory, thermal throttling, and battery drain. No matter how fun the game is, players leave if frame rate drops due to heat or if the app is terminated because of memory pressure.

This document summarizes practical optimization methods you can apply in Unity mobile projects. It covers how to read the profiler, graphics batching, asset bundle optimization, shader variant management, and iOS memory architecture.

This guide is based on official Unity sessions and hands-on profiling experience in production. The best strategy can differ by project, so always verify with direct profiler measurements.


Part 1 : Mastering the Unity Profiler

Optimizing without profiling is like sailing without a map. Do not rely on gut feelings like "it seems slow"; find bottlenecks in the profiler numbers.

1. Core CPU profiler principle

Before opening profiler details, first check your frame budget based on target FPS.

| Target FPS | Frame Budget | Meaning |
| --- | --- | --- |
| 60 fps | 16.67 ms | Most work must finish within ~16 ms |
| 30 fps | 33.33 ms | All work must finish within ~33 ms |

(Figure: frame budget distribution example at a 60 fps / 16.67 ms baseline)

If the profiler graph shows frames beyond this budget, that is your bottleneck.


Profile with VSync disabled. If VSync is on, frame times are padded out to the refresh interval (~16.7 ms at 60 Hz), hiding the real processing time. For accurate profiling, always measure with VSync off.
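The VSync state can also be forced from code at the start of a profiling session. A minimal sketch (the `ProfilingBootstrap` class name is illustrative):

```csharp
using UnityEngine;

// Illustrative helper: force VSync off and uncap the frame rate
// before a profiling session, so real frame cost is visible.
public class ProfilingBootstrap : MonoBehaviour
{
    void Awake()
    {
        QualitySettings.vSyncCount = 0;     // disable VSync
        Application.targetFrameRate = -1;   // remove the FPS cap
    }
}
```

Note that most mobile displays are always synced and `vSyncCount` is ignored there (`Application.targetFrameRate` governs instead), so this mainly matters for Editor and standalone profiling.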


2. Unity multi-thread architecture

Unity is a multi-threaded engine. To read the timeline correctly, you must understand each thread’s role.

A restaurant analogy works well. The Main Thread is the head chef taking orders and deciding execution order. The Render Thread is the server delivering completed dishes to customers. Worker (Job) Threads are assistant cooks handling time-consuming prep work in parallel.

flowchart LR
    subgraph Main["Main Thread"]
        M1["Player Loop<br/>(Awake, Start, Update...)"]
        M2["MonoBehaviour scripts"]
        M3["Draw call commands"]
    end

    subgraph Render["Render Thread"]
        R1["Assemble GPU commands"]
        R2["Send to GPU"]
    end

    subgraph Worker["Worker (Job) Threads"]
        W1["Animation bone processing"]
        W2["Physics simulation"]
        W3["Job system tasks"]
    end

    M3 -->|Draw request| R1
    M1 -->|Schedule jobs| W1

| Thread | Role | Main work |
| --- | --- | --- |
| Main Thread | Orchestrates game logic | Player Loop, MonoBehaviour, draw call requests |
| Render Thread | GPU communication | Build graphics commands and send to GPU |
| Worker Threads | Parallel compute-heavy tasks | Animation bones, physics simulation, job system |

There is causality between threads. If Main Thread schedules jobs, Worker Threads execute them. If Main Thread requests draw calls, Render Thread assembles commands.

Enable profiler Show Flow Events to visualize cross-thread execution order and causality.


3. Sampling vs Deep Profiling

In the profiler, the Sample Stack and the full Call Stack are different. The Sample Stack shows only code regions Unity has instrumented with profiler markers, so many operations appear grouped into larger blocks.
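You can make your own hot paths show up as named samples, without deep profiling, by wrapping them in `ProfilerMarker` from `Unity.Profiling`. A minimal sketch (class and method names are illustrative):

```csharp
using Unity.Profiling;
using UnityEngine;

public class PathfindingSystem : MonoBehaviour
{
    // Cached static marker: create once, reuse every frame.
    static readonly ProfilerMarker s_RecalcMarker =
        new ProfilerMarker("PathfindingSystem.Recalculate");

    void Update()
    {
        // Everything inside this scope appears as its own
        // named block in the profiler timeline.
        using (s_RecalcMarker.Auto())
        {
            Recalculate();
        }
    }

    void Recalculate() { /* expensive work (illustrative) */ }
}
```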

So should you enable Deep Profiling to see everything? Theoretically yes, but not recommended for regular production diagnosis.

| Mode | Pros | Cons |
| --- | --- | --- |
| Normal sampling | Low overhead, close to real performance | Only marked methods are visible |
| Deep profiling | Can trace every method call | Heavy profiling overhead -> distorted, less reliable data |

Use deep profiling only in a narrow scope and short capture window.

You can also enable call stack recording selectively for specific sample types.

| Call stack target | Meaning |
| --- | --- |
| GC.Alloc | Trace where managed allocations occur |
| UnsafeUtility.Malloc | Unmanaged allocations (manual free required) |
| JobHandle.Complete | Point where the Main Thread forces job synchronization |
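The `JobHandle.Complete` row deserves a concrete picture: calling `Complete` too early turns parallel work into a main-thread stall. A minimal job sketch (the job and its workload are illustrative):

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;

public class ScaleJobRunner : MonoBehaviour
{
    struct ScaleJob : IJob
    {
        public NativeArray<float> Values;

        public void Execute()
        {
            for (int i = 0; i < Values.Length; i++)
                Values[i] *= 2f;
        }
    }

    void Update()
    {
        var values = new NativeArray<float>(1024, Allocator.TempJob);
        JobHandle handle = new ScaleJob { Values = values }.Schedule();

        // ...do unrelated main-thread work here so the job
        // actually runs in parallel on a worker thread...

        // Complete() blocks the main thread until the job finishes.
        // Calling it right after Schedule() shows up as
        // JobHandle.Complete in the profiler and wastes parallelism.
        handle.Complete();
        values.Dispose();
    }
}
```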


4. Reading graphics markers

To diagnose bottlenecks correctly, understand common graphics markers in timeline view.

| Marker | Meaning | Typical cause |
| --- | --- | --- |
| WaitForTargetFPS | Time spent waiting for the target framerate | Appears when VSync is enabled (normal) |
| Gfx.WaitForPresentOnGfxThread | Main Thread also waits because the Render Thread is waiting on the GPU | Render thread bottleneck |
| Gfx.PresentFrame | Waiting for the GPU to finish the current frame | GPU processing delay |
| Gfx.WaitForCommands | Render Thread is ready but the Main Thread cannot feed it commands | Main thread bottleneck |


5. Bottleneck identification strategy

Unity bottlenecks are broadly classified into four types. The important question is not just “CPU vs GPU” but which thread is bottlenecked.

flowchart TD
    A["Frame budget exceeded"] --> B{"Any wait marker<br/>on Main Thread?"}
    B -->|No - Main Thread itself is busy| C["1) CPU Main Thread bound<br/>-> Optimize Player Loop"]
    B -->|Yes| D{"Which wait marker?"}
    D -->|Gfx.WaitForCommands| E{"Are worker threads<br/>busy?"}
    D -->|Gfx.WaitForPresent| F["3) CPU Render Thread bound<br/>-> Optimize draw calls/batching"]
    D -->|Gfx.PresentFrame| G["4) GPU bound<br/>-> Optimize shader/resolution/overdraw"]
    E -->|Yes| H["2) CPU Worker Thread bound<br/>-> Optimize physics/animation/jobs"]
    E -->|No| C

| Bottleneck type | Main cause | Optimization direction |
| --- | --- | --- |
| 1) Main Thread | Heavy scripts, GC allocs | Better algorithms, caching, reduce GC |
| 2) Worker Threads | Physics/animation overload | Reduce physics workload, use LOD |
| 3) Render Thread | Too many draw calls / SetPass Calls | Batching strategy, shader consolidation |
| 4) GPU | Overdraw, heavy shaders | Adjust resolution, simplify shaders |
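For the Main Thread row, a common source of per-frame GC allocs is allocating inside Update. A hedged before/after sketch using a non-allocating physics query:

```csharp
using System.Collections.Generic;
using UnityEngine;

public class NearbyColliderQuery : MonoBehaviour
{
    // BAD (illustrative): `new List<Collider>()` inside Update
    // allocates every frame and creates steady GC pressure.

    // GOOD: allocate once, then Clear() and reuse each frame.
    readonly List<Collider> _results = new List<Collider>(64);
    readonly Collider[] _buffer = new Collider[64];

    void Update()
    {
        _results.Clear();
        // NonAlloc variant writes into a preallocated buffer
        // instead of returning a fresh array.
        int count = Physics.OverlapSphereNonAlloc(
            transform.position, 10f, _buffer);
        for (int i = 0; i < count; i++)
            _results.Add(_buffer[i]);
    }
}
```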

Part 2 : Graphics Optimization

6. The real cost behind draw calls

People often say “reduce draw calls”, but in modern mobile games the bigger cost is often the render-state setup before each draw call. SetPass Calls caused by shader switches are frequently the dominant CPU cost.

Understanding GPU architecture also explains why tiny meshes are inefficient.

GPU renders one large high-vertex mesh much faster than many tiny meshes. Execution units such as Warp (NVIDIA) or Wavefront (AMD) run fixed-size thread groups. If a unit can process 256 vertices but receives only 128, half of its capacity is wasted.

In many cases, performance loss is not weak GPU compute power but inefficient GPU utilization.


7. Batching strategy comparison

Unity provides four major batching methods. Choose based on project characteristics.


SRP Batching (URP / HDRP)

The SRP Batcher targets the fact that render-state setup right before a draw can cost more than the draw command itself. It groups objects that use the same shader variant and keeps their material data resident on the GPU, so many draw calls share a single SetPass Call.

  • Key point: optimization naturally improves when you reduce shader variety in the project
  • Enabling SRP Batching and minimizing shader count is usually the highest-impact strategy


Static Batching

It combines non-moving meshes at build time and sends one large mesh to GPU.

  • Pros: no runtime merge overhead (baked at build time)
  • Cons: higher memory usage due to merged mesh data


Dynamic Batching

It merges small meshes on CPU every frame and sends merged data to GPU.

  • Generally not recommended. GPU-side can improve, but CPU merge cost each frame can hurt overall performance.


GPU Instancing

When drawing the same mesh many times, upload mesh data to GPU once and render repeatedly with different per-instance data.

  • Effective for many repeated meshes (trees, grass, crowds)
  • Efficiency drops for meshes with around 256 vertices or fewer
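A sketch of the API side (the material must have Enable GPU Instancing checked; field names and counts are illustrative):

```csharp
using UnityEngine;

public class GrassInstancer : MonoBehaviour
{
    public Mesh grassMesh;          // assigned in the Inspector
    public Material grassMaterial;  // must have enableInstancing = true

    Matrix4x4[] _matrices;

    void Start()
    {
        // One transform matrix per instance (max 1023 per call).
        _matrices = new Matrix4x4[500];
        for (int i = 0; i < _matrices.Length; i++)
        {
            var pos = new Vector3(Random.Range(-50f, 50f), 0f,
                                  Random.Range(-50f, 50f));
            _matrices[i] = Matrix4x4.TRS(
                pos, Quaternion.identity, Vector3.one);
        }
    }

    void Update()
    {
        // Mesh data is uploaded once; only per-instance
        // matrices vary between the 500 draws.
        Graphics.DrawMeshInstanced(grassMesh, 0, grassMaterial, _matrices);
    }
}
```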


Batching strategy summary

| Method | CPU Cost | GPU Efficiency | Memory | Recommendation |
| --- | --- | --- | --- | --- |
| SRP Batching | Low | High | No major change | High |
| Static Batching | None at runtime | High | Increased | High |
| GPU Instancing | Low | High | Slight increase | Medium |
| Dynamic Batching | High | Medium | No major change | Low |


Before - SRP Batcher broken
// Accessing renderer.material creates a per-renderer material
// instance, breaking the SRP Batcher -> SetPass Calls increase!
public class EnemyFlash : MonoBehaviour
{
    Renderer _renderer;

    void Start()
        => _renderer = GetComponent<Renderer>();

    public void OnHit()
    {
        // BAD: Creates a new material instance
        _renderer.material.SetFloat("_FlashAmount", 1f);
    }
}
After - Use MaterialPropertyBlock
// MaterialPropertyBlock keeps shared material,
// only changes per-instance values -> SRP Batcher preserved!
public class EnemyFlash : MonoBehaviour
{
    Renderer _renderer;
    MaterialPropertyBlock _mpb;

    static readonly int FlashAmount
        = Shader.PropertyToID("_FlashAmount");

    void Start()
    {
        _renderer = GetComponent<Renderer>();
        _mpb = new MaterialPropertyBlock();
    }

    public void OnHit()
    {
        // GOOD: Keep SRP Batcher
        _mpb.SetFloat(FlashAmount, 1f);
        _renderer.SetPropertyBlock(_mpb);
    }
}

Set a target of under 300 SetPass Calls. The Frame Debugger shows why draw calls are not batched, and you can use that information to drive shader consolidation.


8. Diagnosing GPU render bottlenecks

If GPU bottleneck is suspected, use Xcode GPU Frame Capture to inspect per-stage render cost. In the command timeline, find unusually expensive draws, identify the shader/mesh behind them, and optimize those targets.


Part 3 : Asset Optimization

9. Addressable & AssetBundle optimization

The most critical issue when using Addressables is duplicate dependencies.

If two assets in different groups reference the same dependency (for example, shader or texture), that dependency can be included twice in separate bundles and loaded into memory twice.

flowchart LR
    subgraph Before["Duplicate dependency problem"]
        A1["Asset Group A"] -->|Reference| S1["Shader X (Copy 1)"]
        B1["Asset Group B"] -->|Reference| S2["Shader X (Copy 2)"]
    end

    subgraph After["Solution: move to dedicated group"]
        A2["Asset Group A"] -->|Reference| S3["Shader Group"]
        B2["Asset Group B"] -->|Reference| S3
    end

Solution: separate duplicate dependencies (especially shaders) into dedicated groups. Addressables Analyze can detect duplicate dependencies automatically.


AssetBundle size balance

Bundles that are too small or too large both cause problems.

| Situation | Problem |
| --- | --- |
| Bundle too small | More bundle objects increase memory usage; more WebRequest/file IO means more CPU time and thermal load; LZ4 partial-load benefits weaken |
| Bundle too large | Harder to unload; the whole bundle may be loaded even if only part of it is needed |


Additional optimization tips

| Item | Description |
| --- | --- |
| If AssetReference is not used | Uncheck Include GUIDs in Catalog -> smaller catalog |
| Catalog format | Use Binary instead of JSON -> faster parsing and a basic layer of obfuscation |
| Max Concurrent Web Requests | Mobile handles fewer concurrent requests well, so reduce from the default of 500 |
| CRC check | When enabled, bundle integrity is verified (tamper detection) |


10. Shader variant optimization

Shader variants are often overlooked in mobile optimization, but impact is large. If one shader uses many keywords, each keyword combination creates a separate variant. If you also support multiple graphics APIs (OpenGL ES, Vulkan, etc.), variant count grows multiplicatively.

Every shader variant can trigger SetPass Calls. Reducing variant count directly helps draw-call-side performance.


Variant optimization checklist

| Item | Method |
| --- | --- |
| Remove unnecessary keywords | Merge shaders with similar roles and disable unused keywords |
| Addressable shader group | Without a dedicated shader group, duplicate variants are included in multiple bundles |
| Lightmap mode cleanup | Disable unused Lightmap Modes to explicitly strip related keywords |
| Graphics API cleanup | Disable unused APIs -> prevent per-API variant multiplication |
| URP strip settings | Enable shader stripping options in URP settings |
| Code stripping | Adjust Managed Stripping Level to remove unused code and related keywords |


Use Project Auditor

Project Auditor is Unity’s static analysis tool for assets, project settings, and scripts. It is especially useful for reducing shader variants.

A practical elimination workflow:

  1. Clear previous build cache
  2. Enable Project Settings > Graphics > Log Shader Compilation
  3. Build with Development Build enabled
  4. Check compiled variant list in Project Auditor
  5. Identify unnecessary variants and clean related keywords


Be careful with materials not included in the player build. Keywords declared by shader_feature are stripped if no build-included material uses them. But material references from Addressable bundles can change strip decisions at build time, so consider custom strip scripts using IPreprocessShaders.
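A minimal sketch of such a strip script, assuming you have verified a particular keyword is unused in your project (the keyword name here is purely illustrative):

```csharp
using System.Collections.Generic;
using UnityEditor.Build;
using UnityEditor.Rendering;
using UnityEngine;
using UnityEngine.Rendering;

// Editor-only: called for every shader snippet at build time.
class StripUnusedKeywords : IPreprocessShaders
{
    // Illustrative: a keyword you have confirmed is unused.
    static readonly ShaderKeyword k_Unused =
        new ShaderKeyword("FOG_EXP2");

    public int callbackOrder => 0;

    public void OnProcessShader(Shader shader,
        ShaderSnippetData snippet, IList<ShaderCompilerData> data)
    {
        // Iterate backwards so RemoveAt does not shift
        // indices we have not visited yet.
        for (int i = data.Count - 1; i >= 0; i--)
        {
            if (data[i].shaderKeywordSet.IsEnabled(k_Unused))
                data.RemoveAt(i);  // this variant is never compiled
        }
    }
}
```

Place the file under an Editor folder; any variant removed here is excluded from the build entirely, so verify with the Log Shader Compilation workflow above before stripping.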


Part 4 : Understanding Memory Architecture

11. iOS memory architecture

To optimize mobile memory correctly, you need OS-level understanding of memory management. This section explains with iOS examples, but core concepts are similar on Android.

Physical memory vs virtual memory

Apps do not allocate directly in physical RAM. Allocations are made in virtual memory (VM), and VM pages (4KB or 16KB) are mapped into physical memory.

Why this matters: it is common to allocate 1.78GB in VM while actual physical usage is around 380MB. High VM size alone is not automatically a problem. What matters most is physical memory usage.


Dirty vs Clean memory

iOS classifies memory pages into three groups. This classification is central to optimization.

| Type | Contents | Examples | Physical residency |
| --- | --- | --- | --- |
| Dirty | Dynamically allocated data, modified framework pages, Metal API resources | Heap objects, textures | High |
| Dirty Compressed | Rarely accessed dirty pages, compressed by the OS | Old caches | Medium |
| Clean | Mapped files, read-only frameworks, app binaries (static code) | .dylib, executable code | Low |

flowchart TB
    subgraph Footprint["Memory Footprint (actual app usage)"]
        D["Dirty memory<br/>dynamic allocations, Metal resources"]
        DC["Dirty Compressed<br/>compressed inactive pages"]
    end

    subgraph NonFootprint["Outside footprint"]
        C["Clean memory<br/>binary/read-only data<br/>(can be evicted from physical memory)"]
    end

Memory Footprint = Dirty + Dirty Compressed. This is what the app actually occupies. If this exceeds iOS limits, the app is killed (OOM Kill).

Dirty memory is the top optimization priority. Dirty pages must remain in physical memory, like a guaranteed minimum cost. Reducing dynamic allocations (including GC allocs) directly reduces Dirty memory.


12. Unity memory architecture

Unity is a C++ engine running a .NET VM. Core systems are written in C++, while gameplay code is written in C#. So loading one asset can allocate memory in both C++ native memory and C# managed memory.

flowchart TB
    subgraph VM["Virtual Memory Region"]
        direction LR
        subgraph Native["Native (C++)"]
            N1["Asset data"]
            N2["Engine internal objects"]
        end
        subgraph Graphics["Graphics"]
            G1["Metal/Vulkan<br/>GPU resources"]
        end
        subgraph Managed["Managed (C#)"]
            MA["Managed heap<br/>(dynamic allocations)"]
            MS["Scripting stack<br/>(local variables)"]
            MV["VM memory<br/>(generics, reflection)"]
        end
        subgraph Other["Other"]
            O1["Binary (Clean)"]
            O2["Native plugin"]
        end
    end

| Area | Dirty/Clean | Description |
| --- | --- | --- |
| Native (C++) | Dirty | Asset data, engine internal objects |
| Graphics | Dirty | GPU allocations via Metal/Vulkan |
| Managed (C#) | Dirty | Heap objects, stacks, VM memory |
| Executable/Mapped | Clean | Binaries, DLLs (evictable) |
| Native Plugin | Mixed | Plugin binaries are Clean, runtime allocations are Dirty |


Managed memory deep dive

Understanding C# GC behavior helps prevent memory fragmentation.

Unity GC allocator generally works like this:

  1. Reserve memory pools (regions) and split them into blocks grouped by similar object sizes
  2. Allocate new objects into existing blocks
  3. If the object does not fit in any existing block -> create a new block and allocate there
  4. If there is still no space -> trigger GC -> if still not enough -> expand the heap


Incremental GC is recommended. Instead of stopping the world for one full collection, it splits collection work across multiple frames, which reduces frame spikes.
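With Incremental GC enabled (Project Settings > Player > Use incremental GC), collection can also be driven manually during moments with frame-time headroom, using the `GarbageCollector` API in `UnityEngine.Scripting`. A sketch (the loading-screen context is illustrative):

```csharp
using UnityEngine;
using UnityEngine.Scripting;

public class LoadingScreenGC : MonoBehaviour
{
    // During a loading screen there is frame-time headroom,
    // so spend a slice of it on incremental collection.
    void Update()
    {
        // Run incremental GC work for up to ~3 ms this frame.
        // Returns true while more collection work remains.
        GarbageCollector.CollectIncremental(3_000_000UL); // nanoseconds
    }
}
```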

If Empty Heap Size is large, it is a sign of serious fragmentation. That means extra CPU overhead during allocation and larger unnecessary memory occupation.


VM memory cautions

VM memory (generics, type metadata, reflection) tends to grow continuously during runtime.

Ways to reduce it:

| Method | Description |
| --- | --- |
| Minimize reflection | Reflection creates type metadata at runtime |
| Code stripping | Engine code stripping plus Managed Stripping Level tuning |
| Generic sharing | Available from Unity 2022; shares code across generic instantiations |

If code stripping is enabled while reflection-based code exists, runtime crashes may occur. Preserve required types explicitly in link.xml.
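A minimal `link.xml` sketch preserving a reflection-accessed type (assembly and type names are illustrative; place the file anywhere under Assets):

```xml
<linker>
  <!-- Keep an entire assembly that is accessed via reflection -->
  <assembly fullname="MyGame.Runtime">
    <!-- Or preserve just the types you need -->
    <type fullname="MyGame.Runtime.SaveDataMigrator" preserve="all"/>
  </assembly>
</linker>
```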


Part 5 : Using profiling tools

13. Unity Memory Profiler 1.1

Unity Memory Profiler is a snapshot-based memory analysis tool. Key tabs:

Allocated Memory Distribution

| Category | Description |
| --- | --- |
| Native | Allocations from C++ native code |
| Graphics | GPU allocations from Metal/Vulkan |
| Managed | C# managed heap |
| Executable & Mapped | Clean memory (binaries, DLLs) |
| Untracked | Allocations Unity could not classify (plugins, etc.) |

A large Untracked value is not always a problem. For example, MALLOC_NANO may show 500MB allocated but only 3.3MB resident. Reserved heap space and actual usage are different.

Unity Objects tab

Shows three memory dimensions per object: Native Size, Managed Size, Graphics Size. This quickly reveals which assets consume the most memory.

Memory Map (hidden feature)

You cannot see concrete object names, but you can inspect which frameworks/binaries occupy memory at a high level.

Memory Profiler is a snapshot tool, so it is hard to answer “when and why this allocation happened.” For call-stack-level tracing, pair with native profilers such as Xcode Instruments.


14. Xcode Instruments

iOS deep memory analysis requires Xcode Instruments.

Prerequisite: include debug symbols in Xcode Build Settings.

Main metrics to inspect

| Metric | Description |
| --- | --- |
| Resident | Size actually resident in physical memory |
| Dirty Size | Dirty pages within virtual memory allocations |
| Swapped | Swapped-out memory |

Category mapping

| Instruments Category | Unity mapping |
| --- | --- |
| “GPU” | Unity GPU processing (Graphics memory) |
| App Allocations | Unity CPU-side allocations (Native + Managed) |
| IOSurface | 100% residency ratio -> must stay in physical memory |
| Binaries / Code | Clean memory |

If IOSurface residency is 100%, that memory is fully resident in physical memory. If it exceeds physical limits, the app is terminated.

Memory Graph is a native memory snapshot tool that visualizes object references.


Part 6 : Practical troubleshooting

15. Memory crash debugging flow

When the app crashes, first determine whether it is a memory issue or a different error.

flowchart TD
    A["App crash occurs"] --> B["Reproduce crash while playing<br/>with Xcode debugger attached"]
    B --> C{"Root cause?"}
    C -->|Out of memory| D["Take Memory Profiler snapshot<br/>(right before crash)"]
    C -->|Code error| E["Analyze call stack -> fix bug"]
    D --> F["Sort by Total Committed in<br/>Unity Objects / Summaries tabs"]
    F --> G["Identify top memory-consuming<br/>areas in order"]
    G --> H["Optimize corresponding asset/system"]

Key inspection sequence

  1. Confirm crash type: with Xcode debugger attached, determine memory crash vs code error
  2. If memory issue: in Memory Profiler, sort and investigate the largest Total Committed regions first
  3. Check texture Read/Write: when enabled, CPU-side copy is also kept -> disable unless strictly necessary
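For step 3, an Editor script can audit the whole project for textures with Read/Write enabled. A sketch (the menu path and class name are illustrative):

```csharp
using UnityEditor;
using UnityEngine;

public static class ReadableTextureAudit
{
    [MenuItem("Tools/Audit Readable Textures")]  // illustrative path
    static void Run()
    {
        foreach (string guid in AssetDatabase.FindAssets("t:Texture2D"))
        {
            string path = AssetDatabase.GUIDToAssetPath(guid);
            var importer = AssetImporter.GetAtPath(path) as TextureImporter;

            // isReadable == true keeps a CPU-side copy of the
            // texture in memory, roughly doubling its cost.
            if (importer != null && importer.isReadable)
                Debug.LogWarning($"Readable texture: {path}");
        }
    }
}
```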

Mobile uses Unified Memory. CPU and GPU share the same physical memory, so GPU usage directly affects total memory budget. This differs from desktop GPUs with dedicated VRAM.


Conclusion

There is no silver bullet in optimization. Finding exact bottlenecks through profiling and making data-driven decisions is the only reliable approach.

Summary of the key points:

| Area | Core strategy |
| --- | --- |
| CPU bottlenecks | Identify per-thread bottlenecks in the timeline; minimize GC allocs |
| Graphics | Prioritize SRP Batching; reduce shader variety; target under 300 SetPass Calls |
| Assets | Resolve Addressable duplicate dependencies; isolate shader groups; balance bundle size |
| Shaders | Analyze variants with Project Auditor; remove unnecessary keywords/APIs |
| Memory | Dirty memory is the top optimization target; enable Incremental GC |
| Tools | Use Unity Memory Profiler together with Xcode Instruments |

Most importantly, optimization starts with measurement, not intuition. Optimizing without profiler data is like driving with your eyes closed. Start with profiler data and end with profiler data.

This post is licensed under CC BY 4.0 by the author.