
RFC: Implement the pass that unifies the encodings for each global.

hanhanW opened this issue 1 month ago • 4 comments

UnifyEncodingForGlobals Pass - Implementation Plan

Overview

This pass unifies multiple encoded versions of the same immutable global to reduce memory footprint in IREE's Stream dialect. When a single source global (e.g., a model weight) is encoded multiple times with different encodings for different uses, this pass selects a unified encoding and updates all references.

Initial Prototype: https://github.com/hanhanW/iree/blob/hanhan-prototype-globals-with-multi-encoding-snapshot-20251022/compiler/src/iree/compiler/Dialect/Stream/Transforms/UnifyEncodingForGlobals.cpp

UnifyEncodingForGlobals Pass - Design Rationale

This section explains the design rationale for the UnifyEncodingForGlobals pass and why it meets IREE's bar as retargetable compiler infrastructure.


1. Problem Statement

When the same source data (e.g., model weights) is used in multiple dispatch operations with different encoding requirements, IREE currently creates multiple encoded copies of the same data. For example, in LLaMA models:

// Same weight encoded twice with different iteration_sizes
%enc1 = stream.tensor.encode %weight -> tensor<4096x4096xf32, #encoding<iteration_sizes=[?, 4096, 4096]>>
%enc2 = stream.tensor.encode %weight -> tensor<4096x4096xf32, #encoding<iteration_sizes=[4, 4096, 4096]>>

This wastes memory: if the two encodings resolve to different layouts, execution ends up holding two encoded globals for the same underlying data.

Goal: Unify multiple encodings of the same immutable source into a single encoding, reducing memory footprint.


2. Why Immutability Enables This Optimization

The foundation of this optimization rests on immutable globals:

  1. Immutability Guarantee: IREE's VerifyInitializationOrderPass ensures immutable globals are initialized exactly once, only in initializers or initializer-only functions.

  2. Deterministic Data: Since immutable globals never change after initialization, any computation derived purely from immutable globals produces deterministic results.

  3. Safe Unification: If two stream.tensor.encode ops consume the same immutable source, we can safely use a unified encoding for both; the underlying data is guaranteed to be identical, which avoids memory footprint bloat.

This is why the pass strictly requires both source AND encoded globals to be immutable.
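
As a rough sketch of this gate (hypothetical helper and attribute names; the linked prototype is the authoritative implementation), the pass only considers an encode op when both the traced source global and the global holding the encoded result can be proven immutable:

// Sketch only. Assumes mutability is surfaced as a unit attribute on the
// global op (the name "is_mutable" is an assumption) and that traceToGlobal
// is a hypothetical helper that walks util.global.load chains back to the
// defining global.
#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"

static mlir::Operation *traceToGlobal(mlir::Value value);  // hypothetical

static bool isImmutableGlobal(mlir::Operation *globalOp) {
  return globalOp && !globalOp->hasAttr("is_mutable");
}

static bool isUnificationCandidate(mlir::Operation *encodeOp,
                                   mlir::Operation *encodedGlobal) {
  // The data being encoded must come from an immutable global...
  mlir::Operation *sourceGlobal = traceToGlobal(encodeOp->getOperand(0));
  // ...and the global that stores the encoded result must be immutable too.
  return isImmutableGlobal(sourceGlobal) && isImmutableGlobal(encodedGlobal);
}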


3. Hardware-Agnostic Encoding Selection via Resolver

3.1 The Resolver Architecture

IREE uses an encoding resolver infrastructure that is central to retargetability:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  This Pass      │────▶│  Layout Resolver │────▶│  Backend-       │
│  (identifies    │     │  Interface       │     │  Specific       │
│   candidates)   │     │                  │     │  Implementation │
└─────────────────┘     └──────────────────┘     └─────────────────┘
  • Each backend registers its own resolver via AffinityAnalysisDialectInterface
  • Resolver knows device capabilities (GPU tile sizes, cache lines, vector widths)
  • Resolver returns optimal encoding for the target device
  • This pass queries the resolver; it doesn't hard-code encoding decisions (see the sketch below)
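
As a minimal sketch of the query step (the resolver interface and its registration are elided behind a hypothetical queryResolvedEncoding helper; see the prototype for the real interface names):

// Sketch only: queryResolvedEncoding stands in for the backend-registered
// resolver; the pass itself never computes a layout.
#include "mlir/IR/Attributes.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/ArrayRef.h"
#include <optional>

// Hypothetical wrapper around the per-backend resolver interface.
static std::optional<mlir::Attribute>
queryResolvedEncoding(mlir::Operation *encodeOp);

static std::optional<mlir::Attribute>
pickUnifiedEncoding(llvm::ArrayRef<mlir::Operation *> encodeOps) {
  std::optional<mlir::Attribute> unified;
  for (mlir::Operation *op : encodeOps) {
    std::optional<mlir::Attribute> resolved = queryResolvedEncoding(op);
    if (!resolved)
      return std::nullopt;  // No resolver answer: fall back (see 3.3).
    if (unified && *unified != *resolved)
      return std::nullopt;  // Resolvers disagree: fall back (see 3.3).
    unified = resolved;
  }
  return unified;
}

The important property is that every piece of backend knowledge stays behind the resolver call.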

3.2 Why This Matters for Retargetability

The pass contains zero hardware-specific logic:

| What the pass does | What the pass does NOT do |
| --- | --- |
| Identifies unification candidates | Choose specific tile sizes |
| Queries resolver for unified encoding | Assume GPU vs CPU layout |
| Updates IR with resolver's answer | Hard-code any encoding format |

This means the same pass works for:

  • CPU with AVX-512 (8-wide vectors)
  • GPU with different warp/wavefront sizes
  • Custom accelerators with unique memory layouts
  • Future backends not yet implemented

3.3 Identity Encoding as Conservative Fallback

When no resolver is available or resolvers disagree:

auto unifiedEncoding = IREE::Encoding::IdentityAttr::get(context);

Why identity is always safe:

  • Identity means "no specialized layout transformation"
  • Every backend must handle identity encoding
  • Later passes (e.g., SpecializeEncodings) can re-encode if beneficial; this is not implemented yet and may belong in other passes
  • Correctness is preserved; only the optimization opportunity is deferred (see the sketch below)
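
Combined with the resolver query sketched in 3.1, the fallback reduces to a few lines (a sketch; pickUnifiedEncoding is the hypothetical helper from that sketch, and the IREE Encoding dialect and MLIRContext headers are assumed to be included):

// If the resolvers cannot agree on (or produce) a unified encoding, fall back
// to the identity encoding, which every backend must accept.
static mlir::Attribute
chooseUnifiedEncoding(llvm::ArrayRef<mlir::Operation *> encodeOps,
                      mlir::MLIRContext *context) {
  if (std::optional<mlir::Attribute> unified = pickUnifiedEncoding(encodeOps))
    return *unified;
  // Conservative: no specialized layout; correctness preserved, optimization
  // deferred.
  return IREE::Encoding::IdentityAttr::get(context);
}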

4. Black-Box Principle and Subgraph Equivalence

4.1 The Black-Box Principle

All ops in the trace are treated as black boxes:

  • Identity = (op_name, attributes, operand_subgraphs); see the key-computation sketch after these lists
  • Same identity → same output (referential transparency)
  • No need to look inside function bodies, executables, or regions

This applies uniformly to:

  • util.call @func(...) - function is black box, arguments are inputs
  • stream.tensor.dispatch @exec(...) - executable is black box, operands are inputs
  • tensor.extract_slice - op with attributes is black box
  • scf.if / scf.for - captured values are implicit inputs
  • Any other deterministic op
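
One way to realize this identity, sketched here with a hypothetical predicate rather than the prototype's actual code, is a recursive canonical key over (op name, attributes, operand keys); returning no key at all is how the analysis expresses "cannot prove equivalence" (see the bail-out conditions in 4.4):

// Sketch of a black-box subgraph key. Returning std::nullopt means "cannot
// prove equivalence", which makes the surrounding analysis bail out.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/Hashing.h"
#include <optional>

// Hypothetical predicate: true for util.global.load of an immutable global,
// constants, and other ops the analysis knows are deterministic.
static bool isKnownDeterministic(mlir::Operation *op);

static std::optional<llvm::hash_code> subgraphKey(mlir::Value value) {
  mlir::Operation *op = value.getDefiningOp();
  if (!op)
    return std::nullopt;  // Block arguments are not handled in this sketch.
  if (!isKnownDeterministic(op))
    return std::nullopt;  // Unknown or side-effecting op: bail out.
  // Identity = (op name, attributes, operand subgraphs).
  llvm::hash_code key = llvm::hash_combine(op->getName().getStringRef(),
                                           op->getAttrDictionary());
  for (mlir::Value operand : op->getOperands()) {
    std::optional<llvm::hash_code> operandKey = subgraphKey(operand);
    if (!operandKey)
      return std::nullopt;
    key = llvm::hash_combine(key, *operandKey);
  }
  return key;
}

Because hash keys can collide, a real implementation would confirm matches with a structural comparison; the sketch only illustrates the black-box recursion.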

4.2 Subgraph Equivalence

Two stream.tensor.encode ops encode the same data if their source operand subgraphs are equivalent:

  • Same leaf sources: Both trace to the same immutable global (by name) or constant (by value)
  • Same intermediate ops: Same op types with same attributes along the path
  • Same structure: The DAG of operations is isomorphic

Example of equivalent subgraphs (unifiable):

// Path 1
%g1 = util.global.load @weights  // immutable
%s1 = tensor.extract_slice %g1[0,0][50,100][1,1]
%enc1 = stream.tensor.encode %s1 with encoding1

// Path 2 - equivalent subgraph, different encoding
%g2 = util.global.load @weights  // same immutable global
%s2 = tensor.extract_slice %g2[0,0][50,100][1,1]  // same slice params
%enc2 = stream.tensor.encode %s2 with encoding2

Both trace to the same immutable global through the same operations → valid for unification.

4.3 Control Flow as Black Boxes

For ops with regions (scf.if, scf.for), captured values act as implicit arguments:

%weights = util.global.load @weights
%cond = ...
%result = scf.if %cond -> tensor<...> {
  scf.yield %weights  // %weights captured from outside
} else {
  scf.yield %weights
}

Canonical form:

scf.if(
  cond: <cond_subgraph>,
  captured_inputs: [<weights_subgraph>],
  then_region: { yield input[0] },
  else_region: { yield input[0] }
)

Two scf.if results are equivalent if:

  1. Same condition subgraph
  2. Same captured input subgraphs
  3. Same region structure (how inputs are used)

This unifies the model: functions/dispatches are black boxes with explicit arguments; control flow ops are black boxes with captured arguments plus region structure.

Why this works: in initializers, all inputs trace to immutable sources, and deterministic ops with the same inputs produce the same outputs.

4.4 Conservative Bail-Out

The analysis bails when equivalence cannot be proven:

| Condition | Rationale |
| --- | --- |
| Mutable global in path | Value may differ between uses |
| Unknown/unhandled op | Cannot prove semantic equivalence |
| Non-equivalent subgraphs | Different computation paths |

Principle: Better to miss an optimization than to introduce incorrectness.


5. Pass Ordering and Integration

5.1 Pipeline Position

CombineInitializersPass     ← Merges initializer blocks
        ↓
UnifyEncodingForGlobalsPass ← THIS PASS
        ↓
... (cleanup pipeline)    ← Later passes can deduplicate identical globals
// Reference: compiler/src/iree/compiler/Dialect/Stream/Transforms/Passes.cpp
passManager.addPass(IREE::Util::createCombineInitializersPass());
if (clUnifyEncodingForGlobals) {
  passManager.addPass(IREE::Stream::createUnifyEncodingForGlobalsPass());
}
// ... followed by cleanup pipeline

Why this position:

  • After CombineInitializers: All initializer code is in one place, simplifies analysis
  • Before cleanup: Later passes can deduplicate globals that now have identical encodings

6. Complete IR Update Chain

When unifying encodings, ALL related ops must be updated consistently:

// BEFORE
%size = stream.tensor.sizeof tensor<4096x4096xf32, #old_encoding>
%enc = stream.tensor.encode %source -> tensor<4096x4096xf32, #old_encoding> in !stream.resource<*>{%size}

// AFTER
%size = stream.tensor.sizeof tensor<4096x4096xf32, #unified_encoding>
%enc = stream.tensor.encode %source -> tensor<4096x4096xf32, #unified_encoding> in !stream.resource<*>{%size}

Both stream.tensor.sizeof and stream.tensor.encode must use the same encoding; otherwise the size calculation would be wrong.
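
A sketch of the corresponding rewrite, assuming the encoded tensor type is carried as a TypeAttr on these ops; the attribute names would come from the ops' generated accessors in real code and are passed in here purely for illustration:

// Rebuild the tensor type with the unified encoding and write it back, so the
// size computation and the encode op stay consistent.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/StringRef.h"

static mlir::RankedTensorType withEncoding(mlir::RankedTensorType type,
                                           mlir::Attribute unifiedEncoding) {
  return mlir::RankedTensorType::get(type.getShape(), type.getElementType(),
                                     unifiedEncoding);
}

static void updateEncodedType(mlir::Operation *op, llvm::StringRef attrName,
                              mlir::Attribute unifiedEncoding) {
  // attrName is e.g. the sizeof op's encoding attribute or the encode op's
  // result encoding attribute (names assumed for illustration).
  auto typeAttr = op->getAttrOfType<mlir::TypeAttr>(attrName);
  auto tensorType = mlir::cast<mlir::RankedTensorType>(typeAttr.getValue());
  op->setAttr(attrName,
              mlir::TypeAttr::get(withEncoding(tensorType, unifiedEncoding)));
}

The pass would invoke this once for the stream.tensor.sizeof op and once for the stream.tensor.encode op feeding the same global, with the same unified encoding attribute.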


7. Summary: Why This Meets IREE's Bar

| Criterion | How Met |
| --- | --- |
| Correctness | Conservative bail-out on any ambiguity; identity encoding always safe |
| Retargetability | Zero hardware-specific logic; delegates to resolver infrastructure |
| Safety | Requires immutability for both source and encoded globals |
| Testability | Comprehensive positive AND negative test cases |
| Maintainability | Clean analysis/transform separation; follows existing patterns |
| Extensibility | Architecture ready for resolver integration and subgraph canonicalization |

The pass prioritizes correctness over optimization: it will never introduce incorrect behavior, even if that means missing some optimization opportunities in complex cases.

hanhanW avatar Oct 30 '25 19:10 hanhanW

I was brainstorming with Claude and then realized that DFX is not needed in this case. We only need a simple traversal.

Collapsed into issue description above

hanhanW avatar Nov 25 '25 15:11 hanhanW

Many thanks to @benvanik's "Rewriting CombineInitializersPass" work. It made many things clearer to me.

hanhanW avatar Nov 25 '25 15:11 hanhanW

What does "Mutable global in path" mean here?

Does that only refer to the encoded globals and the source global?

I was wondering about this in the context of subgraph equivalence: the RFC treats functions as black boxes, so two calls of the same function with the same operands and attributes will be treated as equivalent subgraphs, IIUC.

However, in the presence of any mutable global, which doesn't even have to be part of the chains between the source and encoded globals, the same function with the same operands and attributes could produce different outputs.

How would this be handled?

sommerlukas avatar Nov 26 '25 12:11 sommerlukas

What does "Mutable global in path" mean here?

It refers to the source global. IMO, we should mark functions as immutable or mutable in the analysis. We bail out if we reach any mutable globals or functions.

hanhanW avatar Dec 08 '25 15:12 hanhanW