Looking at the performance profile of my AoC day 3 solution
WIP: do NOT merge
I have set up a small benchmark to look at how Roc performs compared to Python. The details of this benchmark are provided in this message on Zulip: #advent of code > 2025 Day 3.
Out of curiosity, I've looked into using Tracy to get a better picture of the bottlenecks here, and it's clearly still memory: allocation/deallocation consumes roughly 40% of the budget. Tracy is pretty cool!
In the next comment below, I'll add the analysis stage I'm at, with the help of Claude.
Roc Interpreter Performance Analysis
Date: December 16, 2025 (Updated)
Benchmark: bench_fold_append.roc with n=1000
Profiler: Tracy
Executive Summary
The Roc interpreter shows significant performance overhead compared to Python for a simple fold+append benchmark. Through systematic profiling and optimization, we've identified and addressed the major bottlenecks:
Completed Optimizations
- ✅ Method resolution: ~~24%~~ → 6% after caching (74% reduction)
- ✅ Memory management: ~~393,465 allocations~~ → 327,492 after tuple scratch buffers (16.8% reduction)
- ✅ HashMap initialization: Pre-allocated hot caches to avoid repeated small reallocations
Remaining Bottlenecks
- Type system operations: 578,522 type variables created for 1,000 elements
- Memory management: Still ~327K allocations (mostly type store growth)
- Dot access overhead: 22% handling method calls on values
The actual list operations (callLowLevelBuiltin) account for only 0.76% of runtime.
Optimization Progress
✅ Optimization 1: Method Closure Caching
Problem: resolveMethodFunction was called 102,797 times, re-evaluating method closures on every call even for the same method on the same type.
Solution: Added evaluated_method_cache in interpreter.zig that caches evaluated method closures by (origin_module, nominal_ident, method_name_ident, receiver_layout_idx).
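As a rough illustration of the caching scheme (sketched here in Python for brevity; the real cache lives in interpreter.zig, and the class and function names below are made up for the example), the idea is a plain map from the key tuple to the already-evaluated closure:

```python
# Illustrative sketch only; the real cache is in interpreter.zig.
# The key mirrors the tuple described above.
class MethodClosureCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def resolve(self, origin_module, nominal_ident, method_name_ident,
                receiver_layout_idx, evaluate_method_closure):
        """Return the evaluated method closure, evaluating it at most once per key."""
        key = (origin_module, nominal_ident, method_name_ident, receiver_layout_idx)
        closure = self._cache.get(key)
        if closure is not None:
            self.hits += 1
            return closure
        # Cache miss: take the slow path that evaluates the method closure,
        # then remember the result for subsequent calls with the same key.
        self.misses += 1
        closure = evaluate_method_closure()
        self._cache[key] = closure
        return closure
```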
Results:
| Metric | Before | After | Improvement |
|---|---|---|---|
| `resolveMethodFunction` % | 23.6% | 6.14% | 74% reduction |
| Mean time per call | 23,755 ns | 5,152 ns | 4.6x faster |
| Cache hit rate | N/A | 88% | 89,232 hits / 101,277 calls |
| `resolveMethod.evalWithExpectedType` calls | 102,763 | 12,045 | 88% reduction |
Code location: src/eval/interpreter.zig lines 206-226 (cache types), 7913-7953 (cache lookup), 8012-8046 (cache population)
✅ Optimization 2: HashMap Pre-allocation
Problem: Hot HashMaps (translate_cache, poly_cache, method caches) started with capacity 0, causing multiple small reallocations during warmup.
Solution: Pre-allocate capacity for frequently-used HashMaps in Interpreter.init():
```zig
try result.translate_cache.ensureTotalCapacity(512);
try result.poly_cache.ensureTotalCapacity(128);
try result.method_resolution_cache.ensureTotalCapacity(64);
try result.evaluated_method_cache.ensureTotalCapacity(64);
try result.flex_type_context.ensureTotalCapacity(256);
try result.rigid_subst.ensureTotalCapacity(128);
try result.instantiate_scratch.ensureTotalCapacity(64);
```
Results:
- Eliminates ~100-200 allocations during warmup
- Reduces HashMap reallocations from ~5-7 per map to 0-1
- Minimal memory overhead (~20KB total)
Code location: src/eval/interpreter.zig lines 555-563
✅ Optimization 3: Tuple Scratch Buffers
Problem: Each tuple finalization allocated 3 temporary arrays (layouts, values, rt_vars), immediately freed after use. For 22,011 tuples, this caused 66,033 allocations.
Solution: Added reusable scratch buffers to the Interpreter struct:
```zig
tuple_scratch_layouts: std.array_list.Managed(Layout),
tuple_scratch_values: std.array_list.Managed(StackValue),
tuple_scratch_rt_vars: std.array_list.Managed(types.Var),
```
Each tuple finalization clears and reuses these buffers instead of allocating:
```zig
self.tuple_scratch_layouts.clearRetainingCapacity();
try self.tuple_scratch_layouts.resize(total_count);
var elem_layouts = self.tuple_scratch_layouts.items;
```
Results:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total allocations | 393,465 | 327,492 | -65,973 (-16.8%) |
| Total frees | 380,984 | 315,008 | -65,976 (-16.8%) |
| Tuple temp arrays | 22,011 | 0 | -22,011 (-100%) |
Code location: src/eval/interpreter.zig lines 385-388 (struct fields), 551-553 (init), 13369-13384 (usage)
Current Profile (After All Optimizations)
Key Metrics
| Metric | Original | After Optimizations | Improvement |
|---|---|---|---|
| Total allocations | 393,465 | 327,492 | -16.8% β |
| Total frees | 380,984 | 315,008 | -16.8% β |
| Method resolution time | 23.6% | 6.1% | -74% β |
Remaining Bottleneck Categories
| Category | % of Total | Key Functions | Status |
|---|---|---|---|
| Memory Management | ~35% | `alloc` (19%), `free` (16%) | 🟡 Reduced from 39% |
| Type System Operations | ~30% | `typesStore.fresh*`, `instantiateType` | 🔴 Investigate |
| Dot Access | ~20% | `cont.dot_access_*` | 🟡 Acceptable |
| Method Resolution | 6.1% | `resolveMethodFunction` | ✅ Optimized |
| Tuple Operations | 3.5% | `cont.tuple_collect` | ✅ Optimized |
Why instantiateType Calls Remain High
The 35,064 instantiateType calls break down as:
- 23,041 calls from `sched.call` - Required for every closure invocation
- ~12,000 calls from `cont.dot_access_collect_args` - Only on cache misses
This is expected behavior: type instantiation is needed for proper polymorphism when invoking functions. The method caching successfully reduced redundant type work from method resolution.
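To make the "expected behavior" concrete, here is a toy model (Python, with a hypothetical encoding that is not the interpreter's actual representation) of why a polymorphic signature such as `append : List a, a -> List a` needs fresh type variables at each call site: unifying `a` with a concrete element type at one call must not constrain `a` at the next.

```python
# Toy model of polymorphic instantiation: every call site gets its own copy of
# the quantified variables so unification at one call cannot leak into another.
import itertools

_fresh_ids = itertools.count()

def fresh_var():
    return ("var", next(_fresh_ids))

def instantiate(type_expr, mapping):
    """Replace each quantified variable (a lowercase name) with a fresh variable."""
    if isinstance(type_expr, str) and type_expr.islower():
        if type_expr not in mapping:
            mapping[type_expr] = fresh_var()
        return mapping[type_expr]
    if isinstance(type_expr, tuple):  # structured type, e.g. ("List", "a")
        return tuple(instantiate(part, mapping) for part in type_expr)
    return type_expr  # already-concrete type

# append : List a, a -> List a, encoded here as (arg1, arg2, return)
append_sig = (("List", "a"), "a", ("List", "a"))
first_call = instantiate(append_sig, {})   # fresh variables for this call
second_call = instantiate(append_sig, {})  # different fresh variables again
```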
Benchmark Description
Roc Code (bench_fold_append.roc)
```roc
benchmark_fold_append = |digits| {
    (_m, max_rest) = digits.fold_rev((0, []), |d, (m, rest)| {
        new_max = if d > m { d } else { m }
        (new_max, rest.append((d, m)))
    })
    max_rest
}
```
Python Equivalent (bench_fold_append.py)
```python
def benchmark_fold_append(digits):
    max_val = 0
    max_rest = []
    for d in reversed(digits):
        new_max = d if d > max_val else max_val
        max_rest.append((d, max_val))
        max_val = new_max
    return max_rest
```
Observed Performance
| Size | Roc (s) | Python (s) | Ratio |
|---|---|---|---|
| 1000 | ~0.5 | ~0.001 | ~500x slower |
Tracy Profiling Results
Top 20 Functions by Time
| Rank | Function | % Time | Calls | Mean (ns) | Notes |
|---|---|---|---|---|---|
| 1 | `applyContinuation` | 73.3% | 716,580 | 10,559 | Main dispatch loop |
| 2 | `alloc` | 25.6% | 563,830 | 4,680 | Memory allocation |
| 3 | `scheduleExprEval` | 25.5% | 625,541 | 4,203 | Expression scheduling |
| 4 | `evalWithExpectedType` | 24.0% | 114,874 | 21,601 | Type-directed eval |
| 5 | `resolveMethodFunction` | 23.6% | 102,797 | 23,755 | Method lookup |
| 6 | `free` | 22.8% | 563,394 | 4,182 | Memory deallocation |
| 7 | `cont.binop_apply` | 16.9% | 68,870 | 25,270 | Binary operators |
| 8 | `sched.call` | 15.2% | 23,041 | 68,154 | Call scheduling |
| 9 | `instantiateType` | 14.0% | 35,064 | 41,311 | Type instantiation |
| 10 | `cont.dot_access_collect_args` | 13.5% | 12,034 | 116,017 | Dot access args |
| 11 | `cont.reassign_value` | 13.1% | 34,955 | 38,670 | Var reassignment |
| 12 | `cont.dot_access_resolve` | 8.7% | 33,927 | 26,505 | Dot access resolution |
| 13 | `cont.tuple_collect` | 8.4% | 66,033 | 13,069 | Tuple construction |
| 14 | `sched.call.anno_check` | 6.0% | 22,034 | 28,236 | Annotation checking |
| 15 | `typesStore.resolveVar` | 4.8% | 3,204,487 | 154 | 3.2M calls! |
| 16 | `evalLookupLocal` | 4.5% | 225,921 | 2,044 | Local var lookup |
| 17 | `cont.call_invoke_closure` | 4.9% | 23,041 | 21,809 | Closure invocation |
| 18 | `unifyGuarded` | 3.6% | 103,270 | 3,566 | Type unification |
| 19 | `cont.bind_decl` | 3.6% | 34,015 | 10,829 | Declaration binding |
| 20 | `translateTypeVar` | 3.6% | 528,012 | 696 | Type translation |
Categorized Time Breakdown
| Category | % of Total | Key Functions |
|---|---|---|
| Memory Management | 48.4% | alloc (25.6%), free (22.8%) |
| Method/Dot Access | 45.8% | resolveMethodFunction (23.6%), cont.dot_access_* (22.2%) |
| Type System | ~25% | instantiateType (14%), unify* (~11%), translateTypeVar (3.6%) |
| Tuple Operations | ~11% | cont.tuple_collect (8.4%), tuple_collect.* (3%) |
| Binary Operators | 16.9% | cont.binop_apply |
| Actual List Ops | 0.63% | callLowLevelBuiltin |
Detailed Analysis
1. Memory Management (48.4%)
Observation: 563,830 allocations and 563,394 frees for processing 1,000 elements.
Per-iteration allocations: ~563 allocations per element processed.
Root causes:
- Every tuple `(new_max, rest.append(...))` allocates
- Every closure invocation may allocate for bindings
- Pattern matching creates copies via `pushCopy`
- Type instantiation allocates new type variables
Evidence:
```
alloc: 563,830 calls, 4,680 ns mean
free: 563,394 calls, 4,182 ns mean
```
2. Method Resolution (23.6%)
Observation: resolveMethodFunction called 102,797 times (103x per iteration).
Problem: Every call to rest.append(...) triggers method resolution:
- Look up the method name "append"
- Find the implementation for the receiver type
- Resolve any polymorphic constraints
Evidence:
```
resolveMethodFunction: 102,797 calls, 23,755 ns mean = 2.44 seconds total
```
Expected: Method resolution should be cached after first lookup for the same type.
3. Type Instantiation (14.0%)
Observation: instantiateType called 35,064 times with 41,311 ns mean.
Problem: Polymorphic functions like append : List a, a -> List a require type instantiation on each call. The interpreter appears to re-instantiate types even for monomorphic call sites.
Evidence:
```
instantiateType: 35,064 calls, 41,311 ns mean = 1.45 seconds total
typesStore.resolveVar: 3,204,487 calls (3.2 million!)
```
4. Dot Access Overhead (22.2%)
Observation: cont.dot_access_collect_args and cont.dot_access_resolve together take 22.2%.
Problem: The dot access path (`rest.append(...)`) involves:
- Evaluate receiver (`rest`)
- Collect arguments
- Resolve method
- Create closure call
This multi-step process is expensive for simple method calls.
Evidence:
```
cont.dot_access_collect_args: 12,034 calls, 116,017 ns mean
cont.dot_access_resolve: 33,927 calls, 26,505 ns mean
```
5. Type System Operations (~25% combined)
Observation: Unification and type operations account for significant overhead.
| Function | Calls | Time % |
|---|---|---|
| `unifyGuarded` | 103,270 | 3.6% |
| `unifyVars` | 58,269 | 2.8% |
| `unifyFlatType` | 57,168 | 2.5% |
| `unifyStructure` | 58,168 | 2.6% |
| `unifyNominalType` | 35,137 | 1.1% |
| `unifyTuple` | 22,000 | 1.0% |
Problem: The interpreter performs runtime type checking/unification that a compiled version would resolve at compile time.
6. The Non-Problem: List Copying
Observation: callLowLevelBuiltin (which handles list_append) is only 0.63%.
Initial hypothesis was wrong: I initially suspected O(nΒ²) list copying due to refcount > 1 preventing in-place mutation. While this may still occur, it's not the dominant cost.
Evidence:
```
callLowLevelBuiltin: 114,797 calls, 568 ns mean = 0.065 seconds
copyToPtr: 403,917 calls, 114 ns mean = 0.046 seconds
```
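For reference, this is the copy-on-write behavior I originally suspected, modeled as a toy Python class (illustrative only; it is not Roc's actual list representation): append mutates in place only when the list is uniquely referenced, otherwise it copies first, which would make a loop of appends O(n²).

```python
# Toy copy-on-write list with a reference count. Append is in place only when
# the list is uniquely owned; a shared list must be copied before mutation.
class RcList:
    def __init__(self, items=None):
        self.items = list(items or [])
        self.refcount = 1

    def append(self, value):
        if self.refcount == 1:
            # Unique owner: mutate in place, amortized O(1) per append.
            self.items.append(value)
            return self
        # Shared: copy first, O(n) per append. If every append in the fold hit
        # this path, the whole loop would be O(n^2) -- the original suspicion.
        self.refcount -= 1
        fresh = RcList(self.items)
        fresh.items.append(value)
        return fresh
```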
Optimization Recommendations
High Priority (>10% impact potential)
1. Cache Method Resolution
Current: resolveMethodFunction called 102,797 times for same method
Target: Cache by (receiver_type, method_name) -> resolved_function
Expected impact: -20% runtime
2. Reduce Allocations
Current: 563K allocations for 1K elements
Target:
- Pool allocators for common sizes
- Stack allocation for small temporaries
- Avoid copying in pattern matching when value is consumed
Expected impact: -30% runtime
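One possible shape for the pool-allocator idea, sketched in Python (the `BufferPool` below is a hypothetical, conceptual helper; a real version would be a Zig allocator or an extension of the scratch-buffer approach from Optimization 3): recycle temporaries of a common size instead of hitting the general-purpose allocator each time.

```python
# Hypothetical free-list pool for fixed-size temporaries (conceptual sketch).
class BufferPool:
    def __init__(self, size):
        self.size = size
        self._free = []  # buffers returned to the pool, ready for reuse

    def acquire(self):
        if self._free:
            return self._free.pop()      # reuse: no allocator call
        return bytearray(self.size)      # first use: allocate once

    def release(self, buf):
        self._free.append(buf)           # recycle instead of freeing

pool = BufferPool(64)
buf = pool.acquire()
# ... use buf as a temporary ...
pool.release(buf)   # the next acquire() reuses this buffer
```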
3. Cache Type Instantiation
Current: instantiateType called 35K times
Target: Cache instantiated types for monomorphic call sites
Expected impact: -10% runtime
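A possible caching shape, sketched in Python with hypothetical helper names: memoize on `(function_type, argument_types)` only when every argument type is fully concrete, and fall back to normal instantiation otherwise.

```python
# Hypothetical memoization wrapper around type instantiation. Only fully
# concrete call sites are cached; polymorphic ones still instantiate freshly.
instantiation_cache = {}

def instantiate_cached(function_type, argument_types, instantiate_fn, is_concrete):
    if all(is_concrete(t) for t in argument_types):
        key = (function_type, tuple(argument_types))
        cached = instantiation_cache.get(key)
        if cached is not None:
            return cached                       # reuse the earlier instantiation
        result = instantiate_fn(function_type, argument_types)
        instantiation_cache[key] = result
        return result
    # Some argument type is still a type variable: cannot safely cache.
    return instantiate_fn(function_type, argument_types)
```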
Medium Priority (5-10% impact potential)
4. Optimize Dot Access Path
Current: Multi-continuation chain for method calls
Target: Fast path for simple method calls
Expected impact: -10% runtime
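One way the fast path could look, as a Python sketch with hypothetical names (the real change would live in the dot-access continuations in interpreter.zig): when the method is already cached and the arguments are simple, call it directly instead of scheduling the collect-args/resolve/invoke continuation chain.

```python
# Hypothetical fast path for simple method calls: skip the generic
# collect-args / resolve / invoke continuation chain when possible.
def is_simple(arg):
    # Placeholder heuristic: already-evaluated scalars need no further scheduling.
    return isinstance(arg, (int, float, bool, str))

def eval_dot_call(receiver, method_name, args, method_cache, slow_path):
    key = (type(receiver), method_name)
    method = method_cache.get(key)
    if method is not None and all(is_simple(a) for a in args):
        return method(receiver, *args)             # fast path: one direct call
    return slow_path(receiver, method_name, args)  # generic continuation chain
```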
5. Reduce Type Variable Resolution
Current: 3.2M calls to typesStore.resolveVar
Target: Cache resolved variables, avoid repeated resolution
Expected impact: -5% runtime
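If the type store resolves a variable by following a chain of redirects, union-find-style path compression would make repeated resolution of the same variable effectively O(1). A toy Python model (assumption: slots either redirect to another variable or hold content, which may not match the interpreter's actual representation):

```python
# Toy model of resolveVar with path compression. An int slot means
# "redirects to that variable"; anything else is the resolved content.
def resolve_var(store, var):
    """Follow redirects to the representative slot, then compress the path."""
    root = var
    while isinstance(store[root], int):
        root = store[root]
    # Point every slot on the walked path directly at the root so the next
    # resolution of the same variable is a single lookup.
    while isinstance(store[var], int):
        store[var], var = root, store[var]
    return root

store = {0: 1, 1: 2, 2: ("content", "Int")}
assert resolve_var(store, 0) == 2
assert store[0] == 2  # path compressed: variable 0 now points straight at 2
```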
6. Binary Operator Optimization
Current: cont.binop_apply at 16.9%
Target: Inline common operations (int comparison, addition)
Expected impact: -10% runtime
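A minimal sketch of the inlining idea in Python (hypothetical dispatch function; the real change would be inside applyContinuation): handle the common already-unboxed integer cases directly and only fall back to the continuation machinery for everything else.

```python
# Hypothetical inline fast path for common binary operators; anything that is
# not a plain machine integer falls back to the continuation-based slow path.
def apply_binop(op, lhs, rhs, slow_path):
    if isinstance(lhs, int) and isinstance(rhs, int):
        if op == "+":
            return lhs + rhs
        if op == ">":
            return lhs > rhs
    return slow_path(op, lhs, rhs)
```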
Lower Priority
7. Tuple Collection Optimization
- Pre-allocate tuple memory
- Avoid intermediate allocations
8. Pattern Matching Optimization
- Move semantics instead of copy
- Avoid refcount increment when source is consumed
Relevant Code Locations
| Function | File | Line |
|---|---|---|
| `resolveMethodFunction` | interpreter.zig | 7822 |
| `instantiateType` | interpreter.zig | 9131 |
| `applyContinuation` | interpreter.zig | 12983 |
| `scheduleExprEval` | interpreter.zig | 10292 |
| `pushCopy` | interpreter.zig | 893 |
| `copyToPtr` | StackValue.zig | 272 |
| `callLowLevelBuiltin` | interpreter.zig | 1158 |
| `cont.dot_access_resolve` | interpreter.zig | 15272 |
| `cont.dot_access_collect_args` | interpreter.zig | 15644 |
Next Steps
Completed ✅
- ~~Investigate `resolveMethodFunction`~~ - Found that method closures were being re-evaluated on every call
- ~~Add method resolution caching~~ - Implemented `evaluated_method_cache` with 88% hit rate (74% reduction)
- ~~Pre-allocate HashMaps~~ - Eliminated 100-200 warmup allocations
- ~~Implement tuple scratch buffers~~ - Eliminated 66,033 allocations (16.8% reduction)
Current Investigation
High Priority: Type Variable Explosion
Problem: Creating 578,522 type variables for processing 1,000 elements
Breakdown:
- `typesStore.fresh`: 271,813 calls
- `typesStore.freshFromContent`: 306,709 calls
- Total: 578,522 type variables created
Questions to investigate:
- Why do we create 578 type variables per element processed?
- Are these type variables truly necessary for correctness?
- Can we cache or reuse type variables for monomorphic cases?
- Is polymorphic instantiation creating excessive type variables?
Expected: For a simple fold_rev with tuples, we might expect:
- ~22,000 type variables for tuples (one per tuple)
- ~35,000 for function instantiation
- Total expected: ~60,000-80,000
Actual: 578,522 (7-10x more than expected!)
Remaining Targets 🎯
Medium Priority: Remaining Memory Allocations (~327K)
With tuple allocations eliminated, remaining sources:
- Type store growth: ~60,000-100,000 allocations (efficient exponential growth)
- HashMap growth: ~100-200 allocations (now pre-allocated)
- Other operations: ~200,000+ allocations (to be profiled)
Potential approach: Cache instantiated types for identical (function_type, argument_types) pairs when types are fully concrete.
Medium Priority: Binary Operator Overhead (2.8%)
Problem: cont.binop_apply has high per-call cost.
Solution: Inline common operations (int comparison, addition) without continuation overhead.
Lower Priority
- Tuple operations: Pre-allocate tuple memory, avoid intermediate copies
- `typesStore.resolveVar`: 2.8M calls - consider caching resolved results
Appendix: Raw Tracy Data
See bench_fold_append_tracy.csv for complete profiling data.
Data Collection Method
```sh
# Build with Tracy
zig build roc -Dtracy=$HOME/git/wolfpld/tracy -Dtracy-callstack=true

# Run benchmark
cd benchmarks
roc build bench_fold_append.roc
echo 1000 | ./bench_fold_append

# Export from Tracy GUI to CSV
# save bench_fold_append.tracy from the GUI, and then
tracy-csvexport bench_fold_append.tracy > bench_fold_append_tracy.csv
```