
Looking at the performance profile of my AoC day 3 solution

Open · mpizenberg opened this issue 1 month ago · 1 comment

WIP: do NOT merge

I have set up a small benchmark to look at how Roc performs compared to Python. The details of this benchmark are provided in this message on Zulip: #advent of code > 2025 Day 3 @ 💬

Out of curiosity, I’ve looked into using Tracy to get a better view of the bottlenecks, and it’s clearly still memory: allocation/deallocation accounts for roughly 40% of the budget. Tracy is pretty cool!

In the next comment below, I’ll add the current state of my analysis, written with the help of Claude.

mpizenberg · Dec 15 '25, 19:12

Roc Interpreter Performance Analysis

Date: December 16, 2025 (Updated)

Benchmark: bench_fold_append.roc with n=1000

Profiler: Tracy


Executive Summary

The Roc interpreter shows significant performance overhead compared to Python for a simple fold+append benchmark. Through systematic profiling and optimization, we've identified and addressed the major bottlenecks:

Completed Optimizations

  1. ✅ Method resolution: ~~24%~~ → 6% after caching (74% reduction)
  2. ✅ Memory management: ~~393,465 allocations~~ → 327,492 after tuple scratch buffers (16.8% reduction)
  3. ✅ HashMap initialization: Pre-allocated hot caches to avoid repeated small reallocations

Remaining Bottlenecks

  1. Type system operations: 578,522 type variables created for 1,000 elements
  2. Memory management: Still ~327K allocations (mostly type store growth)
  3. Dot access overhead: ~22% of runtime spent handling method calls on values

The actual list operations (callLowLevelBuiltin) account for only 0.76% of runtime.


Optimization Progress

✅ Optimization 1: Method Closure Caching

Problem: resolveMethodFunction was called 102,797 times, re-evaluating method closures on every call even for the same method on the same type.

Solution: Added evaluated_method_cache in interpreter.zig that caches evaluated method closures by (origin_module, nominal_ident, method_name_ident, receiver_layout_idx).
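
To make the pattern concrete, here is a minimal standalone sketch of this kind of cache, not the actual interpreter.zig code: the key bundles the four identifiers named above, and getOrPut ensures the expensive resolution only runs on a miss. MethodKey, expensiveResolve, and the u64 payload are placeholders.

```zig
const std = @import("std");

// Hypothetical sketch of the caching pattern; names and payload are stand-ins.
const MethodKey = struct {
    origin_module: u32,
    nominal_ident: u32,
    method_name_ident: u32,
    receiver_layout_idx: u32,
};

fn expensiveResolve(key: MethodKey) u64 {
    // Placeholder for the real work (evaluating the method closure).
    return @as(u64, key.origin_module) * 1_000 + key.method_name_ident;
}

pub fn main() !void {
    var cache = std.AutoHashMap(MethodKey, u64).init(std.heap.page_allocator);
    defer cache.deinit();
    try cache.ensureTotalCapacity(64); // mirrors the pre-allocation in Optimization 2

    const key = MethodKey{
        .origin_module = 1,
        .nominal_ident = 7,
        .method_name_ident = 42,
        .receiver_layout_idx = 3,
    };

    // Miss: resolve once and remember the result.
    const gop = try cache.getOrPut(key);
    if (!gop.found_existing) gop.value_ptr.* = expensiveResolve(key);

    // Any later call with the same key is a cheap hash lookup.
    std.debug.print("resolved: {}\n", .{cache.get(key).?});
}
```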

Results:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| resolveMethodFunction % | 23.6% | 6.14% | 74% reduction |
| Mean time per call | 23,755 ns | 5,152 ns | 4.6x faster |
| Cache hit rate | N/A | 88% | 89,232 hits / 101,277 calls |
| resolveMethod.evalWithExpectedType calls | 102,763 | 12,045 | 88% reduction |

Code location: src/eval/interpreter.zig lines 206-226 (cache types), 7913-7953 (cache lookup), 8012-8046 (cache population)


✅ Optimization 2: HashMap Pre-allocation

Problem: Hot HashMaps (translate_cache, poly_cache, method caches) started with capacity 0, causing multiple small reallocations during warmup.

Solution: Pre-allocate capacity for frequently-used HashMaps in Interpreter.init():

```zig
try result.translate_cache.ensureTotalCapacity(512);
try result.poly_cache.ensureTotalCapacity(128);
try result.method_resolution_cache.ensureTotalCapacity(64);
try result.evaluated_method_cache.ensureTotalCapacity(64);
try result.flex_type_context.ensureTotalCapacity(256);
try result.rigid_subst.ensureTotalCapacity(128);
try result.instantiate_scratch.ensureTotalCapacity(64);
```

Results:

  • Eliminates ~100-200 allocations during warmup
  • Reduces HashMap reallocations from ~5-7 per map to 0-1
  • Minimal memory overhead (~20KB total)

Code location: src/eval/interpreter.zig lines 555-563


✅ Optimization 3: Tuple Scratch Buffers

Problem: Each tuple finalization allocated 3 temporary arrays (layouts, values, rt_vars), immediately freed after use. For 22,011 tuples, this caused 66,033 allocations.

Solution: Added reusable scratch buffers to the Interpreter struct:

```zig
tuple_scratch_layouts: std.array_list.Managed(Layout),
tuple_scratch_values: std.array_list.Managed(StackValue),
tuple_scratch_rt_vars: std.array_list.Managed(types.Var),
```

Each tuple clears and reuses these buffers instead of allocating:

```zig
self.tuple_scratch_layouts.clearRetainingCapacity();
try self.tuple_scratch_layouts.resize(total_count);
var elem_layouts = self.tuple_scratch_layouts.items;
```

Results:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Total allocations | 393,465 | 327,492 | -65,973 (-16.8%) |
| Total frees | 380,984 | 315,008 | -65,976 (-16.8%) |
| Tuple temp arrays | 22,011 | 0 | -22,011 (-100%) |

Code location: src/eval/interpreter.zig lines 385-388 (struct fields), 551-553 (init), 13369-13384 (usage)


Current Profile (After All Optimizations)

Key Metrics

| Metric | Original | After Optimizations | Improvement |
| --- | --- | --- | --- |
| Total allocations | 393,465 | 327,492 | -16.8% ✅ |
| Total frees | 380,984 | 315,008 | -16.8% ✅ |
| Method resolution time | 23.6% | 6.1% | -74% ✅ |

Remaining Bottleneck Categories

| Category | % of Total | Key Functions | Status |
| --- | --- | --- | --- |
| Memory Management | ~35% | alloc (19%), free (16%) | 🟡 Reduced from 39% |
| Type System Operations | ~30% | typesStore.fresh*, instantiateType | 🔴 Investigate |
| Dot Access | ~20% | cont.dot_access_* | 🟡 Acceptable |
| Method Resolution | 6.1% | resolveMethodFunction | ✅ Optimized |
| Tuple Operations | 3.5% | cont.tuple_collect | ✅ Optimized |

Why instantiateType Calls Remain High

The 35,064 instantiateType calls break down as:

  • 23,041 calls from sched.call - Required for every closure invocation
  • ~12,000 calls from cont.dot_access_collect_args - Only on cache misses

This is expected behavior: type instantiation is needed for proper polymorphism when invoking functions. The method caching successfully reduced redundant type work from method resolution.


Benchmark Description

Roc Code (bench_fold_append.roc)

```roc
benchmark_fold_append = |digits| {
    (_m, max_rest) = digits.fold_rev((0, []), |d, (m, rest)| {
        new_max = if d > m { d } else { m }
        (new_max, rest.append((d, m)))
    })
    max_rest
}
```

Python Equivalent (bench_fold_append.py)

```python
def benchmark_fold_append(digits):
    max_val = 0
    max_rest = []
    for d in reversed(digits):
        new_max = d if d > max_val else max_val
        max_rest.append((d, max_val))
        max_val = new_max
    return max_rest
```

Observed Performance

| Size | Roc (s) | Python (s) | Ratio |
| --- | --- | --- | --- |
| 1000 | ~0.5 | ~0.001 | ~500x slower |

Tracy Profiling Results

Top 20 Functions by Time

| Rank | Function | % Time | Calls | Mean (ns) | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | applyContinuation | 73.3% | 716,580 | 10,559 | Main dispatch loop |
| 2 | alloc | 25.6% | 563,830 | 4,680 | Memory allocation |
| 3 | scheduleExprEval | 25.5% | 625,541 | 4,203 | Expression scheduling |
| 4 | evalWithExpectedType | 24.0% | 114,874 | 21,601 | Type-directed eval |
| 5 | resolveMethodFunction | 23.6% | 102,797 | 23,755 | Method lookup |
| 6 | free | 22.8% | 563,394 | 4,182 | Memory deallocation |
| 7 | cont.binop_apply | 16.9% | 68,870 | 25,270 | Binary operators |
| 8 | sched.call | 15.2% | 23,041 | 68,154 | Call scheduling |
| 9 | instantiateType | 14.0% | 35,064 | 41,311 | Type instantiation |
| 10 | cont.dot_access_collect_args | 13.5% | 12,034 | 116,017 | Dot access args |
| 11 | cont.reassign_value | 13.1% | 34,955 | 38,670 | Var reassignment |
| 12 | cont.dot_access_resolve | 8.7% | 33,927 | 26,505 | Dot access resolution |
| 13 | cont.tuple_collect | 8.4% | 66,033 | 13,069 | Tuple construction |
| 14 | sched.call.anno_check | 6.0% | 22,034 | 28,236 | Annotation checking |
| 15 | typesStore.resolveVar | 4.8% | 3,204,487 | 154 | 3.2M calls! |
| 16 | evalLookupLocal | 4.5% | 225,921 | 2,044 | Local var lookup |
| 17 | cont.call_invoke_closure | 4.9% | 23,041 | 21,809 | Closure invocation |
| 18 | unifyGuarded | 3.6% | 103,270 | 3,566 | Type unification |
| 19 | cont.bind_decl | 3.6% | 34,015 | 10,829 | Declaration binding |
| 20 | translateTypeVar | 3.6% | 528,012 | 696 | Type translation |

Categorized Time Breakdown

| Category | % of Total | Key Functions |
| --- | --- | --- |
| Memory Management | 48.4% | alloc (25.6%), free (22.8%) |
| Method/Dot Access | 45.8% | resolveMethodFunction (23.6%), cont.dot_access_* (22.2%) |
| Type System | ~25% | instantiateType (14%), unify* (~11%), translateTypeVar (3.6%) |
| Tuple Operations | ~11% | cont.tuple_collect (8.4%), tuple_collect.* (3%) |
| Binary Operators | 16.9% | cont.binop_apply |
| Actual List Ops | 0.63% | callLowLevelBuiltin |

Detailed Analysis

1. Memory Management (48.4%)

Observation: 563,830 allocations and 563,394 frees for processing 1,000 elements.

Per-iteration allocations: ~563 allocations per element processed.

Root causes:

  • Every tuple (new_max, rest.append(...)) allocates
  • Every closure invocation may allocate for bindings
  • Pattern matching creates copies via pushCopy
  • Type instantiation allocates new type variables

Evidence:

```
alloc: 563,830 calls, 4,680 ns mean
free:  563,394 calls, 4,182 ns mean
```

2. Method Resolution (23.6%)

Observation: resolveMethodFunction called 102,797 times (103x per iteration).

Problem: Every call to rest.append(...) triggers method resolution:

  1. Look up the method name "append"
  2. Find the implementation for the receiver type
  3. Resolve any polymorphic constraints

Evidence:

```
resolveMethodFunction: 102,797 calls, 23,755 ns mean = 2.44 seconds total
```

Expected: Method resolution should be cached after first lookup for the same type.

3. Type Instantiation (14.0%)

Observation: instantiateType called 35,064 times with 41,311 ns mean.

Problem: Polymorphic functions like append : List a, a -> List a require type instantiation on each call. The interpreter appears to re-instantiate types even for monomorphic call sites.

Evidence:

```
instantiateType: 35,064 calls, 41,311 ns mean = 1.45 seconds total
typesStore.resolveVar: 3,204,487 calls (3.2 million!)
```

4. Dot Access Overhead (22.2%)

Observation: cont.dot_access_collect_args and cont.dot_access_resolve together take 22.2%.

Problem: The dot access path (rest.append(...)) involves:

  1. Evaluate receiver (rest)
  2. Collect arguments
  3. Resolve method
  4. Create closure call

This multi-step process is expensive for simple method calls.

Evidence:

```
cont.dot_access_collect_args: 12,034 calls, 116,017 ns mean
cont.dot_access_resolve: 33,927 calls, 26,505 ns mean
```

5. Type System Operations (~25% combined)

Observation: Unification and type operations account for significant overhead.

| Function | Calls | Time % |
| --- | --- | --- |
| unifyGuarded | 103,270 | 3.6% |
| unifyVars | 58,269 | 2.8% |
| unifyFlatType | 57,168 | 2.5% |
| unifyStructure | 58,168 | 2.6% |
| unifyNominalType | 35,137 | 1.1% |
| unifyTuple | 22,000 | 1.0% |

Problem: The interpreter performs runtime type checking/unification that a compiled version would resolve at compile time.

6. The Non-Problem: List Copying

Observation: callLowLevelBuiltin (which handles list_append) is only 0.63%.

Initial hypothesis was wrong: I initially suspected O(nΒ²) list copying due to refcount > 1 preventing in-place mutation. While this may still occur, it's not the dominant cost.

Evidence:

```
callLowLevelBuiltin: 114,797 calls, 568 ns mean = 0.065 seconds
copyToPtr: 403,917 calls, 114 ns mean = 0.046 seconds
```

Optimization Recommendations

High Priority (>10% impact potential)

1. Cache Method Resolution

Current: resolveMethodFunction called 102,797 times for the same method

Target: Cache by (receiver_type, method_name) -> resolved_function

Expected impact: -20% runtime

2. Reduce Allocations

Current: 563K allocations for 1K elements

Target:

  • Pool allocators for common sizes
  • Stack allocation for small temporaries
  • Avoid copying in pattern matching when the value is consumed

Expected impact: -30% runtime
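
As a rough sketch of the pool-allocator idea, independent of the interpreter's actual data structures, Zig's std.heap.MemoryPool recycles fixed-size slots so a hot loop stops paying a full alloc/free pair per temporary. TupleTemp here is a hypothetical stand-in for a small per-iteration temporary.

```zig
const std = @import("std");

// Hypothetical small temporary; the real interpreter values look different.
const TupleTemp = struct { layout_idx: u32, rt_var: u32, value: u64 };

pub fn main() !void {
    // A pool hands out fixed-size slots and reuses destroyed ones,
    // instead of a fresh alloc/free pair per temporary.
    var pool = std.heap.MemoryPool(TupleTemp).init(std.heap.page_allocator);
    defer pool.deinit();

    var i: u64 = 0;
    while (i < 1000) : (i += 1) {
        const tmp = try pool.create();
        tmp.* = .{ .layout_idx = 1, .rt_var = 2, .value = i };
        // ... use tmp for the current iteration ...
        pool.destroy(tmp); // slot goes back on the free list for the next iteration
    }
}
```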

3. Cache Type Instantiation

Current: instantiateType called 35K times

Target: Cache instantiated types for monomorphic call sites

Expected impact: -10% runtime

Medium Priority (5-10% impact potential)

4. Optimize Dot Access Path

Current: Multi-continuation chain for method calls

Target: Fast path for simple method calls

Expected impact: -10% runtime

5. Reduce Type Variable Resolution

Current: 3.2M calls to typesStore.resolveVar

Target: Cache resolved variables, avoid repeated resolution

Expected impact: -5% runtime
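
One way to "cache resolved variables" is classic union-find path compression. The sketch below assumes the type store is a table of slots that are either roots or redirects, which may not match the real typesStore layout; Slot and resolveVar here are illustrative only.

```zig
const std = @import("std");

// Illustrative slot: either a root (holding content) or a redirect to another var.
const Slot = union(enum) {
    root: u32, // stand-in for the slot's type content
    redirect: u32, // index of the slot this var was unified into
};

fn resolveVar(slots: []Slot, start: u32) u32 {
    // Walk to the root of the redirect chain.
    var v = start;
    while (slots[v] == .redirect) v = slots[v].redirect;

    // Path compression: point every slot on the chain straight at the root,
    // so repeated resolutions of these vars become a single hop.
    var w = start;
    while (w != v) {
        const next = slots[w].redirect;
        slots[w] = .{ .redirect = v };
        w = next;
    }
    return v;
}

pub fn main() void {
    var slots = [_]Slot{ .{ .redirect = 1 }, .{ .redirect = 2 }, .{ .root = 99 } };
    std.debug.print("root slot = {}\n", .{resolveVar(&slots, 0)});
}
```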

6. Binary Operator Optimization

Current: cont.binop_apply at 16.9%

Target: Inline common operations (int comparison, addition)

Expected impact: -10% runtime
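
A hedged sketch of what such a fast path could look like, using placeholder types rather than the interpreter's real value and continuation machinery: when both operands are already unboxed integers, the operator is applied directly and the generic path is skipped.

```zig
const std = @import("std");

// Value, slowBinopApply, and the operator encoding are stand-ins,
// not interpreter.zig definitions.
const Value = union(enum) {
    int: i64,
    other: void,
};

fn slowBinopApply(op: u8, lhs: Value, rhs: Value) Value {
    // Placeholder for the existing continuation-based binop path.
    _ = op;
    _ = lhs;
    _ = rhs;
    return .{ .other = {} };
}

fn applyBinop(op: u8, lhs: Value, rhs: Value) Value {
    if (lhs == .int and rhs == .int) {
        // Fast path: operate directly on the machine integers.
        return switch (op) {
            '+' => .{ .int = lhs.int + rhs.int },
            '>' => .{ .int = @intFromBool(lhs.int > rhs.int) },
            else => slowBinopApply(op, lhs, rhs),
        };
    }
    return slowBinopApply(op, lhs, rhs);
}

pub fn main() void {
    const sum = applyBinop('+', .{ .int = 2 }, .{ .int = 3 });
    std.debug.print("2 + 3 = {}\n", .{sum.int});
}
```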

Lower Priority

7. Tuple Collection Optimization

  • Pre-allocate tuple memory
  • Avoid intermediate allocations

8. Pattern Matching Optimization

  • Move semantics instead of copy
  • Avoid refcount increment when source is consumed

Relevant Code Locations

| Function | File | Line |
| --- | --- | --- |
| resolveMethodFunction | interpreter.zig | 7822 |
| instantiateType | interpreter.zig | 9131 |
| applyContinuation | interpreter.zig | 12983 |
| scheduleExprEval | interpreter.zig | 10292 |
| pushCopy | interpreter.zig | 893 |
| copyToPtr | StackValue.zig | 272 |
| callLowLevelBuiltin | interpreter.zig | 1158 |
| cont.dot_access_resolve | interpreter.zig | 15272 |
| cont.dot_access_collect_args | interpreter.zig | 15644 |

Next Steps

Completed ✅

  1. ~~Investigate resolveMethodFunction~~ - Found that method closures were being re-evaluated on every call
  2. ~~Add method resolution caching~~ - Implemented evaluated_method_cache with 88% hit rate (74% reduction)
  3. ~~Pre-allocate HashMaps~~ - Eliminated 100-200 warmup allocations
  4. ~~Implement tuple scratch buffers~~ - Eliminated 66,033 allocations (16.8% reduction)

Current Investigation 🔍

High Priority: Type Variable Explosion

Problem: Creating 578,522 type variables for processing 1,000 elements

Breakdown:

  • typesStore.fresh: 271,813 calls
  • typesStore.freshFromContent: 306,709 calls
  • Total: 578,522 type variables created

Questions to investigate:

  1. Why do we create 578 type variables per element processed?
  2. Are these type variables truly necessary for correctness?
  3. Can we cache or reuse type variables for monomorphic cases?
  4. Is polymorphic instantiation creating excessive type variables?

Expected: For a simple fold_rev with tuples, we might expect:

  • ~22,000 type variables for tuples (one per tuple)
  • ~35,000 for function instantiation
  • Total expected: ~60,000-80,000

Actual: 578,522 (7-10x more than expected!)

Remaining Targets 🎯

Medium Priority: Remaining Memory Allocations (~327K)

With tuple allocations eliminated, remaining sources:

  • Type store growth: ~60,000-100,000 allocations (efficient exponential growth)
  • HashMap growth: ~100-200 allocations (now pre-allocated)
  • Other operations: ~200,000+ allocations (to be profiled)

Potential approach: Cache instantiated types for identical (function_type, argument_types) pairs when types are fully concrete.
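
A minimal sketch of that idea, with all names hypothetical: key the cache by the function's type variable plus a hash of the concrete argument layouts, and only fall through to real instantiation on a miss.

```zig
const std = @import("std");

// InstKey and the cached u32 payload are placeholders, not interpreter types.
const InstKey = struct { func_type_var: u32, args_hash: u64 };

fn hashArgLayouts(layout_indices: []const u32) u64 {
    // Hash the concrete argument layout indices into a single key component.
    return std.hash.Wyhash.hash(0, std.mem.sliceAsBytes(layout_indices));
}

pub fn main() !void {
    var cache = std.AutoHashMap(InstKey, u32).init(std.heap.page_allocator);
    defer cache.deinit();

    const args = [_]u32{ 5, 9 }; // concrete layout indices of the arguments
    const key = InstKey{ .func_type_var = 1234, .args_hash = hashArgLayouts(&args) };

    const gop = try cache.getOrPut(key);
    if (!gop.found_existing) {
        // Miss: run the real instantiation once and remember the resulting var.
        gop.value_ptr.* = 777; // placeholder for the instantiated type var
    }
    std.debug.print("instantiated var: {}\n", .{gop.value_ptr.*});
}
```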

Medium Priority: Binary Operator Overhead (2.8%)

Problem: cont.binop_apply has high per-call cost.

Solution: Inline common operations (int comparison, addition) without continuation overhead.

Lower Priority

  • Tuple operations: Pre-allocate tuple memory, avoid intermediate copies
  • typesStore.resolveVar: 2.8M calls - consider caching resolved results

Appendix: Raw Tracy Data

See bench_fold_append_tracy.csv for complete profiling data.

Data Collection Method

```sh
# Build with Tracy
zig build roc -Dtracy=$HOME/git/wolfpld/tracy -Dtracy-callstack=true

# Run benchmark
cd benchmarks
roc build bench_fold_append.roc
echo 1000 | ./bench_fold_append

# Export from Tracy GUI to CSV
# save bench_fold_append.tracy from the GUI, and then
tracy-csvexport bench_fold_append.tracy > bench_fold_append_tracy.csv
```

mpizenberg · Dec 15 '25, 19:12