
Looking at the performance profile of my AoC day 3 solution

Open · mpizenberg opened this issue 1 month ago · 1 comment

WIP: do NOT merge

I have set up a small benchmark to look at how Roc performs compared to Python. The details of this benchmark are provided in this message on Zulip: #advent of code > 2025 Day 3 @ 💬

Out of curiosity, I’ve looked into using Tracy to get a better view of the bottlenecks, and it’s clearly still memory: allocation/deallocation accounts for roughly 40% of the budget. Tracy is pretty cool!

In the next comment below, I’ll add the current state of my analysis, written with the help of Claude.

mpizenberg · Dec 15 '25, 19:12

Roc Interpreter Performance Analysis

Date: December 16, 2025 (Updated)

Benchmark: bench_fold_append.roc with n=1000

Profiler: Tracy


Executive Summary

The Roc interpreter shows significant performance overhead compared to Python for a simple fold+append benchmark. Through systematic profiling and optimization, we've identified and addressed the major bottlenecks:

Completed Optimizations

  1. ✅ Method resolution: ~~24%~~ → 6% after caching (74% reduction)
  2. ✅ Memory management: ~~393,465 allocations~~ → 327,492 after tuple scratch buffers (16.8% reduction)
  3. ✅ HashMap initialization: Pre-allocated hot caches to avoid repeated small reallocations

Remaining Bottlenecks

  1. Type system operations: 578,522 type variables created for 1,000 elements
  2. Memory management: Still ~327K allocations (mostly type store growth)
  3. Dot access overhead: ~22% of runtime spent handling method calls on values

The actual list operations (callLowLevelBuiltin) account for only 0.76% of runtime.


Optimization Progress

✅ Optimization 1: Method Closure Caching

Problem: resolveMethodFunction was called 102,797 times, re-evaluating method closures on every call even for the same method on the same type.

Solution: Added evaluated_method_cache in interpreter.zig that caches evaluated method closures by (origin_module, nominal_ident, method_name_ident, receiver_layout_idx).
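
To make the pattern concrete, here is a minimal standalone sketch of this kind of cache, not the actual interpreter.zig code: the key bundles the four identifiers named above, and getOrPut ensures the expensive resolution only runs on a miss. MethodKey, expensiveResolve, and the u64 payload are placeholders.

```zig
const std = @import("std");

// Hypothetical sketch of the caching pattern; names and payload are stand-ins.
const MethodKey = struct {
    origin_module: u32,
    nominal_ident: u32,
    method_name_ident: u32,
    receiver_layout_idx: u32,
};

fn expensiveResolve(key: MethodKey) u64 {
    // Placeholder for the real work (evaluating the method closure).
    return @as(u64, key.origin_module) * 1_000 + key.method_name_ident;
}

pub fn main() !void {
    var cache = std.AutoHashMap(MethodKey, u64).init(std.heap.page_allocator);
    defer cache.deinit();
    try cache.ensureTotalCapacity(64); // mirrors the pre-allocation in Optimization 2

    const key = MethodKey{
        .origin_module = 1,
        .nominal_ident = 7,
        .method_name_ident = 42,
        .receiver_layout_idx = 3,
    };

    // Miss: resolve once and remember the result.
    const gop = try cache.getOrPut(key);
    if (!gop.found_existing) gop.value_ptr.* = expensiveResolve(key);

    // Any later call with the same key is a cheap hash lookup.
    std.debug.print("resolved: {}\n", .{cache.get(key).?});
}
```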

Results:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| resolveMethodFunction % | 23.6% | 6.14% | 74% reduction |
| Mean time per call | 23,755 ns | 5,152 ns | 4.6x faster |
| Cache hit rate | N/A | 88% | 89,232 hits / 101,277 calls |
| resolveMethod.evalWithExpectedType calls | 102,763 | 12,045 | 88% reduction |

Code location: src/eval/interpreter.zig lines 206-226 (cache types), 7913-7953 (cache lookup), 8012-8046 (cache population)


✅ Optimization 2: HashMap Pre-allocation

Problem: Hot HashMaps (translate_cache, poly_cache, method caches) started with capacity 0, causing multiple small reallocations during warmup.

Solution: Pre-allocate capacity for frequently-used HashMaps in Interpreter.init():

```zig
try result.translate_cache.ensureTotalCapacity(512);
try result.poly_cache.ensureTotalCapacity(128);
try result.method_resolution_cache.ensureTotalCapacity(64);
try result.evaluated_method_cache.ensureTotalCapacity(64);
try result.flex_type_context.ensureTotalCapacity(256);
try result.rigid_subst.ensureTotalCapacity(128);
try result.instantiate_scratch.ensureTotalCapacity(64);
```

Results:

  • Eliminates ~100-200 allocations during warmup
  • Reduces HashMap reallocations from ~5-7 per map to 0-1
  • Minimal memory overhead (~20KB total)

Code location: src/eval/interpreter.zig lines 555-563


✅ Optimization 3: Tuple Scratch Buffers

Problem: Each tuple finalization allocated 3 temporary arrays (layouts, values, rt_vars), immediately freed after use. For 22,011 tuples, this caused 66,033 allocations.

Solution: Added reusable scratch buffers to the Interpreter struct:

```zig
tuple_scratch_layouts: std.array_list.Managed(Layout),
tuple_scratch_values: std.array_list.Managed(StackValue),
tuple_scratch_rt_vars: std.array_list.Managed(types.Var),
```

Each tuple clears and reuses these buffers instead of allocating:

```zig
self.tuple_scratch_layouts.clearRetainingCapacity();
try self.tuple_scratch_layouts.resize(total_count);
var elem_layouts = self.tuple_scratch_layouts.items;
```

Results:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Total allocations | 393,465 | 327,492 | -65,973 (-16.8%) |
| Total frees | 380,984 | 315,008 | -65,976 (-16.8%) |
| Tuple temp arrays | 22,011 | 0 | -22,011 (-100%) |

Code location: src/eval/interpreter.zig lines 385-388 (struct fields), 551-553 (init), 13369-13384 (usage)


Current Profile (After All Optimizations)

Key Metrics

| Metric | Original | After Optimizations | Improvement |
| --- | --- | --- | --- |
| Total allocations | 393,465 | 327,492 | -16.8% ✅ |
| Total frees | 380,984 | 315,008 | -16.8% ✅ |
| Method resolution time | 23.6% | 6.1% | -74% ✅ |

Remaining Bottleneck Categories

| Category | % of Total | Key Functions | Status |
| --- | --- | --- | --- |
| Memory Management | ~35% | alloc (19%), free (16%) | 🟡 Reduced from 39% |
| Type System Operations | ~30% | typesStore.fresh*, instantiateType | 🔴 Investigate |
| Dot Access | ~20% | cont.dot_access_* | 🟡 Acceptable |
| Method Resolution | 6.1% | resolveMethodFunction | ✅ Optimized |
| Tuple Operations | 3.5% | cont.tuple_collect | ✅ Optimized |

Why instantiateType Calls Remain High

The 35,064 instantiateType calls break down as:

  • 23,041 calls from sched.call - Required for every closure invocation
  • ~12,000 calls from cont.dot_access_collect_args - Only on cache misses

This is expected behavior: type instantiation is needed for proper polymorphism when invoking functions. The method caching successfully reduced redundant type work from method resolution.


Benchmark Description

Roc Code (bench_fold_append.roc)

```roc
benchmark_fold_append = |digits| {
    (_m, max_rest) = digits.fold_rev((0, []), |d, (m, rest)| {
        new_max = if d > m { d } else { m }
        (new_max, rest.append((d, m)))
    })
    max_rest
}
```

Python Equivalent (bench_fold_append.py)

```python
def benchmark_fold_append(digits):
    max_val = 0
    max_rest = []
    for d in reversed(digits):
        new_max = d if d > max_val else max_val
        max_rest.append((d, max_val))
        max_val = new_max
    return max_rest
```

Observed Performance

| Size | Roc (s) | Python (s) | Ratio |
| --- | --- | --- | --- |
| 1000 | ~0.5 | ~0.001 | ~500x slower |

Tracy Profiling Results

Top 20 Functions by Time

| Rank | Function | % Time | Calls | Mean (ns) | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | applyContinuation | 73.3% | 716,580 | 10,559 | Main dispatch loop |
| 2 | alloc | 25.6% | 563,830 | 4,680 | Memory allocation |
| 3 | scheduleExprEval | 25.5% | 625,541 | 4,203 | Expression scheduling |
| 4 | evalWithExpectedType | 24.0% | 114,874 | 21,601 | Type-directed eval |
| 5 | resolveMethodFunction | 23.6% | 102,797 | 23,755 | Method lookup |
| 6 | free | 22.8% | 563,394 | 4,182 | Memory deallocation |
| 7 | cont.binop_apply | 16.9% | 68,870 | 25,270 | Binary operators |
| 8 | sched.call | 15.2% | 23,041 | 68,154 | Call scheduling |
| 9 | instantiateType | 14.0% | 35,064 | 41,311 | Type instantiation |
| 10 | cont.dot_access_collect_args | 13.5% | 12,034 | 116,017 | Dot access args |
| 11 | cont.reassign_value | 13.1% | 34,955 | 38,670 | Var reassignment |
| 12 | cont.dot_access_resolve | 8.7% | 33,927 | 26,505 | Dot access resolution |
| 13 | cont.tuple_collect | 8.4% | 66,033 | 13,069 | Tuple construction |
| 14 | sched.call.anno_check | 6.0% | 22,034 | 28,236 | Annotation checking |
| 15 | typesStore.resolveVar | 4.8% | 3,204,487 | 154 | 3.2M calls! |
| 16 | evalLookupLocal | 4.5% | 225,921 | 2,044 | Local var lookup |
| 17 | cont.call_invoke_closure | 4.9% | 23,041 | 21,809 | Closure invocation |
| 18 | unifyGuarded | 3.6% | 103,270 | 3,566 | Type unification |
| 19 | cont.bind_decl | 3.6% | 34,015 | 10,829 | Declaration binding |
| 20 | translateTypeVar | 3.6% | 528,012 | 696 | Type translation |

Categorized Time Breakdown

| Category | % of Total | Key Functions |
| --- | --- | --- |
| Memory Management | 48.4% | alloc (25.6%), free (22.8%) |
| Method/Dot Access | 45.8% | resolveMethodFunction (23.6%), cont.dot_access_* (22.2%) |
| Type System | ~25% | instantiateType (14%), unify* (~11%), translateTypeVar (3.6%) |
| Tuple Operations | ~11% | cont.tuple_collect (8.4%), tuple_collect.* (3%) |
| Binary Operators | 16.9% | cont.binop_apply |
| Actual List Ops | 0.63% | callLowLevelBuiltin |

Detailed Analysis

1. Memory Management (48.4%)

Observation: 563,830 allocations and 563,394 frees for processing 1,000 elements.

Per-iteration allocations: ~563 allocations per element processed.

Root causes:

  • Every tuple (new_max, rest.append(...)) allocates
  • Every closure invocation may allocate for bindings
  • Pattern matching creates copies via pushCopy
  • Type instantiation allocates new type variables

Evidence:

```
alloc: 563,830 calls, 4,680 ns mean
free:  563,394 calls, 4,182 ns mean
```

2. Method Resolution (23.6%)

Observation: resolveMethodFunction called 102,797 times (103x per iteration).

Problem: Every call to rest.append(...) triggers method resolution:

  1. Look up the method name "append"
  2. Find the implementation for the receiver type
  3. Resolve any polymorphic constraints

Evidence:

```
resolveMethodFunction: 102,797 calls, 23,755 ns mean = 2.44 seconds total
```

Expected: Method resolution should be cached after first lookup for the same type.

3. Type Instantiation (14.0%)

Observation: instantiateType called 35,064 times with 41,311 ns mean.

Problem: Polymorphic functions like append : List a, a -> List a require type instantiation on each call. The interpreter appears to re-instantiate types even for monomorphic call sites.

Evidence:

```
instantiateType: 35,064 calls, 41,311 ns mean = 1.45 seconds total
typesStore.resolveVar: 3,204,487 calls (3.2 million!)
```

4. Dot Access Overhead (22.2%)

Observation: cont.dot_access_collect_args and cont.dot_access_resolve together take 22.2%.

Problem: The dot access path (rest.append(...)) involves:

  1. Evaluate receiver (rest)
  2. Collect arguments
  3. Resolve method
  4. Create closure call

This multi-step process is expensive for simple method calls.

Evidence:

```
cont.dot_access_collect_args: 12,034 calls, 116,017 ns mean
cont.dot_access_resolve: 33,927 calls, 26,505 ns mean
```

5. Type System Operations (~25% combined)

Observation: Unification and type operations account for significant overhead.

| Function | Calls | Time % |
| --- | --- | --- |
| unifyGuarded | 103,270 | 3.6% |
| unifyVars | 58,269 | 2.8% |
| unifyFlatType | 57,168 | 2.5% |
| unifyStructure | 58,168 | 2.6% |
| unifyNominalType | 35,137 | 1.1% |
| unifyTuple | 22,000 | 1.0% |

Problem: The interpreter performs runtime type checking/unification that a compiled version would resolve at compile time.

6. The Non-Problem: List Copying

Observation: callLowLevelBuiltin (which handles list_append) is only 0.63%.

Initial hypothesis was wrong: I initially suspected O(nΒ²) list copying due to refcount > 1 preventing in-place mutation. While this may still occur, it's not the dominant cost.

Evidence:

```
callLowLevelBuiltin: 114,797 calls, 568 ns mean = 0.065 seconds
copyToPtr: 403,917 calls, 114 ns mean = 0.046 seconds
```

Optimization Recommendations

High Priority (>10% impact potential)

1. Cache Method Resolution

Current: resolveMethodFunction called 102,797 times for the same method

Target: Cache by (receiver_type, method_name) -> resolved_function

Expected impact: -20% runtime

2. Reduce Allocations

Current: 563K allocations for 1K elements

Target:

  • Pool allocators for common sizes
  • Stack allocation for small temporaries
  • Avoid copying in pattern matching when the value is consumed

Expected impact: -30% runtime
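
As a rough sketch of the pool-allocator idea, independent of the interpreter's actual data structures, Zig's std.heap.MemoryPool recycles fixed-size slots so a hot loop stops paying a full alloc/free pair per temporary. TupleTemp here is a hypothetical stand-in for a small per-iteration temporary.

```zig
const std = @import("std");

// Hypothetical small temporary; the real interpreter values look different.
const TupleTemp = struct { layout_idx: u32, rt_var: u32, value: u64 };

pub fn main() !void {
    // A pool hands out fixed-size slots and reuses destroyed ones,
    // instead of a fresh alloc/free pair per temporary.
    var pool = std.heap.MemoryPool(TupleTemp).init(std.heap.page_allocator);
    defer pool.deinit();

    var i: u64 = 0;
    while (i < 1000) : (i += 1) {
        const tmp = try pool.create();
        tmp.* = .{ .layout_idx = 1, .rt_var = 2, .value = i };
        // ... use tmp for the current iteration ...
        pool.destroy(tmp); // slot goes back on the free list for the next iteration
    }
}
```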

3. Cache Type Instantiation

Current: instantiateType called 35K times

Target: Cache instantiated types for monomorphic call sites

Expected impact: -10% runtime

Medium Priority (5-10% impact potential)

4. Optimize Dot Access Path

Current: Multi-continuation chain for method calls

Target: Fast path for simple method calls

Expected impact: -10% runtime

5. Reduce Type Variable Resolution

Current: 3.2M calls to typesStore.resolveVar

Target: Cache resolved variables, avoid repeated resolution

Expected impact: -5% runtime
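
One way to "cache resolved variables" is classic union-find path compression. The sketch below assumes the type store is a table of slots that are either roots or redirects, which may not match the real typesStore layout; Slot and resolveVar here are illustrative only.

```zig
const std = @import("std");

// Illustrative slot: either a root (holding content) or a redirect to another var.
const Slot = union(enum) {
    root: u32, // stand-in for the slot's type content
    redirect: u32, // index of the slot this var was unified into
};

fn resolveVar(slots: []Slot, start: u32) u32 {
    // Walk to the root of the redirect chain.
    var v = start;
    while (slots[v] == .redirect) v = slots[v].redirect;

    // Path compression: point every slot on the chain straight at the root,
    // so repeated resolutions of these vars become a single hop.
    var w = start;
    while (w != v) {
        const next = slots[w].redirect;
        slots[w] = .{ .redirect = v };
        w = next;
    }
    return v;
}

pub fn main() void {
    var slots = [_]Slot{ .{ .redirect = 1 }, .{ .redirect = 2 }, .{ .root = 99 } };
    std.debug.print("root slot = {}\n", .{resolveVar(&slots, 0)});
}
```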

6. Binary Operator Optimization

Current: cont.binop_apply at 16.9%

Target: Inline common operations (int comparison, addition)

Expected impact: -10% runtime
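
A hedged sketch of what such a fast path could look like, using placeholder types rather than the interpreter's real value and continuation machinery: when both operands are already unboxed integers, the operator is applied directly and the generic path is skipped.

```zig
const std = @import("std");

// Value, slowBinopApply, and the operator encoding are stand-ins,
// not interpreter.zig definitions.
const Value = union(enum) {
    int: i64,
    other: void,
};

fn slowBinopApply(op: u8, lhs: Value, rhs: Value) Value {
    // Placeholder for the existing continuation-based binop path.
    _ = op;
    _ = lhs;
    _ = rhs;
    return .{ .other = {} };
}

fn applyBinop(op: u8, lhs: Value, rhs: Value) Value {
    if (lhs == .int and rhs == .int) {
        // Fast path: operate directly on the machine integers.
        return switch (op) {
            '+' => .{ .int = lhs.int + rhs.int },
            '>' => .{ .int = @intFromBool(lhs.int > rhs.int) },
            else => slowBinopApply(op, lhs, rhs),
        };
    }
    return slowBinopApply(op, lhs, rhs);
}

pub fn main() void {
    const sum = applyBinop('+', .{ .int = 2 }, .{ .int = 3 });
    std.debug.print("2 + 3 = {}\n", .{sum.int});
}
```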

Lower Priority

7. Tuple Collection Optimization

  • Pre-allocate tuple memory
  • Avoid intermediate allocations

8. Pattern Matching Optimization

  • Move semantics instead of copy
  • Avoid refcount increment when source is consumed

Relevant Code Locations

| Function | File | Line |
| --- | --- | --- |
| resolveMethodFunction | interpreter.zig | 7822 |
| instantiateType | interpreter.zig | 9131 |
| applyContinuation | interpreter.zig | 12983 |
| scheduleExprEval | interpreter.zig | 10292 |
| pushCopy | interpreter.zig | 893 |
| copyToPtr | StackValue.zig | 272 |
| callLowLevelBuiltin | interpreter.zig | 1158 |
| cont.dot_access_resolve | interpreter.zig | 15272 |
| cont.dot_access_collect_args | interpreter.zig | 15644 |

Next Steps

Completed ✅

  1. ~~Investigate resolveMethodFunction~~ - Found that method closures were being re-evaluated on every call
  2. ~~Add method resolution caching~~ - Implemented evaluated_method_cache with 88% hit rate (74% reduction)
  3. ~~Pre-allocate HashMaps~~ - Eliminated 100-200 warmup allocations
  4. ~~Implement tuple scratch buffers~~ - Eliminated 66,033 allocations (16.8% reduction)

Current Investigation 🔍

High Priority: Type Variable Explosion

Problem: Creating 578,522 type variables for processing 1,000 elements

Breakdown:

  • typesStore.fresh: 271,813 calls
  • typesStore.freshFromContent: 306,709 calls
  • Total: 578,522 type variables created

Questions to investigate:

  1. Why do we create 578 type variables per element processed?
  2. Are these type variables truly necessary for correctness?
  3. Can we cache or reuse type variables for monomorphic cases?
  4. Is polymorphic instantiation creating excessive type variables?

Expected: For a simple fold_rev with tuples, we might expect:

  • ~22,000 type variables for tuples (one per tuple)
  • ~35,000 for function instantiation
  • Total expected: ~60,000-80,000

Actual: 578,522 (7-10x more than expected!)

Remaining Targets 🎯

Medium Priority: Remaining Memory Allocations (~327K)

With tuple allocations eliminated, remaining sources:

  • Type store growth: ~60,000-100,000 allocations (efficient exponential growth)
  • HashMap growth: ~100-200 allocations (now pre-allocated)
  • Other operations: ~200,000+ allocations (to be profiled)

Potential approach: Cache instantiated types for identical (function_type, argument_types) pairs when types are fully concrete.
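
A minimal sketch of that idea, with all names hypothetical: key the cache by the function's type variable plus a hash of the concrete argument layouts, and only fall through to real instantiation on a miss.

```zig
const std = @import("std");

// InstKey and the cached u32 payload are placeholders, not interpreter types.
const InstKey = struct { func_type_var: u32, args_hash: u64 };

fn hashArgLayouts(layout_indices: []const u32) u64 {
    // Hash the concrete argument layout indices into a single key component.
    return std.hash.Wyhash.hash(0, std.mem.sliceAsBytes(layout_indices));
}

pub fn main() !void {
    var cache = std.AutoHashMap(InstKey, u32).init(std.heap.page_allocator);
    defer cache.deinit();

    const args = [_]u32{ 5, 9 }; // concrete layout indices of the arguments
    const key = InstKey{ .func_type_var = 1234, .args_hash = hashArgLayouts(&args) };

    const gop = try cache.getOrPut(key);
    if (!gop.found_existing) {
        // Miss: run the real instantiation once and remember the resulting var.
        gop.value_ptr.* = 777; // placeholder for the instantiated type var
    }
    std.debug.print("instantiated var: {}\n", .{gop.value_ptr.*});
}
```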

Medium Priority: Binary Operator Overhead (2.8%)

Problem: cont.binop_apply has high per-call cost.

Solution: Inline common operations (int comparison, addition) without continuation overhead.

Lower Priority

  • Tuple operations: Pre-allocate tuple memory, avoid intermediate copies
  • typesStore.resolveVar: 2.8M calls - consider caching resolved results

Appendix: Raw Tracy Data

See bench_fold_append_tracy.csv for complete profiling data.

Data Collection Method

```sh
# Build with Tracy
zig build roc -Dtracy=$HOME/git/wolfpld/tracy -Dtracy-callstack=true

# Run benchmark
cd benchmarks
roc build bench_fold_append.roc
echo 1000 | ./bench_fold_append

# Export from Tracy GUI to CSV
# save bench_fold_append.tracy from the GUI, and then
tracy-csvexport bench_fold_append.tracy > bench_fold_append_tracy.csv
```

mpizenberg · Dec 15 '25, 19:12