VM performance improvements in function calls
Improve memory handling of function arguments in vm.VM by preallocating a single slice that holds the arguments for all function calls. The slice is sized by an estimate obtained by inspecting the program's bytecode.
There are, of course, several points to argue here:
- A program can return early and thus not consume all the allocated space. Answer: a single allocation covering slightly more elements than will probably be used is still worth it, since the overshoot is small and one allocation is cheaper in terms of GC pressure and runtime memory management than many small ones. I have also set a safety limit on the preallocation size just in case.
- We could use `program.Arguments` to get the exact number of arguments being passed. Answer: While this is true, it adds a little more computation, and an estimate works fairly well for most cases. We gain ~5% speed by estimating, and that is likely good enough in most situations.
- Programs with function calls in a predicate will probably not have enough preallocated space. Answer: Again, this optimization targets simple, straightforward programs and will cover many of the most common situations. Other programs should see no decrease in performance, since in that case we still allocate per call as before. In fact, programs with a predicate will still see a performance gain, because we allocate a bit less until the buffer is drained and only then fall back to allocating for each call.
In general, this optimization works well for many simple and common use cases and doesn't affect other cases.
Benchmark results:
```
goos: linux
goarch: amd64
pkg: github.com/expr-lang/expr/vm
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
                           │ bench-results-old.txt │       bench-results-new.txt        │
                           │        sec/op         │    sec/op     vs base              │
VM/name=function_calls-20              1.495µ ± 0%    1.277µ ± 1%  -14.58% (p=0.000 n=20)

                           │ bench-results-old.txt │       bench-results-new.txt        │
                           │         B/op          │     B/op       vs base             │
VM/name=function_calls-20             2.297Ki ± 0%   2.625Ki ± 0%  +14.29% (p=0.000 n=20)

                           │ bench-results-old.txt │       bench-results-new.txt        │
                           │       allocs/op       │   allocs/op    vs base             │
VM/name=function_calls-20              40.000 ± 0%    1.000 ± 0%   -97.50% (p=0.000 n=20)
```
I guess OpCall1, OpCall2, and OpCall3 are kind of the same way of avoiding buffer allocation. What is the speedup?
Also, v1.18 will probably be refactored to a new architecture ;)
> I guess OpCall1, OpCall2, and OpCall3 are kind of the same way of avoiding buffer allocation. What is the speedup?
At first, I also thought that OpCall1, OpCall2, and OpCall3 wouldn't allocate on the heap. But when I run the benchmarks, they do allocate on the heap, and they run slower.
The total speedup is about 15%, and in most cases allocations are reduced to a single one per run.
> Also, v1.18 will probably be refactored to a new architecture ;)
Nice, can't wait! Ping me if you need some help :)
@antonmedv I answered above; let me know if you want me to try a different approach, or whether it looks OK to merge.
Let me try to test it again, and run my benches as well.
Hi @antonmedv! Sorry to bother you; I wanted to ask whether I can help by providing better benchmarks. Or let me know if anything doesn't look good and I can improve it.
Thank you!
Hi! Sorry, I was sick for the last few weeks. I will get back to reviewing stuff.