Performance of function calls and future outlook
I tested out Gluon for the purpose of embedding game logic in serialized files that then get compiled at Rust runtime and executed repeatedly. This requires a certain speed.
I have to confess that I don't fully understand from the documentation how Gluon works here exactly, for example what the exact output of `gluon::ThreadExt::compile_script` is, and whether my perception of Gluon as "compile once, execute many times" is actually correct.
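To make that question concrete, here is a minimal sketch of the usage pattern I have in mind, assuming `ThreadExt::load_script` registers a script as a global module that `get_global` can then look up (the module name, script, and function here are just illustrations):

```rust
use gluon::{new_vm, vm::api::FunctionRef, ThreadExt};

fn main() {
    let vm = new_vm();

    // Compile the "game logic" once at startup. The module name "logic" and
    // the script body are made up for this sketch.
    vm.load_script(
        "logic",
        r#"
            let add x y : Int -> Int -> Int = x + y
            { add }
        "#,
    )
    .unwrap();

    // Look the compiled function up once...
    let mut add: FunctionRef<fn(i32, i32) -> i32> = vm.get_global("logic.add").unwrap();

    // ...then call it many times, e.g. once per frame.
    for frame in 0..3 {
        let result = add.call(frame, 1).unwrap();
        println!("frame {}: {}", frame, result);
    }
}
```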
An addition function in Gluon called from Rust was 250-290 times slower than a Rust function, which makes Gluon too slow for this kind of purpose. At 250 ns per call the function can execute only 4000 times per ms, and with only 16.6 ms available per frame this can get problematic for certain kinds of games.
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use gluon::{new_vm, vm::api::{FunctionRef, Hole, OpaqueValue}, Thread, ThreadExt};

fn criterion_benchmark(c: &mut Criterion) {
    // Plain Rust baseline.
    let closure = |x: i32, y: i32| x + y;
    c.bench_function("rust_add", |b| b.iter(|| closure(black_box(2), black_box(3))));

    // Gluon: load std.int once, then call its addition function repeatedly.
    let vm = new_vm();
    vm.run_expr::<OpaqueValue<&Thread, Hole>>("example", r#" import! std.int "#)
        .unwrap();
    let mut add: FunctionRef<fn(i32, i32) -> i32> = vm.get_global("std.int.num.(+)").unwrap();
    c.bench_function("gluon_add", |b| b.iter(|| add.call(2, 3)));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```
```
rust_add   time: [927.76 ps 934.23 ps 944.54 ps]
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

gluon_add  time: [258.57 ns 258.80 ns 259.04 ns]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
```
(I might be misunderstanding something about how this kind of benchmark should be done.)
How does Gluon's future look in terms of improving this kind of performance?
I recall an issue here that benchmarked factorial for Gluon and Lua, where Gluon was ~4 times slower. It's completely understandable to focus on language features before performance optimization, but it would be good to know:
- Can embedded function call performance get much better? (e.g. at least bridge the gap to Lua) or are there architectural constraints?
- If yes, how much work would improving performance here entail?
- Any idea where the bottleneck(s) is (are)? For example, it could be that most of the slowness is due to the embedding API while the function call itself within Gluon is very fast. I'm not sure what Gluon-Rust communication costs are implied by the embedding API.
- What chance is there to see a significant performance improvement in a year from now?
The benchmark would be better if it checked the difference between Lua and Gluon, but to make it a fair comparison to Rust it ought to ensure that the closure isn't inlined as well:

```rust
c.bench_function("rust_add", |b| {
    b.iter(|| black_box(closure)(black_box(2), black_box(3)))
});
```
Even then, `black_box` is no guarantee against inlining, so I wouldn't be surprised if it is still inlined and constant folded away.
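For example, one way to make the Rust baseline pay for a real call is to benchmark an `#[inline(never)]` function instead of the closure. This is only a sketch mirroring the benchmark above (the `add_no_inline` name is made up), not what was actually measured:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// #[inline(never)] forces an actual function call, so the baseline measures
// real call overhead instead of a constant-folded addition.
#[inline(never)]
fn add_no_inline(x: i32, y: i32) -> i32 {
    x + y
}

fn no_inline_benchmark(c: &mut Criterion) {
    c.bench_function("rust_add_no_inline", |b| {
        b.iter(|| add_no_inline(black_box(2), black_box(3)))
    });
}

criterion_group!(benches, no_inline_benchmark);
criterion_main!(benches);
```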
> Can embedded function call performance get much better? (e.g. at least bridge the gap to Lua) or are there architectural constraints?
The only fundamental problem I can think of is the need for `Mutex` locking in the vm, but even this might be solvable. It really just needs work to reduce the difference.
> If yes, how much work would improving performance here entail?
Probably not that much for Rust -> Gluon function calls really, though there is probably a need for more general improvements as well.
> Any idea where the bottleneck(s) is (are)?
The API ought to be very close to zero overhead. There may be some improvements to be made to stack handling to make it a bit faster, I guess (less pointer chasing). Most of the cost would come from the atomic operations in the mutex locking the Gluon vm and from setting up the stack frame.
The virtual machine is also stack based instead of register based, which likely penalizes this case quite a bit. In Lua I think this would be encoded as a single instruction adding registers `x` and `y`, whereas Gluon needs three instructions (`Push(x)`, `Push(y)`, `Add`).
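As a rough illustration of that difference (the opcode names below are made up for the sketch and are not the real Lua or Gluon instruction sets):

```rust
// Register-based encoding (Lua-style): one instruction performs the add.
enum RegisterOp {
    Add { dst: u8, lhs: u8, rhs: u8 }, // dst = lhs + rhs
}

// Stack-based encoding (Gluon-style): operands are pushed, then consumed.
enum StackOp {
    Push(u8), // push the value in local slot n onto the stack
    Add,      // pop two values, push their sum
}

fn main() {
    // Encoding `x + y` in each style.
    let register = vec![RegisterOp::Add { dst: 0, lhs: 1, rhs: 2 }];
    let stack = vec![StackOp::Push(1), StackOp::Push(2), StackOp::Add];
    println!("register-based: {} instruction(s)", register.len());
    println!("stack-based:    {} instruction(s)", stack.len());
}
```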
> What chance is there to see a significant performance improvement in a year from now?
I'd say very likely. I have been a bit slow with updates lately as I tend to move back and forth between hobbies. I'm currently picking Gluon back up again, though I have been busy updating and fixing the language server. The next thing would be to rewrite the inliner/constant folder (throwing out https://github.com/gluon-lang/gluon/pull/819 since that ended up too big of a mess), which may take some time but is the only major thing left that I feel Gluon absolutely needs. After that, improving the vm itself would be next on the list.
Thank you for the closure `black_box` tip and caveat. The reason for comparing to Rust in terms of function overhead was to draw attention to Gluon's relevance to games, and often the easiest way to reason about that is "how many Rust function calls am I missing out on by calling scripting language x's function instead?". And you had already offered a decent comparison to Lua. Mun was also on my mind a lot when thinking about this. It is much less feature complete than Gluon, but its function calls are a few times faster than LuaJIT's and very comparable to Rust's. I will look into the source code of their benchmark to get a better idea of how to do this; it's quite possible I'm comparing apples to oranges.
All of the rest sounds very encouraging, thank you for getting into it. Without some information/context it ends up being really hard to judge the usability of a WIP language.
> The virtual machine is also stack based instead of register based which likely penalizes this case quite a bit.
This in particular is pretty interesting to know. But yeah, overall it sounds like nothing in particular other than development time is blocking Gluon from being much faster here. At least, eyeballing it, it seems like it would get much faster before the fact that it's a stack-based VM started to matter.
(Also feel free to close this, since I got the information I wanted.)
> I will look into the source code of their benchmark to get a better idea of how to do this, it's quite possible I'm comparing apples to oranges.
Mun uses LLVM to compile, so it is naturally going to be closer to Rust in this kind of performance. I don't have any direct plans for doing native compilation with Gluon. However, I did do an experiment to JIT with Cranelift, which may be interesting to pursue at some point, but I only took it far enough to compile arithmetic and branching on booleans.