
Rust

Open OvermindDL1 opened this issue 2 years ago • 27 comments

Mostly I'm just adding some Rust code as a time sink so far, so I'm unsure how quickly it will be completed (feel free to PR updates to my fork if anyone wants).

Completed so far is just Towers.

Primarily I'm sending this draft PR now to make sure that I'm following the same pattern, and to catch any other obvious issues. It is not idiomatic Rust whatsoever, especially given the huge amount of needless memory allocation I've done so far in an attempt to duplicate the Java code as closely as possible (all this memory allocation is so painful to me, ^.^).

I'll add more benchmarks as I get occasional free time (or others PR things to my fork).

A couple of things of note: the main file is named bin/harness.rs instead of the usual main.rs, to follow the pattern of the Java code and the contribution recommendations (not standard Rust, though), and there is a .gitignore in the Rust section itself so that only what is actually wanted is included (no build artifacts). It does of course need a recent stable Rust compiler to build (which you can get via rustup), and I'm using no dependencies thus far (much as the usual few are very tempting...).

Clippy, the official Rust linter, is set to pedantic mode and the code is clean of all warnings, per the recommended contribution suggestions. The code is also formatted with the official Rust formatter. Both are invoked through cargo (the official Rust build tool).

To compile it, go to benchmarks/Rust and (with the latest stable Rust installed via rustup or similar) run cargo build --release to build a release version, then run it with target/release/harness; or do it all in one step with cargo run --release -- <argshere>.

OvermindDL1 avatar Mar 28 '22 22:03 OvermindDL1

Since Towers is all that's done so far with the little time I've put in (it was a simple and short one to do), here's how it compares across a few runs against JavaScript, Java, and Python:

❯ Rust/target/release/harness Towers 5 50
Starting Towers benchmark ...
Towers: iterations=1 runtime: 2634us
Towers: iterations=1 runtime: 2596us
Towers: iterations=1 runtime: 2608us
Towers: iterations=1 runtime: 2673us
Towers: iterations=1 runtime: 2647us
Towers: iterations=5 average:2632us total:13161us


Total Runtime: 13161us
❯ java -classpath Java/benchmarks.jar Harness Towers 5 50
Starting Towers benchmark ...
Towers: iterations=1 runtime: 11004us
Towers: iterations=1 runtime: 9299us
Towers: iterations=1 runtime: 3781us
Towers: iterations=1 runtime: 3015us
Towers: iterations=1 runtime: 3159us
Towers: iterations=5 average: 6051us total: 30258us


Total Runtime: 30258us
❯ node JavaScript/harness.js Towers 5 50
Starting Towers benchmark ...
Towers: iterations=1 runtime: 10869us
Towers: iterations=1 runtime: 5546us
Towers: iterations=1 runtime: 5441us
Towers: iterations=1 runtime: 5470us
Towers: iterations=1 runtime: 5474us
Towers: iterations=5 average: 6560us total: 32800us


Total Runtime: 32800us
❯ python Python/harness.py Towers 5 50
Starting Towers benchmark ...
Towers: iterations=1 runtime: 539472us
Towers: iterations=1 runtime: 530249us
Towers: iterations=1 runtime: 525513us
Towers: iterations=1 runtime: 524154us
Towers: iterations=1 runtime: 515153us
Towers: iterations=5 average: 526908us total: 2634541us


Total Runtime: 2634541us

OvermindDL1 avatar Mar 28 '22 22:03 OvermindDL1

Very interesting. Thanks!

I think Rust is going to be a challenge. So far, we only have support for languages with garbage collection, i.e., automated memory management.

While I think that many of the benchmarks will have a stack discipline and could map well to Rust, I am not sure all benchmarks will.

I suspect there will be a need for some careful thought about how to do things, and for weighing the different alternatives/options.

smarr avatar Mar 28 '22 23:03 smarr

Idiomatic Rust projects typically use one of these things as a substitute for garbage collection:

  1. Reference counting: Rc (or Arc)
  2. Arena allocation (primarily bumpalo and typed-arena)
  3. Typed indices into a pre-allocated vector (Vec)
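
As a rough sketch of option 1: with Rc, the count itself is all the "collector" there is (illustrative code only, not from any benchmark; `Disk` is a made-up type):

```rust
use std::rc::Rc;

// A shared value, the kind of thing one writes when mimicking a GC'd
// language. Cloning an Rc bumps a counter; no deep copy happens.
struct Disk {
    size: u32,
}

fn main() {
    let a = Rc::new(Disk { size: 3 });
    let b = Rc::clone(&a); // increments the refcount
    assert_eq!(Rc::strong_count(&a), 2);
    drop(b); // decrements it; the Disk is freed when the count hits 0
    assert_eq!(Rc::strong_count(&a), 1);
    println!("size = {}", a.size);
}
```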

Reference counting is going to be slow but one could view arena allocation or a vector as "cheating".

There are a couple of attempts to get tracing garbage collection into Rust. Aside from rust-gc, they are almost all experimental. Read here for a quick overview.

(NOTE: Despite being reliable and mature, the rust-gc project has similar overhead as reference counting, so there's no performance advantage there. I can give advice on the other projects if you're interested in trying them.)

For now I would recommend using Rc or avoiding benchmarks involving heavy GC.

Using Rc would also have a nice symmetry with C++ std::shared_ptr (pretty much the same implementation).

Unfortunately conventional wisdom puts (pervasive) use of reference counting as far slower than generational GC.

Idiomatic Rust code tends to avoid pervasive use of reference counts, so it doesn't run into this performance problem very much. When it does, it switches to arena allocation or indices into a vector (as mentioned above).

Unfortunately all of this combines to make a direct comparison between reference counting and traditional GC very difficult to do. You're either going to end up with code that is not "idiomatic Rust" or you're going to end up with code that optimizes away all the work 😦

Techcable avatar Mar 29 '22 05:03 Techcable

I think, we will need to work through each benchmark and see what makes sense.

Something like NBody, with a fixed set of bodies seems different than for instance the Json benchmark, where an "unknown" JSON file gets parsed.

smarr avatar Mar 29 '22 06:03 smarr

While I think that many of the benchmarks will have a stack discipline and could map well to Rust, I am not sure all benchmarks will.

That's why I kept the Rust code to a Java way of doing things; it's not idiomatic Rust at all, and there's a ton of needless memory allocation. For Towers you'd generally either pass the disk count as a constant argument to use the stack, or allocate all of the disks up front and index them for full dynamism, which would be a single allocation. That's really only an issue for Towers at these sizes because I was only testing with size 13 (like the other code), which is tiny enough that allocation time dominates. (GCs could theoretically do very well here, since they can internally hoist such allocations to the stack in ways Rust can't; I'm actually rather surprised at how poorly Java performed, and tempted to look into that...) Another Rust approach that is closer to the Java code, but still not entirely idiomatic, would be to make each pile a Vec of disks, which is quite doable considering how tiny the disks are. But right now it duplicates what the Java code does: lots of little allocations (only about 13, but still), going straight through the system allocator with no caching layer. It's as basic, and as close to the Java code, as it can get.

But worst case, the Rust code can just fall back to Rc/Weak smart pointers to mirror the corresponding code in the other styles, so mapping things isn't an issue regardless (as long as you don't insist on idiomatic Rust, which would keep the algorithm the same but could massively invert how data is stored, since Rust favors a more SoA style of programming).
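
For the record, that Rc/Weak fallback looks something like this (a generic sketch with made-up names, not code from the port):

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// A doubly-linked pair, the kind of aliased structure a GC'd language
// writes freely. Rc owns the forward edge; Weak points back so the two
// nodes don't keep each other alive in a reference cycle.
struct Node {
    value: i32,
    next: RefCell<Option<Rc<Node>>>,
    prev: RefCell<Weak<Node>>,
}

fn main() {
    let first = Rc::new(Node {
        value: 1,
        next: RefCell::new(None),
        prev: RefCell::new(Weak::new()),
    });
    let second = Rc::new(Node {
        value: 2,
        next: RefCell::new(None),
        prev: RefCell::new(Rc::downgrade(&first)),
    });
    *first.next.borrow_mut() = Some(Rc::clone(&second));

    // Traverse forward via the Rc, then back via the Weak upgrade.
    let n = first.next.borrow().as_ref().unwrap().value;
    let p = second.prev.borrow().upgrade().unwrap().value;
    println!("next = {n}, prev = {p}"); // next = 2, prev = 1
}
```

The RefCell is the price of this style: Rust moves the aliasing checks the borrow checker would normally do at compile time into runtime checks instead.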

I suspect, there will be need for some careful though of how to do things, and weighing different alternatives/options.

Yeah, that's why I'm just copying the Java for now; optimizations beyond that, or making things idiomatic Rust, can always be done later. But keeping it Java-like at least gives a more direct apples-to-apples performance comparison, even if it's not how one would actually write Rust code.

Reference counting: Rc (or Arc)

The smart pointers, with associated Weak versions too.

arena allocation (primarily bumpalo and typed-arena)

You don't even need to go that fancy; the "usual" Rust way is to just allocate a big vector of things and index into it. Rust really, like really, wants a single ownership point, and consequently it works very well with that pattern.

Typed indices into a pre-allocated vector Vec

Ah yep, basically that. It's trivial to do, no libraries needed, and yes, a newtype for the index makes it easy to never mess up.
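
Concretely, the newtype-index pattern is just this (a sketch; `DiskId`/`Disk`/`Arena` are illustrative names, not from the benchmark):

```rust
// A typed index into a pre-allocated Vec: the index newtype stands in
// for a pointer, and the Vec is the single ownership point.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct DiskId(usize);

struct Disk {
    size: u32,
    below: Option<DiskId>, // "pointer" to the disk underneath
}

struct Arena {
    disks: Vec<Disk>,
}

impl Arena {
    fn push(&mut self, size: u32, below: Option<DiskId>) -> DiskId {
        self.disks.push(Disk { size, below });
        DiskId(self.disks.len() - 1)
    }
    fn get(&self, id: DiskId) -> &Disk {
        &self.disks[id.0]
    }
}

fn main() {
    let mut arena = Arena { disks: Vec::new() };
    let bottom = arena.push(3, None);
    let top = arena.push(2, Some(bottom));
    // Follow the "pointer" by index instead of by reference.
    let below = arena.get(top).below.unwrap();
    assert_eq!(arena.get(below).size, 3);
}
```

Because `DiskId` is its own type, you can't accidentally index the wrong vector with it, which is most of what the newtype buys you.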

Reference counting is going to be slow but one could view arena allocation or a vector as "cheating".

Eh, not really as slow as you'd think. Remember that single-threaded shared pointers in Rust cost a single integer update on cloning and on destruction (plus a comparison against zero when dropping), and nothing at any other time. Multi-threaded shared pointers (are there any multi-threaded benchmarks in this set? I haven't noticed any yet) are the same, just using atomic instructions (or a mutex on architectures without atomics).

There are a couple attempts to get tracing garbage collection in rust. Aside from rust-gc they are almost all experimental. Read here for a quick overview.

No, don't touch those; they break some of Rust's safety invariants, so they are not safe to use (safe in the Rust sense; it's still far safer than, say, anything in Java).

For now I would recommend using Rc or avoiding benchmarks involving heavy GC.

Actually, those are the ones I'm most curious about! If you have a specific recommendation, I'll focus on porting that next. I want to see how an Rc-based, allocation-heavy, non-Rust-style pattern performs in Rust, just for pure curiosity's sake. :-)

Using Rc would also have a nice symmetry with C++ std::shared_ptr (pretty much the same implementation).

Yes, std::shared_ptr is almost identical to how Rc/Arc work (not exactly identical, because Rust can make assumptions thanks to the language's lack of aliasing issues).

Unfortunately conventional wisdom puts (pervasive) use of reference counting as far slower than generational GC.

If you do a lot of cloning of the root (shared) pointers, then yeah, it can have a touch of overhead, but it's only an extra, easily branch-predicted counter update and check, which is basically free on x86. And even then, you don't usually change ownership of the root pointers often anyway, so it's not an issue in the vast majority of cases.

Unfortunately all of this combines to make a direct comparison between reference counting and traditional GC very difficult to do. You're either going to end up with code that is not "idiomatic Rust" or you're going to end up with code that optimizes away all the work 😦

Hence why I'm going with non-idiomatic Rust; I'm more interested in an apples-to-apples comparison for now. We can always make a Rust-idiomatic benchmark set later if someone really wants to.

Something like NBody, with a fixed set of bodies seems different than for instance the Json benchmark, where an "unknown" JSON file gets parsed.

Hmm, the Json one might be a good one for me to port next; it's going to have a ton of tiny allocations regardless. Not idiomatic, but it'll work fine. I might still use an enum instead of trait objects, because Rust isn't built for the down/upcasting that code does (it's not an OOP language), and emulating that would need a LOT of functions doing specific returns for each known variant, which ends up being an enum anyway, just significantly more verbose. This is a benchmark I could see Java doing decently on, though not one that would really exercise its JIT, so I'm unsure if it could be 'faster'; but with tons of tiny allocations, Java could definitely do better than Rust calling the comparatively slow system allocator directly for each one.
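
The enum approach mentioned here would look roughly like this (a sketch with made-up names, not the actual port):

```rust
use std::collections::HashMap;

// A JSON value as a single enum instead of a trait-object class
// hierarchy. Pattern matching replaces the isX()/asX() downcasting
// a Java-style design needs.
#[allow(dead_code)]
enum JsonValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<JsonValue>),
    Object(HashMap<String, JsonValue>),
}

impl JsonValue {
    // The enum equivalent of Java's isNumber()/asNumber() pair.
    fn as_number(&self) -> Option<f64> {
        match self {
            JsonValue::Number(n) => Some(*n),
            _ => None,
        }
    }
}

fn main() {
    let doc = JsonValue::Array(vec![
        JsonValue::Number(1.5),
        JsonValue::Bool(true),
        JsonValue::Null,
    ]);
    if let JsonValue::Array(items) = &doc {
        let sum: f64 = items.iter().filter_map(|v| v.as_number()).sum();
        assert_eq!(sum, 1.5); // only the Number variant contributes
    }
}
```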

OvermindDL1 avatar Mar 29 '22 15:03 OvermindDL1

But yeah, if you think a certain benchmark should be remade to be idiomatic instead of just copying what Java does, please say so; otherwise I'll just copy the Java as closely as feasible.

OvermindDL1 avatar Mar 29 '22 15:03 OvermindDL1

Added the json benchmark, the timings:

❯ Rust/target/release/harness Json 5 50             
Starting Json benchmark ...
Json: iterations=1 runtime: 36172us
Json: iterations=1 runtime: 35525us
Json: iterations=1 runtime: 35747us
Json: iterations=1 runtime: 35324us
Json: iterations=1 runtime: 35346us
Json: iterations=5 average:35623us total:178116us


Total Runtime: 178116us
❯ java -classpath Java/benchmarks.jar Harness Json 5 50
Starting Json benchmark ...
Json: iterations=1 runtime: 118819us
Json: iterations=1 runtime: 101561us
Json: iterations=1 runtime: 59921us
Json: iterations=1 runtime: 60682us
Json: iterations=1 runtime: 92477us
Json: iterations=5 average: 86692us total: 433460us


Total Runtime: 433460us
❯ node JavaScript/harness.js Json 5 50                 
Starting Json benchmark ...
Json: iterations=1 runtime: 115848us
Json: iterations=1 runtime: 95531us
Json: iterations=1 runtime: 90106us
Json: iterations=1 runtime: 93786us
Json: iterations=1 runtime: 92901us
Json: iterations=5 average: 97634us total: 488172us


Total Runtime: 488172us

I'm consistently impressed by node's good showing. I wonder how Deno compares to node on this.


Of course, the Rust code could be vastly improved (at the very least for readability, a little for performance); I'm doing a whole lot of things very inefficiently and verbosely to better copy the Java way of things, but still.

EDIT: Performed a quick perf run on the JSON parsing; the vast majority of the time is indeed spent in the system memory allocator, so I could probably shave upwards of half the time off if that were optimized. But that would require significant code and API changes, so it's definitely not worth it right now.

Any suggestions on the next benchmark to add when I get time?

OvermindDL1 avatar Mar 29 '22 17:03 OvermindDL1

Just out of pure curiosity, I ran all the tests again with tcmalloc preloaded into the processes (I doubt it will do anything for the GC'd languages, but for completeness anyway):

❯ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 Rust/target/release/harness Json 5 50
Starting Json benchmark ...
Json: iterations=1 runtime: 25179us
Json: iterations=1 runtime: 24678us
Json: iterations=1 runtime: 24184us
Json: iterations=1 runtime: 24199us
Json: iterations=1 runtime: 24698us
Json: iterations=5 average:24588us total:122940us


Total Runtime: 122940us

The prior Rust time was 178116us, so a surprisingly substantial change, though considering how badly I'm allocating memory in the Rust code, it makes sense; better memory handling could beat what tcmalloc does anyway.
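
Rather than LD_PRELOAD, Rust can also select the allocator in code via `#[global_allocator]`; a minimal dependency-free sketch using std's `System` allocator (which is the default anyway — allocator crates like the tcmalloc or jemalloc bindings slot in the same way, with their own static type in place of `System`):

```rust
use std::alloc::System;

// Explicitly choose the process-wide allocator. Swapping in a different
// allocator is just a matter of changing this static's type (assuming a
// crate that implements the GlobalAlloc trait).
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // All heap allocation below goes through the allocator chosen above.
    let v: Vec<u64> = (0..1000).collect();
    println!("sum = {}", v.iter().sum::<u64>());
}
```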

❯ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 java -classpath Java/benchmarks.jar Harness Json 5 50
Starting Json benchmark ...
Json: iterations=1 runtime: 117712us
Json: iterations=1 runtime: 101946us
Json: iterations=1 runtime: 60810us
Json: iterations=1 runtime: 61014us
Json: iterations=1 runtime: 90687us
Json: iterations=5 average: 86433us total: 432169us


Total Runtime: 432169us

Prior java time was 433460us, so no real change.

❯ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 node JavaScript/harness.js Json 5 50
Starting Json benchmark ...
Json: iterations=1 runtime: 116426us
Json: iterations=1 runtime: 94598us
Json: iterations=1 runtime: 92799us
Json: iterations=1 runtime: 96691us
Json: iterations=1 runtime: 93219us
Json: iterations=5 average: 98747us total: 493733us


Total Runtime: 493733us

Prior node time was 488172us, so again no real change.

OvermindDL1 avatar Mar 29 '22 17:03 OvermindDL1

And made clippy happy again, which involved one performance-relevant change: boxing an object inside the allocated value instead of storing it inline. That shrank the size of every allocated value even though it adds an allocation, and funnily enough it makes the code even closer to the Java style, since it now performs an allocation it wasn't doing before but that Java does as well. That changed the timings; here it is with and without tcmalloc:
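
The size effect of boxing a large variant can be seen directly (a generic sketch, not the benchmark's actual types):

```rust
use std::mem::size_of;

// An enum is as big as its largest variant (plus the discriminant), so
// one fat variant inflates every value of the type...
#[allow(dead_code)]
enum Inline {
    Small(u8),
    Big([u64; 8]),
}

// ...while boxing the fat variant keeps the enum pointer-sized, at the
// cost of a heap allocation when Big is actually constructed.
#[allow(dead_code)]
enum Boxed {
    Small(u8),
    Big(Box<[u64; 8]>),
}

fn main() {
    println!("inline: {} bytes", size_of::<Inline>());
    println!("boxed:  {} bytes", size_of::<Boxed>());
    assert!(size_of::<Boxed>() < size_of::<Inline>());
}
```

This is exactly the trade-off described above: smaller values everywhere, one more allocation on the fat path. Clippy's `large_enum_variant` lint is what flags this.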

❯ Rust/target/release/harness Json 5 50                                                      
Starting Json benchmark ...
Json: iterations=1 runtime: 21499us
Json: iterations=1 runtime: 21008us
Json: iterations=1 runtime: 21326us
Json: iterations=1 runtime: 21070us
Json: iterations=1 runtime: 21160us
Json: iterations=5 average:21212us total:106064us


Total Runtime: 106064us

And with tcmalloc:

❯ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 Rust/target/release/harness Json 5 50 
Starting Json benchmark ...
Json: iterations=1 runtime: 14263us
Json: iterations=1 runtime: 13936us
Json: iterations=1 runtime: 14813us
Json: iterations=1 runtime: 15601us
Json: iterations=1 runtime: 16927us
Json: iterations=5 average:15108us total:75541us


Total Runtime: 75541us

Quite a sizeable change for both, just from shrinking the JsonValue size.

EDIT: And just to see, I tested compiling with the latest nightly Rust instead of the latest stable; it is consistently about 500us faster even over many runs, but that's technically still within the error bars, so it could be noise regardless of how consistent it is. I'll keep using the latest stable for all reported values, though; no beta or nightly.

OvermindDL1 avatar Mar 29 '22 18:03 OvermindDL1

Went ahead and quickly added the NBody benchmark this morning, clippy is happy, results:

❯ Rust/target/release/harness NBody 5 250000
Starting NBody benchmark ...
NBody: iterations=1 runtime: 25846us
NBody: iterations=1 runtime: 26132us
NBody: iterations=1 runtime: 26073us
NBody: iterations=1 runtime: 26288us
NBody: iterations=1 runtime: 26278us
NBody: iterations=5 average:26124us total:130620us


Total Runtime: 130620us
❯ java -classpath Java/benchmarks.jar Harness NBody 5 250000
Starting NBody benchmark ...
NBody: iterations=1 runtime: 37444us
NBody: iterations=1 runtime: 28579us
NBody: iterations=1 runtime: 28294us
NBody: iterations=1 runtime: 28348us
NBody: iterations=1 runtime: 28375us
NBody: iterations=5 average: 30208us total: 151040us


Total Runtime: 151040us
❯ node JavaScript/harness.js NBody 5 250000
Starting NBody benchmark ...
NBody: iterations=1 runtime: 43579us
NBody: iterations=1 runtime: 36293us
NBody: iterations=1 runtime: 37564us
NBody: iterations=1 runtime: 31052us
NBody: iterations=1 runtime: 31581us
NBody: iterations=5 average: 36014us total: 180069us


Total Runtime: 180069us

OvermindDL1 avatar Mar 30 '22 15:03 OvermindDL1

Just as an aside, I did acquire Deno, but it looks like the existing JavaScript benchmark calls node-isms instead of the standard APIs, so it would need a little massaging (doesn't seem like it would take much, though). If a Deno target is added, the JavaScript one should probably be rewritten as a fully-typed TypeScript benchmark, since Deno is a TypeScript runner (made by the same people who made node; its eventual goal is to replace node) and it optimizes that case.

OvermindDL1 avatar Mar 30 '22 16:03 OvermindDL1

So... Storage looks like a memory allocation horror, and it's short as well, let's do it next!

And done and clippy is happy, results:

❯ Rust/target/release/harness Storage 5 150 
Starting Storage benchmark ...
Storage: iterations=1 runtime: 21716us
Storage: iterations=1 runtime: 21039us
Storage: iterations=1 runtime: 21181us
Storage: iterations=1 runtime: 21229us
Storage: iterations=1 runtime: 21347us
Storage: iterations=5 average:21302us total:106514us


Total Runtime: 106514us
❯ java -classpath Java/benchmarks.jar Harness Storage 5 150                                                
Starting Storage benchmark ...
Storage: iterations=1 runtime: 35176us
Storage: iterations=1 runtime: 17203us
Storage: iterations=1 runtime: 24834us
Storage: iterations=1 runtime: 24967us
Storage: iterations=1 runtime: 24664us
Storage: iterations=5 average: 25368us total: 126844us


Total Runtime: 126844us
❯ node JavaScript/harness.js Storage 5 150 
Starting Storage benchmark ...
Storage: iterations=1 runtime: 29177us
Storage: iterations=1 runtime: 22583us
Storage: iterations=1 runtime: 24492us
Storage: iterations=1 runtime: 22750us
Storage: iterations=1 runtime: 25908us
Storage: iterations=5 average: 24982us total: 124910us


Total Runtime: 124910us

And Rust with tcmalloc just for good measure:

❯ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 Rust/target/release/harness Storage 5 150
Starting Storage benchmark ...
Storage: iterations=1 runtime: 14289us
Storage: iterations=1 runtime: 14073us
Storage: iterations=1 runtime: 14734us
Storage: iterations=1 runtime: 14186us
Storage: iterations=1 runtime: 14074us
Storage: iterations=5 average:14271us total:71358us


Total Runtime: 71358us

This is utterly, absolutely not idiomatic Rust; it's very much Java-in-Rust, lol, but it should be apples to apples. I'm impressed by node here; it did better than Java!

OvermindDL1 avatar Mar 30 '22 16:03 OvermindDL1

And added Sieve, since it was also very quick to do; there are no real memory allocations in it, it's all just 'work'. Clippy is happy. Results:

❯ Rust/target/release/harness Sieve 5 500                
Starting Sieve benchmark ...
Sieve: iterations=1 runtime: 8839us
Sieve: iterations=1 runtime: 8738us
Sieve: iterations=1 runtime: 8890us
Sieve: iterations=1 runtime: 8868us
Sieve: iterations=1 runtime: 8838us
Sieve: iterations=5 average:8835us total:44176us


Total Runtime: 44176us
❯ java -classpath Java/benchmarks.jar Harness Sieve 5 500      
Starting Sieve benchmark ...
Sieve: iterations=1 runtime: 26986us
Sieve: iterations=1 runtime: 8576us
Sieve: iterations=1 runtime: 8414us
Sieve: iterations=1 runtime: 8138us
Sieve: iterations=1 runtime: 8657us
Sieve: iterations=5 average: 12154us total: 60771us


Total Runtime: 60771us
❯ node JavaScript/harness.js Sieve 5 500                
Starting Sieve benchmark ...
Sieve: iterations=1 runtime: 38318us
Sieve: iterations=1 runtime: 33700us
Sieve: iterations=1 runtime: 33777us
Sieve: iterations=1 runtime: 33516us
Sieve: iterations=1 runtime: 33839us
Sieve: iterations=5 average: 34630us total: 173150us


Total Runtime: 173150us

I'm honestly impressed at how poorly node did here... >.>
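
For reference, the kernel of a Sieve benchmark like this is just the classic sieve of Eratosthenes over a flags array; a minimal sketch of that shape (sizes here are illustrative, not the harness's actual parameters):

```rust
// Count primes <= flags.len() by crossing off multiples in a flags
// array; this is the allocation-free "pure work" loop described above.
fn sieve(flags: &mut [bool]) -> usize {
    let mut prime_count = 0;
    for i in 2..=flags.len() {
        if flags[i - 1] {
            prime_count += 1;
            let mut k = i + i;
            while k <= flags.len() {
                flags[k - 1] = false; // cross off this multiple of i
                k += i;
            }
        }
    }
    prime_count
}

fn main() {
    let mut flags = vec![true; 100];
    let count = sieve(&mut flags);
    println!("primes <= 100: {count}"); // primes <= 100: 25
}
```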

OvermindDL1 avatar Mar 30 '22 16:03 OvermindDL1

Decided to tackle a huge one for this next one, so chose Richards.

Oh... oh wow... this... is probably the least idiomatic Rust so far. Soooooo many shared pointers, such a lack of proper ownership, so many reference counts, and oh, the cloning; I see the horror of cloning everywhere, lol.

But to sum it up: again programmed to copy the Java one, it is exceptionally non-idiomatic, and I really, really resisted fixing a lot of the code to actually be ownership-aware and fast. On the plus side, I found a bug in the Java version of the code (it doesn't change the output; it's purely a redundant double-check of whether something is null, for some weird reason...). So, the results of this reference-counted shared-pointer horror:

❯ Rust/target/release/harness Richards 5 50
Starting Richards benchmark ...
Richards: iterations=1 runtime: 56736us
Richards: iterations=1 runtime: 56431us
Richards: iterations=1 runtime: 56884us
Richards: iterations=1 runtime: 55852us
Richards: iterations=1 runtime: 55851us
Richards: iterations=5 average: 56351us total:281756us


Total Runtime: 281756us
❯ java -classpath Java/benchmarks.jar Harness Richards 5 50         
Starting Richards benchmark ...
Richards: iterations=1 runtime: 103699us
Richards: iterations=1 runtime: 78204us
Richards: iterations=1 runtime: 79203us
Richards: iterations=1 runtime: 79454us
Richards: iterations=1 runtime: 80801us
Richards: iterations=5 average: 84272us total: 421361us


Total Runtime: 421361us

❯ node JavaScript/harness.js Richards 5 50                 
Starting Richards benchmark ...
Richards: iterations=1 runtime: 155446us
Richards: iterations=1 runtime: 123969us
Richards: iterations=1 runtime: 124509us
Richards: iterations=1 runtime: 123459us
Richards: iterations=1 runtime: 123907us
Richards: iterations=5 average: 130258us total: 651290us


Total Runtime: 651290us

OvermindDL1 avatar Mar 31 '22 21:03 OvermindDL1

And just for completeness, tcmalloc didn't make any appreciable change (there's not much allocating going on, mostly just shared-pointer management):

❯ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 Rust/target/release/harness Richards 5 50
Starting Richards benchmark ...
Richards: iterations=1 runtime: 56808us
Richards: iterations=1 runtime: 56458us
Richards: iterations=1 runtime: 56142us
Richards: iterations=1 runtime: 56542us
Richards: iterations=1 runtime: 55691us
Richards: iterations=5 average: 56328us total:281643us


Total Runtime: 281643us

A perf run does indeed show that a lot of the time is spent in bookkeeping of pointers and memory; it could probably be a good 20% faster if that were done differently, just to start with.

OvermindDL1 avatar Mar 31 '22 21:03 OvermindDL1

@OvermindDL1 thanks for putting all the effort into this. Please don't take my silence as an absence of interest. It's just end-of-term time here, so a pretty busy time of the year...

smarr avatar Mar 31 '22 21:03 smarr

@smarr Oh no worry, I'm doing this just for fun, it's a nice break from my usual work as I get some free time. :-)

OvermindDL1 avatar Mar 31 '22 22:03 OvermindDL1

Went ahead and added the Queens benchmark as well as it was extremely simple, results:

❯ Rust/target/release/harness Queens 5 500
Starting Queens benchmark ...
Queens: iterations=1 runtime: 5644us
Queens: iterations=1 runtime: 5452us
Queens: iterations=1 runtime: 5574us
Queens: iterations=1 runtime: 5520us
Queens: iterations=1 runtime: 5431us
Queens: iterations=5 average: 5524us total:27623us


Total Runtime: 27623us
❯ java -classpath Java/benchmarks.jar Harness Queens 5 500
Starting Queens benchmark ...
Queens: iterations=1 runtime: 24963us
Queens: iterations=1 runtime: 14836us
Queens: iterations=1 runtime: 16239us
Queens: iterations=1 runtime: 15231us
Queens: iterations=1 runtime: 15090us
Queens: iterations=5 average: 17271us total: 86359us


Total Runtime: 86359us
❯ node JavaScript/harness.js Queens 5 500
Starting Queens benchmark ...
Queens: iterations=1 runtime: 43868us
Queens: iterations=1 runtime: 36680us
Queens: iterations=1 runtime: 36755us
Queens: iterations=1 runtime: 36444us
Queens: iterations=1 runtime: 35992us
Queens: iterations=5 average: 37948us total: 189739us


Total Runtime: 189739us

Again, I'd have figured V8/node's JIT would optimize simple math like this better...

OvermindDL1 avatar Mar 31 '22 22:03 OvermindDL1

And added Permute, another super simple one, the results:

❯ Rust/target/release/harness Permute 5 500
Starting Permute benchmark ...
Permute: iterations=1 runtime: 15935us
Permute: iterations=1 runtime: 15860us
Permute: iterations=1 runtime: 15960us
Permute: iterations=1 runtime: 16043us
Permute: iterations=1 runtime: 15945us
Permute: iterations=5 average: 15949us total:79745us


Total Runtime: 79745us
❯ java -classpath Java/benchmarks.jar Harness Permute 5 500
Starting Permute benchmark ...
Permute: iterations=1 runtime: 25438us
Permute: iterations=1 runtime: 18439us
Permute: iterations=1 runtime: 18734us
Permute: iterations=1 runtime: 18085us
Permute: iterations=1 runtime: 18539us
Permute: iterations=5 average: 19847us total: 99235us


Total Runtime: 99235us
❯ node JavaScript/harness.js Permute 5 500
Starting Permute benchmark ...
Permute: iterations=1 runtime: 34046us
Permute: iterations=1 runtime: 31025us
Permute: iterations=1 runtime: 31057us
Permute: iterations=1 runtime: 31045us
Permute: iterations=1 runtime: 30197us
Permute: iterations=5 average: 31474us total: 157370us


Total Runtime: 157370us

OvermindDL1 avatar Mar 31 '22 22:03 OvermindDL1

And adding Mandelbrot, I'd expect all to perform pretty equally here, it's just all easily optimized math:

❯ Rust/target/release/harness Mandelbrot 5 750
Starting Mandelbrot benchmark ...
Mandelbrot: iterations=1 runtime: 114109us
Mandelbrot: iterations=1 runtime: 113950us
Mandelbrot: iterations=1 runtime: 114004us
Mandelbrot: iterations=1 runtime: 114195us
Mandelbrot: iterations=1 runtime: 113769us
Mandelbrot: iterations=5 average: 114005us total:570029us


Total Runtime: 570029us
❯ java -classpath Java/benchmarks.jar Harness Mandelbrot 5 750
Starting Mandelbrot benchmark ...
Mandelbrot: iterations=1 runtime: 184574us
Mandelbrot: iterations=1 runtime: 179949us
Mandelbrot: iterations=1 runtime: 130062us
Mandelbrot: iterations=1 runtime: 179491us
Mandelbrot: iterations=1 runtime: 179937us
Mandelbrot: iterations=5 average: 170802us total: 854013us


Total Runtime: 854013us
❯ node JavaScript/harness.js Mandelbrot 5 750
Starting Mandelbrot benchmark ...
Mandelbrot: iterations=1 runtime: 116836us
Mandelbrot: iterations=1 runtime: 112234us
Mandelbrot: iterations=1 runtime: 114767us
Mandelbrot: iterations=1 runtime: 112838us
Mandelbrot: iterations=1 runtime: 114834us
Mandelbrot: iterations=5 average: 114302us total: 571509us


Total Runtime: 571509us

Node and Rust are about equal, but even after many runs of the Java one (I always run many of each and take the best, and I ran the Java one about 10 times more than that), it seemed stuck at that surprisingly high value; still Java 17.0.1...

Grabbed Java 18 just to see... no, still the same values, how odd...

OvermindDL1 avatar Apr 08 '22 21:04 OvermindDL1

And now doing the List one.

...ow, this hurts me to write. This is not just non-idiomatic Rust, it's so inefficient, lol... all the needless memory allocations and shared pointers and all... ^.^;

Well, the results of this amazingly reference-counted shared-pointer multiple-ownership horror, lol:

❯ Rust/target/release/harness List 5 750
Starting List benchmark ...
List: iterations=1 runtime: 19536us
List: iterations=1 runtime: 19455us
List: iterations=1 runtime: 19477us
List: iterations=1 runtime: 19410us
List: iterations=1 runtime: 19580us
List: iterations=5 average: 19492us total:97461us


Total Runtime: 97461us

(Still using Java 18 for note since I updated it in the last message)

❯ java -classpath Java/benchmarks.jar Harness List 5 750
Starting List benchmark ...
List: iterations=1 runtime: 28507us
List: iterations=1 runtime: 21232us
List: iterations=1 runtime: 21681us
List: iterations=1 runtime: 21281us
List: iterations=1 runtime: 21227us
List: iterations=5 average: 22785us total: 113928us


Total Runtime: 113928us
❯ node JavaScript/harness.js List 5 750
Starting List benchmark ...
List: iterations=1 runtime: 37219us
List: iterations=1 runtime: 33721us
List: iterations=1 runtime: 32754us
List: iterations=1 runtime: 32395us
List: iterations=1 runtime: 32519us
List: iterations=5 average: 33722us total: 168608us


Total Runtime: 168608us

OvermindDL1 avatar Apr 08 '22 21:04 OvermindDL1

And the last short one (3 big ones left after this) as I'm just about out of time for today is Bounce, it's just pure math too so I'd expect them all to be about equal, results:

❯ Rust/target/release/harness Bounce 5 750                
Starting Bounce benchmark ...
Bounce: iterations=1 runtime: 14314us
Bounce: iterations=1 runtime: 14745us
Bounce: iterations=1 runtime: 14758us
Bounce: iterations=1 runtime: 14743us
Bounce: iterations=1 runtime: 17703us
Bounce: iterations=5 average: 15253us total:76265us


Total Runtime: 76265us
❯ java -classpath Java/benchmarks.jar Harness Bounce 5 750
Starting Bounce benchmark ...
Bounce: iterations=1 runtime: 33581us
Bounce: iterations=1 runtime: 16207us
Bounce: iterations=1 runtime: 17237us
Bounce: iterations=1 runtime: 16462us
Bounce: iterations=1 runtime: 21529us
Bounce: iterations=5 average: 21003us total: 105016us


Total Runtime: 105016us
❯ node JavaScript/harness.js Bounce 5 750
Starting Bounce benchmark ...
Bounce: iterations=1 runtime: 37600us
Bounce: iterations=1 runtime: 23069us
Bounce: iterations=1 runtime: 23056us
Bounce: iterations=1 runtime: 23465us
Bounce: iterations=1 runtime: 22913us
Bounce: iterations=5 average: 26021us total: 130103us


Total Runtime: 130103us

Huh, that's a much larger variance than I anticipated... I wonder why. The bouncing timings of the Java run were odd too, but I ran that one a few dozen times as well and the same pattern held for all of them; the above was the best time of those, as always.

OvermindDL1 avatar Apr 08 '22 22:04 OvermindDL1

I tried to work on Havlak a bit yesterday but, uh... I think there are a couple of bugs in it, and Rust dislikes them enough that it refuses to compile (Rust really, really doesn't like aliasing bugs), so I'm going to have to restructure it a touch before it will build. Although the Java code it's based on 'works' for the benchmark, it's broken enough that it wouldn't really hold up if used for anything beyond the simple benchmark. So this will be the port most 'different' from the Java, though it will still be extremely non-idiomatic Rust, with plenty of needless allocations and questionably buggy Java-like constructs. (If I were doing it idiomatically, I'd grab the petgraph library and implement the loop detector on top of that.) I might get time today to work on it more.

As an aside, since I had to implement more of the som collections like Dictionary and so forth: wow, these are inefficient! I did a quick benchmark comparing Dictionary/Set/etc. against the stock Rust standard-library versions (which have all the same functionality and a lot more), and the speed difference is astounding; the som collections are very slow. From perfing it, the cause seems to be their extremely inefficient memory access patterns. I'm still using them in the benchmarks rather than the Rust built-ins, though, for consistency with the other languages.

OvermindDL1 avatar Apr 12 '22 14:04 OvermindDL1

it's broken enough that if it were used for anything more than the simple benchmark then it wouldn't really work as it is.

Anything specific that might be worth fixing in the other languages, too? This code was originally taken from a Go vs Java vs X paper.

wow these are inefficient

Yes, indeed they are. And some of the benchmarks degenerate into benchmarking these collections, unfortunately. Though, it's really hard to achieve anything comparable across languages, too. So, there's a tradeoff between portability, predictability, comparability, and the amount of code that one needs to port.

If someone has an idea of how to improve things in a portable way, without starting to rely on non-deterministic things (i.e., we can't rely on anything like hash values derived from memory locations, etc.), I am very open to improvements :)

smarr avatar Apr 12 '22 14:04 smarr

Anything specific that might be worth fixing in the other languages, too?

The main issues are aliasing issues. The benchmark uses the code in 'just' such a way that they never bite, but if this were an actual library for production use it would be easy to accidentally put it into a bad state. I'm not sure it's worth worrying about right now, since fixing it would involve a little extra code, which would slow things down (barely, but still). Rust can't take that shortcut, though: the code must be safe and correct, so it has some more checks. (My checks mostly just kill the program if something bad happens, since I have to do "something"; that's sufficient for the benchmark because those paths aren't hit.)

And some of the benchmarks degenerate into benchmarking these collections, unfortunately.

Uh... yeah. In Havlak, in both Java and Rust, the som.Set collection, which is just a som.Vector that gets iterated on every single insert, absolutely DOMINATES the time in my profilers at higher loop counts. Let me check again: 76.58% of the time in the Rust version of Havlak is spent in the contains check inside the som.Set.add call. I'm thinking someone probably meant to back Set with a Dictionary or something (the som.Dictionary is pretty bad on its own, but it's better than repeatedly iterating a giant Vector a quadratic number of times) instead of a Vector, because this makes no sense whatsoever...

If someone has an idea of how to improve things in a portable way, without starting to rely on non-deterministic things, i.e., we can't rely on anything like hash values derived from memory locations etc, I am very open to improvements :)

Honestly, I'm quite happy using each language's standard collections; they're what's most often used, after all, so they're what people would expect. However, for consistency, just preallocating big arrays ahead of time (since we know the maximum possible runtime size of these things) seems much better, plus the occasional helper function to do something like pick a bucket for a set/dictionary within such an array.
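In Rust terms, the preallocation idea is just Vec::with_capacity with a known upper bound; a tiny sketch (the bound and function name here are illustrative, not from the benchmark sources):

```rust
// Sketch: reserving capacity up front means the hot loop never
// reallocates, since the maximum size is known before the run.
fn build_node_ids(max_nodes: usize) -> Vec<u32> {
    let mut nodes = Vec::with_capacity(max_nodes);
    for i in 0..max_nodes as u32 {
        nodes.push(i); // no reallocation: capacity was reserved above
    }
    nodes
}
```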

At the very least, som.Set needs to not use a vector; that's horrifying... >.> Maybe just get rid of som.Set and use som.Dictionary with a who-cares value instead; that's usually how Sets are implemented (it's precisely how Rust's HashSet is defined in the standard library: as the standard HashMap with an empty value type).
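For reference, the map-backed Set shape is roughly this (a minimal sketch over Rust's std HashMap with made-up names, not the som code): add and contains become average O(1) hash lookups instead of a linear vector scan on every insert.

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Hypothetical sketch: a set backed by a map with unit values,
// mirroring how std's HashSet<T> wraps HashMap<T, ()>.
struct SimpleSet<T: Eq + Hash> {
    items: HashMap<T, ()>,
}

impl<T: Eq + Hash> SimpleSet<T> {
    fn new() -> Self {
        SimpleSet { items: HashMap::new() }
    }

    // O(1) average insert; duplicates are absorbed by the map.
    fn add(&mut self, value: T) {
        self.items.insert(value, ());
    }

    // O(1) average membership test, vs. O(n) for a vector scan.
    fn contains(&self, value: &T) -> bool {
        self.items.contains_key(value)
    }

    fn size(&self) -> usize {
        self.items.len()
    }
}
```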

If you care about hashing consistency, though, I'd be surprised if any language lacked a way to plug in custom hash functions; I know both Java and Rust support them in their standard libraries at least, and for the few languages that don't, I guess a custom hash map could be defined instead. Using a custom hasher in Rust is pretty common, since the standard-library default is a cryptographically seeded hasher (randomized to prevent DoS attacks via crafted keys), and most uses of a hash map don't need that security, so people often select a different hasher than the default seeded one.
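To show what swapping the hasher looks like in Rust (a deliberately trivial, deterministic hasher; the name and folding scheme are made up for the sketch, and it is not DoS-resistant like the default SipHash):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical deterministic hasher: folds bytes into a u64 so
// hash values are reproducible across runs, unlike the randomly
// seeded standard-library default.
#[derive(Default)]
struct TrivialHasher(u64);

impl Hasher for TrivialHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 = self.0.wrapping_mul(31).wrapping_add(u64::from(b));
        }
    }
}

// A HashMap that uses the custom hasher instead of the default.
type DeterministicMap<K, V> = HashMap<K, V, BuildHasherDefault<TrivialHasher>>;
```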

OvermindDL1 avatar Apr 12 '22 19:04 OvermindDL1

Just as a test, I changed the Rust som.Set here from being implemented on top of a som.Vector to using a som.Dictionary with a nothing value, and that took the Set's share of the runtime at 15000 loops from dominating at over 76% down to 0.29%. So at least that change should be made in every benchmark, especially given how simple it is, in decent languages at least. ^.^

OvermindDL1 avatar Apr 12 '22 20:04 OvermindDL1

And I changed my local Java as well to also use a Dictionary inside the Set (and fixed the CustomHash implementation, which was very non-standard Java and so impossible to use with uncontrolled types; fixed here), and it had a good drop in times too.

With using a Dictionary as the internal Set implementation now the times for the Rust and Java ones are vastly more logical, and thus are:

❯ Rust/target/release/harness Havlak 5 15000
Starting Havlak benchmark ...
Havlak: iterations=1 runtime: 287109us
Havlak: iterations=1 runtime: 287246us
Havlak: iterations=1 runtime: 278570us
Havlak: iterations=1 runtime: 278709us
Havlak: iterations=1 runtime: 277514us
Havlak: iterations=5 average: 281830us total: 1409150us


Total Runtime: 1409150us
❯ java -classpath Java/benchmarks.jar Harness Havlak 5 15000
Starting Havlak benchmark ...
Havlak: iterations=1 runtime: 520327us
Havlak: iterations=1 runtime: 270325us
Havlak: iterations=1 runtime: 172282us
Havlak: iterations=1 runtime: 210959us
Havlak: iterations=1 runtime: 237777us
Havlak: iterations=5 average: 282334us total: 1411670us


Total Runtime: 1411670us

Essentially identical, which is about what I would expect considering the work being done. That Set implementation was so bad... ^.^;

I'm not committing this, though, because I changed things that are not Rust, even though I do think these are changes that need to be made.

Here's the patch for the changed Java code. I'm unsure if it's "complete", but it was enough to compile everything and run the benchmarks faster (it probably needs a formatter pass run over it; I just quick-edited it in vim):

diff --git a/benchmarks/Java/src/deltablue/Strength.java b/benchmarks/Java/src/deltablue/Strength.java
index 0599392..7953a04 100644
--- a/benchmarks/Java/src/deltablue/Strength.java
+++ b/benchmarks/Java/src/deltablue/Strength.java
@@ -19,7 +19,7 @@ import som.IdentityDictionary;
  */
 public final class Strength {
 
-  public static class Sym implements CustomHash {
+  public static class Sym {
 
     private final int hash;
 
@@ -27,7 +27,6 @@ public final class Strength {
       this.hash = hash;
     }
 
-    @Override
     public int customHash() {
       return hash;
     }
@@ -80,7 +79,7 @@ public final class Strength {
   }
 
   private static IdentityDictionary<Sym, Integer> createStrengthTable() {
-    IdentityDictionary<Sym, Integer> strengthTable = new IdentityDictionary<>();
+    IdentityDictionary<Sym, Integer> strengthTable = new IdentityDictionary<>(obj -> obj.customHash());
     strengthTable.atPut(ABSOLUTE_STRONGEST, -10000);
     strengthTable.atPut(REQUIRED,           -800);
     strengthTable.atPut(STRONG_PREFERRED,   -600);
@@ -93,7 +92,7 @@ public final class Strength {
   }
 
   private static IdentityDictionary<Sym, Strength> createStrengthConstants() {
-    IdentityDictionary<Sym, Strength> strengthConstant = new IdentityDictionary<>();
+    IdentityDictionary<Sym, Strength> strengthConstant = new IdentityDictionary<>(obj -> obj.customHash());
     strengthTable.getKeys().forEach(key ->
       strengthConstant.atPut(key, new Strength(key))
     );
diff --git a/benchmarks/Java/src/havlak/BasicBlock.java b/benchmarks/Java/src/havlak/BasicBlock.java
index 2154af1..7d01847 100644
--- a/benchmarks/Java/src/havlak/BasicBlock.java
+++ b/benchmarks/Java/src/havlak/BasicBlock.java
@@ -24,7 +24,7 @@ import som.Vector;
  *
  * @author rhundt
  */
-final class BasicBlock implements CustomHash {
+final class BasicBlock  {
 
   private final Vector<BasicBlock> inEdges;
   private final Vector<BasicBlock> outEdges;
@@ -56,7 +56,6 @@ final class BasicBlock implements CustomHash {
     inEdges.append(from);
   }
 
-  @Override
   public int customHash() {
     return name;
   }
diff --git a/benchmarks/Java/src/havlak/HavlakLoopFinder.java b/benchmarks/Java/src/havlak/HavlakLoopFinder.java
index 2eea9a2..98c1514 100644
--- a/benchmarks/Java/src/havlak/HavlakLoopFinder.java
+++ b/benchmarks/Java/src/havlak/HavlakLoopFinder.java
@@ -37,7 +37,7 @@ final class HavlakLoopFinder {
 
   private final Vector<Set<Integer>>  nonBackPreds = new Vector<Set<Integer>>();
   private final Vector<Vector<Integer>> backPreds  = new Vector<>();
-  private final IdentityDictionary<BasicBlock, Integer> number = new IdentityDictionary<>();
+  private final IdentityDictionary<BasicBlock, Integer> number = new IdentityDictionary<>(obj -> obj.customHash());
   private int                      maxSize = 0;
   private int[]                    header;
   private BasicBlockClass[]        type;
@@ -181,7 +181,7 @@ final class HavlakLoopFinder {
     }
 
     for (int i = 0; i < size; ++i) {
-      nonBackPreds.append(new Set<>());
+      nonBackPreds.append(new Set<>(obj -> obj));
       backPreds.append(new Vector<>());
       nodes[i] = new UnionFindNode();
     }
diff --git a/benchmarks/Java/src/havlak/SimpleLoop.java b/benchmarks/Java/src/havlak/SimpleLoop.java
index ddb3ffd..03a536c 100644
--- a/benchmarks/Java/src/havlak/SimpleLoop.java
+++ b/benchmarks/Java/src/havlak/SimpleLoop.java
@@ -37,6 +37,8 @@ import som.IdentitySet;
  * a candidate for transformations, and what not.
  */
 final class SimpleLoop {
+  private static int uid = 0;
+  private final int id;
 
   private final IdentitySet<BasicBlock> basicBlocks;
   private final IdentitySet<SimpleLoop> children;
@@ -53,13 +55,15 @@ final class SimpleLoop {
 
 
   SimpleLoop(final BasicBlock bb, final boolean isReducible) {
+    id = uid;
+    uid++;
     this.isReducible = isReducible;
     parent = null;
     isRoot = false;
     nestingLevel = 0;
     depthLevel   = 0;
-    basicBlocks  = new IdentitySet<>();
-    children     = new IdentitySet<>();
+    basicBlocks  = new IdentitySet<>(obj -> obj.customHash());
+    children     = new IdentitySet<>(obj -> obj.id);
 
     if (bb != null) {
       basicBlocks.add(bb);
diff --git a/benchmarks/Java/src/som/Dictionary.java b/benchmarks/Java/src/som/Dictionary.java
index 01bebd2..7d6ebcc 100644
--- a/benchmarks/Java/src/som/Dictionary.java
+++ b/benchmarks/Java/src/som/Dictionary.java
@@ -25,16 +25,17 @@ package som;
 import som.Dictionary.CustomHash;
 
 
-public class Dictionary<K extends CustomHash, V> {
+public class Dictionary<K, V> {
 
-  public interface CustomHash {
-    int customHash();
+  public interface CustomHash<T> {
+    int customHash(T obj);
   }
 
   protected static final int INITIAL_CAPACITY = 16;
 
   private Entry<K, V>[] buckets;
-  private int          size;
+  private int           size;
+  private CustomHash<K> hasher;
 
   static class Entry<K, V> {
 
@@ -56,19 +57,20 @@ public class Dictionary<K extends CustomHash, V> {
   }
 
   @SuppressWarnings("unchecked")
-  public Dictionary(final int size) {
+  public Dictionary(final int size, CustomHash<K> hasher) {
     this.buckets = new Entry[size];
+    this.hasher = hasher;
   }
 
-  public Dictionary() {
-    this(INITIAL_CAPACITY);
+  public Dictionary(CustomHash<K> hasher) {
+    this(INITIAL_CAPACITY, hasher);
   }
 
-  private static <K extends CustomHash> int hash(final K key) {
+  private int hash(final K key) {
     if (key == null) {
       return 0;
     }
-    int hash = key.customHash();
+    int hash = hasher.customHash(key);
     return hash ^ hash >>> 16;
   }
 
@@ -244,4 +246,14 @@ public class Dictionary<K extends CustomHash, V> {
     }
     return values;
   }
+
+  public void forEachKey(final ForEachInterface<K> fn) {
+    for (int i = 0; i < buckets.length; ++i) {
+      Entry<K, V> current = buckets[i];
+      while (current != null) {
+        fn.apply(current.key);
+        current = current.next;
+      }
+    }
+  }
 }
diff --git a/benchmarks/Java/src/som/IdentityDictionary.java b/benchmarks/Java/src/som/IdentityDictionary.java
index 7d49b50..7f0d5d6 100644
--- a/benchmarks/Java/src/som/IdentityDictionary.java
+++ b/benchmarks/Java/src/som/IdentityDictionary.java
@@ -3,7 +3,7 @@ package som;
 import som.Dictionary.CustomHash;
 
 
-public class IdentityDictionary<K extends CustomHash, V> extends Dictionary<K, V> {
+public class IdentityDictionary<K, V> extends Dictionary<K, V> {
 
   static class IdEntry<K, V> extends Entry<K, V> {
     IdEntry(final int hash, final K key, final V value, final Entry<K, V> next) {
@@ -16,12 +16,12 @@ public class IdentityDictionary<K extends CustomHash, V> extends Dictionary<K, V
     }
   }
 
-  public IdentityDictionary(final int size) {
-    super(size);
+  public IdentityDictionary(final int size, Dictionary.CustomHash<K> hasher) {
+    super(size, hasher);
   }
 
-  public IdentityDictionary() {
-    super(INITIAL_CAPACITY);
+  public IdentityDictionary(Dictionary.CustomHash<K> hasher) {
+    super(INITIAL_CAPACITY, hasher);
   }
 
   @Override
diff --git a/benchmarks/Java/src/som/IdentitySet.java b/benchmarks/Java/src/som/IdentitySet.java
index a699d4a..7ab0872 100644
--- a/benchmarks/Java/src/som/IdentitySet.java
+++ b/benchmarks/Java/src/som/IdentitySet.java
@@ -25,16 +25,16 @@ package som;
 
 public final class IdentitySet<E> extends Set<E> {
 
-  public IdentitySet() {
-    super();
+  public IdentitySet(Dictionary.CustomHash<E> hasher) {
+    super(hasher);
   }
 
-  public IdentitySet(final int size) {
-    super(size);
+  public IdentitySet(final int size, Dictionary.CustomHash<E> hasher) {
+    super(size, hasher);
   }
 
-  @Override
-  public boolean contains(final E obj) {
-    return hasSome(e -> { return e == obj; });
-  }
+//  @Override
+//  public boolean contains(final E obj) {
+//    return hasSome(e -> { return e == obj; });
+//  }
 }
diff --git a/benchmarks/Java/src/som/Set.java b/benchmarks/Java/src/som/Set.java
index 8e52468..78ead9d 100644
--- a/benchmarks/Java/src/som/Set.java
+++ b/benchmarks/Java/src/som/Set.java
@@ -23,14 +23,14 @@
 package som;
 
 public class Set<E> {
-  private final Vector<E> items;
+  private final Dictionary<E, Void> items;
 
-  public Set() {
-    this(Constants.INITIAL_SIZE);
+  public Set(Dictionary.CustomHash<E> hasher) {
+    this(Constants.INITIAL_SIZE, hasher);
   }
 
-  public Set(final int size) {
-    items = new Vector<E>(size);
+  public Set(final int size, Dictionary.CustomHash<E> hasher) {
+    items = new Dictionary<E, Void>(size, hasher);
   }
 
   public int size() {
@@ -38,21 +38,11 @@ public class Set<E> {
   }
 
   public void forEach(final ForEachInterface<E> fn) {
-    items.forEach(fn);
-  }
-
-  public boolean hasSome(final TestInterface<E> fn) {
-    return items.hasSome(fn);
-  }
-
-  public E getOne(final TestInterface<E> fn) {
-    return items.getOne(fn);
+    items.forEachKey(fn);
   }
 
   public void add(final E obj) {
-    if (!contains(obj)) {
-      items.append(obj);
-    }
+    items.atPut(obj, null);
   }
 
   public <T> Vector<T> collect(final CollectInterface<E, T> fn) {
@@ -65,7 +55,7 @@ public class Set<E> {
   }
 
   public boolean contains(final E obj) {
-    return hasSome(e -> { return e.equals(obj); });
+    return items.containsKey(obj);
   }
 
   public void removeAll() {

OvermindDL1 avatar Apr 12 '22 21:04 OvermindDL1