benchmarks icon indicating copy to clipboard operation
benchmarks copied to clipboard

Enhancing Rust version

Open Pzixel opened this issue 4 years ago • 7 comments

I compared Rust and Cpp versions and I found out that most of performance difference is branch misprediction. It also drastically changes from CPU to CPU becuase it relies on CPU internals.

I found that changing match *op on if let chain fixes this problem. Although I'm notsure if it follows your "Used data structures are idiomatic.". I mean it's more idiomatic to use match in these cases but what makes Rust itself is you could rewrite your code a bit to get better performance.

What do you think? On my machine I've got 1.45 rust vs 1.388 cpp

Pzixel avatar Dec 02 '19 10:12 Pzixel

Pattern matching is used everywhere in the tests when it's possible, and the average developer would do the same - they won't do micro-optimizations unless necessary. I'm not even sure that clippy won't complain about a bunch of if let instead of match, so we may assume that match would be used in the majority of the use-cases when Rust developers have to match more than just a couple of patterns.

And the actual question is "why match's code is slower than if-let's" (it could be related to branch misprediction I guess). If this question is answered and probably addressed (even by changing the Rust internals), the whole community benefits of it. That's the ultimate goal of the benchmarks - improve the existing systems. However, if you're looking into how to squeeze the best performance by utilizing various and potentially unsafe features, then, unfortunately, this project is not about that. Please note though that it's possible to submit another implementation in addition to the reference one that could work faster. So, if you're willing, you're welcome to submit a faster code, but please note that it would be an additional test.

nuald avatar Dec 02 '19 17:12 nuald

I understand the goal. Unfortunately current implementation of likely works with expressions only (see https://github.com/rust-lang/rust/issues/26179 ), so it couldn't be applied to the match branches. So the only way to get correct branches compilation is manually write the flow via if let to force specific expansion order.

I understand that this change may go against the spirit of the project, this is why I created an issue instead of PR.

Please note though that it's possible to submit another implementation in addition to the reference one that could work faster. So, if you're willing, you're welcome to submit a faster code, but please note that it would be an additional test.

That's fine. Although I don't think there is any reason to place Better Rust because results table will soon become a mess of "Cpp 534 vs Rust Alaha version" with thousands of alternatives. Keeping one most friendly and idiomatic is the best.

The only thing I can suggest is placeing #[inline(always)] to be on par with C++'s inlines in cpp bench. It saves some ticks on my machine. And it's a completely valid change which I see in lots of crates.

Pzixel avatar Dec 02 '19 20:12 Pzixel

Thank you for the detail explanation. Please let me note though:

  • likely looks like a micro-optimization, and C++ version doesn't use it. If C++ code has better predictions rate, than it seems like Rust has a little problem with the optimization of match expression (assuming that if let are predicted better).
  • I've missed the inline keyword in C++, I guess I would rather delete inline from there and would allow the compiler decide. And again, if Rust doesn't inline those simple functions, it's just a problem with heuristics and I hope they'll fix it eventually.

nuald avatar Dec 03 '19 00:12 nuald

likely looks like a micro-optimization, and C++ version doesn't use it. If C++ code has better predictions rate, than it seems like Rust has a little problem with the optimization of match expression (assuming that if let are predicted better).

There is no way CPP could predict which branch will happen more often. But JIT languages can. This is why Kotlin is this good at brainfuck2 benchmarks. After it hits enough times it recompiles the functon with statistics available. And in this statistics it sees that two branches are almost never hit so it generates much better code than any native language (that don't have runtime profiling information).

So there is nothing to "fix eventually" because there is no way to detemine it without profile-guided optimization which could be considered as an optimisation effort (thus not naive idiomatic solution) itself. I'm not arguing, I'm just explaining what I see here. Btw I wonder why Cpp even compiles differently. This is another study case for me.

I guess I would rather delete inline from there

Sounds good to me :)

Pzixel avatar Dec 03 '19 08:12 Pzixel

Btw I wonder why Cpp even compiles differently. This is another study case for me.

This would interest me as well - come back once you found something, thanks.

dumblob avatar Dec 03 '19 08:12 dumblob

If JIT were so good, Kotlin would work excellent on the bigger BF script too, but it doesn't, so I guess that it's not the whole story. On the other hand, considering that Kotlin removes some runtime checks (OpenJDK gives warning "OpenJDK 64-Bit Server VM warning: Options -Xverify:none and -noverify were deprecated in JDK 13 and will likely be removed in a future release."), and as those checks could screw up the branch predictions, it explains why it performs better than other JVM-based tests.

I think it's still worth to have the optimized Rust version with likely and inline - it may help to determine and eliminate the weak points. If natively compiled optimized version performs worse than JIT-ed program, then it's definitely an issue that would be nice to address. It could be some compiler issues, like noticed in https://parallel-rust-cpp.github.io/v6.html - Clang and Rustc version should be identical, but they're not. My other suggestion would be tweaking the compiler flags (like using target-cpu as JIT could do the same, generating code specific to the current CPU). However, it's definitely up to you.

nuald avatar Dec 03 '19 18:12 nuald

I guess I would rather delete inline from there

Sounds good to me :)

Addressed in #209 . Please note that removing inline actually improved performance (I double checked), looks like compiler could do better job than human here. However, it's GCC and on my server, it could be different for you.

nuald avatar Dec 03 '19 19:12 nuald