1brc
One last improvement for thomaswue
Check List:
- [x] You have run `./mvnw verify` and the project builds successfully
- [x] Tests pass (`./test.sh <username>` shows no differences between expected and actual outputs)
- [x] All formatting changes by the build are committed
- [x] Your launch script is named `calculate_average_<username>.sh` (make sure to match casing of your GH user name) and is executable
- [x] Output matches that of `calculate_average_baseline.sh`
- [x] For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
- Execution time: 0.31 s
- Execution time of reference implementation: 120.37 s
OK, so can't let @jerrinot have all the fun alone ;-). I adopted the approach that combines the <8 and 1-16 cases to avoid branch misprediction, which was the biggest improvement; and then also the masking instead of bit shifting, which added another few %. Added @jerrinot and @abeobk to the credits section. This should now get into the ballpark of @jerrinot's submission if the results from my machine don't lie. Major differences to his solution are that I am using Result objects for the hash table instead of unsafe access and I am unrolling 3 times. Also, it avoids the name length check by having the delimiter in the word mask.
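For readers following along, the delimiter-in-the-mask idea can be sketched roughly like this (a simplified stand-alone illustration, not the actual submission code; the class and method names are made up):

```java
public class SwarSketch {
    private static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' in every byte
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    // Classic SWAR zero-byte test: returns a word with 0x80 set in the high bit
    // of every byte of `word` that equals ';'.
    static long semicolonMatchBits(long word) {
        long diff = word ^ SEMI_PATTERN;
        return (diff - ONES) & ~diff & HIGHS;
    }

    // Mask the word so that all bytes up to AND including the first ';' survive,
    // using masking derived from the match bits instead of a length-based shift.
    // If there is no match, the whole word is kept.
    static long maskUpToDelimiter(long word, long matchBits) {
        return word & (matchBits ^ (matchBits - 1));
    }

    public static void main(String[] args) {
        byte[] bytes = {'A', 'B', ';', 'x', 'x', 'x', 'x', 'x'};
        long word = 0;
        for (int i = 7; i >= 0; i--) word = (word << 8) | (bytes[i] & 0xFF); // little-endian load
        long match = semicolonMatchBits(word);
        System.out.println(Long.toHexString(match));                         // 800000
        System.out.println(Long.toHexString(maskUpToDelimiter(word, match))); // 3b4241
    }
}
```

Because the delimiter byte itself survives the mask, two masked names of different lengths can never compare equal, which is what makes a separate name-length check unnecessary.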
@gunnarmorling Added one last code layout change that improves by 1-2%, but that is really it now ;-).
@thomaswue challenge accepted! Also: can we both agree that if you put null bytes into a string then you deserve to suffer? :)
Wait, now everybody switches to processing 16 bytes? I was experimenting with that more than a week ago haha. 🤦🏻‍♂️
@royvanrijn :)))) I have no idea how you managed to do so well while benchmarking on M2! 🙇 🪄 I found perf. results on ARM to not be representative at all.
I included the 16-byte processing in this PR. @mtopolnik had been playing with a similar thing even before. It performed quite well from the start, but 2 major things were holding it back:
- Bit-twiddling increased the total instruction count significantly (as visible in the perf stat attached to the PR). This was eventually solved by using @abeobk's lookup tables.
- The state was too large and the compiler resorted to stack spilling. @franz1981 and @thomaswue helped me identify it.
Based on the various experiments around the 16-byte processing, I also thought it would not benefit so much. But it seems like for my solution it was the last missing piece. It is amazing how once other bottlenecks are solved, suddenly a certain optimization can be super critical.
It completely changes the bottlenecks identified by perf stat from a lot of bad speculation:
```
 Performance counter stats for './target/CalculateAverage_thomaswue_image':

         12,061.39 msec task-clock                #   30.313 CPUs utilized
               681      context-switches          #   56.461 /sec
                99      cpu-migrations            #    8.208 /sec
           220,203      page-faults               #   18.257 K/sec
    25,250,300,213      cpu_core/cycles/          #    2.093 GHz                    (93.88%)
    26,877,576,823      cpu_atom/cycles/          #    2.228 GHz                    (46.36%)
    42,740,848,935      cpu_core/instructions/    #    1.69  insn per cycle         (93.88%)
    57,425,448,002      cpu_atom/instructions/    #    2.27  insn per cycle         (55.29%)
     3,938,912,502      cpu_core/branches/        #  326.572 M/sec                  (93.88%)
     5,271,137,390      cpu_atom/branches/        #  437.026 M/sec                  (55.61%)
       246,089,883      cpu_core/branch-misses/   #    6.25% of all branches        (93.88%)
       329,280,193      cpu_atom/branch-misses/   #    8.36% of all branches        (56.92%)
             TopdownL1 (cpu_core)                 #     12.8 %  tma_backend_bound
                                                  #     24.9 %  tma_bad_speculation
                                                  #      7.9 %  tma_frontend_bound
                                                  #     54.4 %  tma_retiring        (93.88%)
             TopdownL1 (cpu_atom)                 #     39.8 %  tma_bad_speculation (57.63%)
                                                  #      6.6 %  tma_frontend_bound  (57.99%)
                                                  #     12.9 %  tma_backend_bound
                                                  #     12.9 %  tma_backend_bound_aux (58.20%)
                                                  #     39.6 %  tma_retiring        (57.62%)

       0.397893788 seconds time elapsed

       0.000000000 seconds user
       0.006337000 seconds sys
```
To almost no bad speculation, much higher instructions per cycle, and almost no branch misses anymore:
```
 Performance counter stats for './target/CalculateAverage_thomaswue_image':

          9,778.04 msec task-clock                #   30.927 CPUs utilized
               811      context-switches          #   82.941 /sec
                81      cpu-migrations            #    8.284 /sec
           218,792      page-faults               #   22.376 K/sec
    21,654,496,541      cpu_core/cycles/          #    2.215 GHz                    (88.73%)
    19,474,462,268      cpu_atom/cycles/          #    1.992 GHz                    (54.36%)
    49,328,285,361      cpu_core/instructions/    #    2.28  insn per cycle         (88.73%)
    68,071,427,732      cpu_atom/instructions/    #    3.14  insn per cycle         (59.45%)
     3,690,652,954      cpu_core/branches/        #  377.443 M/sec                  (88.73%)
     5,032,375,583      cpu_atom/branches/        #  514.661 M/sec                  (59.45%)
        14,111,445      cpu_core/branch-misses/   #    0.38% of all branches        (88.73%)
        19,354,582      cpu_atom/branch-misses/   #    0.52% of all branches        (59.46%)
             TopdownL1 (cpu_core)                 #     18.7 %  tma_backend_bound
                                                  #      2.2 %  tma_bad_speculation
                                                  #      5.8 %  tma_frontend_bound
                                                  #     73.3 %  tma_retiring        (88.73%)
             TopdownL1 (cpu_atom)                 #      4.5 %  tma_bad_speculation (59.45%)
                                                  #      1.8 %  tma_frontend_bound  (59.45%)
                                                  #     28.7 %  tma_backend_bound
                                                  #     28.7 %  tma_backend_bound_aux (59.48%)
                                                  #     66.7 %  tma_retiring        (60.12%)
```
Obviously, missed branches are bad. But what I did not expect is how bad they are. There are 500 million missed branches removed (which makes sense, because basically every 2nd row is a branch miss in the version with separate 8-byte and 16-byte cases). The number of instructions for getting rid of them goes up by about 17 billion (!), and the new version still takes more than 20% fewer cycles.
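As a toy illustration of this tradeoff (a couple of extra ALU instructions buy freedom from misprediction; this is not code from the submission, just the textbook branchless-abs idiom):

```java
public class BranchlessAbs {
    // Branchy version: the CPU must predict the sign of x. With random signs,
    // roughly every second call is a misprediction.
    static int absBranchy(int x) {
        return x < 0 ? -x : x;
    }

    // Branchless version: more instructions, but no branch to mispredict.
    // m is 0 for non-negative x and -1 for negative x; (x + m) ^ m negates x
    // exactly when m == -1 (two's complement identity).
    static int absBranchless(int x) {
        int m = x >> 31;
        return (x + m) ^ m;
    }

    public static void main(String[] args) {
        System.out.println(absBranchless(-5)); // 5
        System.out.println(absBranchless(7));  // 7
    }
}
```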
Nice! Amazing to see this level of improvement at this point. Thanks a lot for participating in 1BRC, and congrats on implementing one of the very top entries, and keeping pushing updates over all the month, @thomaswue!
```
Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):      1.532 s ±  0.014 s    [User: 0.002 s, System: 0.003 s]
  Range (min … max):    1.495 s …  1.545 s    10 runs

Summary
  thomaswue: trimmed mean 1.535478878755, raw times 1.49517163788,1.53168578688,1.5320476678800001,1.53248459588,1.53386057488,1.54497609988,1.53989302288,1.53518686588,1.5412673778800001,1.53740513788
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:01.535 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |
@thomaswue congratulations!
Wow, cool, thank you, that is even faster on the evaluation server than expected! Nice to see how the solution can very much compete while keeping the core Result data structure in the style of a Java class.
As referenced in the credit and co-author section, I got a lot of inspiration and borrowed ideas from @merykitty, @mukel, @artsiomkorzun, @jerrinot, and @abeobk. Thank you guys, it was a blast, I enjoyed it a lot (as you could see)! And I also learned a lot. Some of those learnings will go into future Graal compiler optimizations and tunings for sure!
Which translates to: know the pattern of your data and your domain, and you can get an immensely faster program. It would be nice to see now, after the rightly applied tradeoffs, how a high-level Java program could perform without relying on Unsafe,
i.e. with the right stride to parse, branchless parsing, and decent locality (without Unsafe).
It should be possible to go without unsafe and without performance loss, but it will require some really careful crafting to make sure all bounds checks are combined/moved to mitigate their effect. The memory API also has a few more checks than just the bounds (thread owner, scope, temporal check). I tried to use it and looked at the compiler graph.
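A minimal sketch of what the Unsafe-free access path looks like with the FFM API (final since JDK 22; the class and method names here are illustrative only). Every `get` carries the bounds, scope, and ownership checks mentioned above, and the compiler has to hoist or combine them to approach Unsafe-level throughput:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

public class SegmentRead {
    // Force little-endian so the SWAR byte order is the same on every platform.
    private static final ValueLayout.OfLong LONG_LE =
            ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    // Reads 8 bytes at a time, as the SWAR loops do with Unsafe. The bounds,
    // liveness, and thread-ownership checks all happen inside get().
    static long readWordLE(MemorySegment seg, long offset) {
        return seg.get(LONG_LE, offset);
    }

    public static void main(String[] args) {
        // Heap-backed segment over a row prefix; no Arena/cleanup needed.
        MemorySegment seg = MemorySegment.ofArray("AB;xxxxx".getBytes());
        System.out.println(Long.toHexString(readWordLE(seg, 0)));
    }
}
```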
We will see if the Graal compiler can optimize those specially, or if some API additions might be necessary.
The other aspect we will add to Graal for sure is auto-vectorization of the search loop. That would avoid all that SWAR/mask complexity and produce decent vector code for the search automatically. We do that for the String indexOf methods already, but not for general user code.
> The other aspect we will add to Graal for sure is auto-vectorization of the search loop
That is indeed a very nice thing which C2 can achieve, sometimes, under certain conditions (which is an interesting combination of words :P ), but I still believe that having an intrinsic for a declarative MemorySegment::indexOf(byte) API would be a good idea as well, unless auto-vectorization can always make it happen, regardless of conditions like inlining level etc.
There are many cases where such things are needed (I've implemented SWAR manually on Netty's off-heap/on-heap buffers exactly because I couldn't rely on ANY compiler to do it right, always), and I think others in the Java middleware data-processing landscape would benefit from it (Elasticsearch?).
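A scalar SWAR `indexOf(byte)` of the kind being discussed might look like this (a hypothetical sketch, not Netty's or the JDK's implementation). This is exactly the loop a guaranteed intrinsic or auto-vectorization would replace with real vector instructions:

```java
public class SwarIndexOf {
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    // Scans 8 bytes per iteration using the SWAR zero-byte trick,
    // then falls back to a byte loop for the tail.
    static int indexOf(byte[] data, int from, byte target) {
        long pattern = (target & 0xFFL) * ONES; // broadcast target into all 8 bytes
        int i = from;
        for (; i + 8 <= data.length; i += 8) {
            long x = readLongLE(data, i) ^ pattern;   // matching bytes become 0
            long match = (x - ONES) & ~x & HIGHS;     // 0x80 in each zero byte
            if (match != 0) {
                return i + (Long.numberOfTrailingZeros(match) >>> 3);
            }
        }
        for (; i < data.length; i++) {
            if (data[i] == target) return i;
        }
        return -1;
    }

    // Portable little-endian 8-byte load (stand-in for Unsafe/VarHandle access).
    static long readLongLE(byte[] a, int i) {
        long w = 0;
        for (int k = 7; k >= 0; k--) w = (w << 8) | (a[i + k] & 0xFF);
        return w;
    }

    public static void main(String[] args) {
        byte[] row = "Hamburg;12.0".getBytes();
        System.out.println(indexOf(row, 0, (byte) ';')); // 7
    }
}
```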
Yes, I think the best approach is to implement it as a general compiler transform and offer optional utility methods that if used are guaranteed to trigger the optimization.
> Obviously, missed branches are bad. But what I did not expect is how bad they are. There are 500 million missed branches removed (which makes sense, because basically every 2nd row is a branch miss in the 8-byte vs 16-byte separate case version). The number of instructions for getting rid of them goes up by about 17 billion (!) and the new version is still more than 20% less cycles.

That's very interesting!!!

The number of instructions vs branch misses is probably one of the reasons why some results are very different depending on the HW.
Also, OS configuration might play a role, because of the patches against the Meltdown and Spectre vulnerabilities.
Hey @thomaswue, @merykitty, and @mukel!
Congrats again on being in the Top 3 of the One Billion Row Challenge!
To celebrate this amazing achievement, I would like to send you a 1BRC t-shirt and coffee mug. To claim your prize, fill out this form by Feb 18. After submitting the form, please provide a comment with the random value you've specified in the form, so that I know it is you who submitted it.
All data entered will solely be used in relation to processing this shipment. Shipments can be sent to any country listed here. A big thank you to Decodable for sponsoring these prizes!
Thanks a lot for participating in 1BRC,
--Gunnar
Hey @mtopolnik, thx for replying. I've tagged you on your own PR with the right link to the Top 20 form. Nice try to get that Top 3 t-shirt though ;)
Oops... I got notifications for a bunch of them and had all of them open in the browser.
Thank you so much @gunnarmorling! Looking forward to proudly wearing that #1brc shirt in some of my next talks ;-).
Number is 9597 9145
Hi @gunnarmorling, my random tokens: 3728 28670