Various improvements (royvanrijn)
I've tried making a version that scans in both directions; one thread starts at the beginning of a segment, the other starts at the end and processes backwards. This has one big advantage: if one thread is faster than the other, you can move the pivot between them, so you get "free" work-sharing. But I couldn't get the backwards logic to work reliably, so I ditched the idea for now.
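For what it's worth, here is a minimal sketch of what such a shared segment might look like, assuming two atomic cursors rather than an explicit pivot; all names and the chunk size are hypothetical and not taken from the PR:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: two workers consume one segment from opposite ends, each
// claiming chunks via an atomic cursor. A faster worker simply claims more chunks
// before the cursors meet, which is where the "free" work-sharing comes from.
final class BidirectionalSegment {
    private static final long CHUNK = 1 << 20;   // arbitrary 1 MiB claim size
    private final AtomicLong lo;                 // next unclaimed byte from the front
    private final AtomicLong hi;                 // end of the unclaimed region (exclusive)

    BidirectionalSegment(long start, long end) {
        this.lo = new AtomicLong(start);
        this.hi = new AtomicLong(end);
    }

    // Forward worker: claim the next chunk from the front, or -1 if the segment is spent.
    long claimForward() {
        long from = lo.getAndAdd(CHUNK);
        return from < hi.get() ? from : -1;
    }

    // Backward worker: the mirror image, claiming from the back.
    long claimBackward() {
        long to = hi.getAndAdd(-CHUNK);
        return to > lo.get() ? to : -1;
    }

    // Note: near the meeting point the two cursors can cross and hand out overlapping
    // chunks; resolving that cleanly is exactly the kind of edge case that makes the
    // backwards variant hard to get right.
}
```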
Now I've instead implemented a kind of vectorized, unrolled SWAR technique; hopefully Graal is smart enough to figure it out haha. It works on my MacBook M2:
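(SWAR, "SIMD within a register", treats a plain 64-bit `long` as eight byte lanes.) As an illustration of the general technique, and only an assumption about what this PR's code does, here is the classic SWAR search for a delimiter byte such as the `;` separating station names from temperatures:

```java
// Classic SWAR byte search: a minimal sketch of the general technique, not the PR's code.
// Operates on a little-endian 8-byte word, as read from the memory-mapped input.
final class Swar {
    private static final long SEMI = 0x3B3B3B3B3B3B3B3BL;   // ';' replicated into all lanes
    private static final long LOW  = 0x0101010101010101L;
    private static final long HIGH = 0x8080808080808080L;

    // Returns the index (0-7) of the first ';' byte in the word, or -1 if absent.
    static int firstSemicolon(long word) {
        long diff = word ^ SEMI;                  // lanes equal to ';' become 0x00
        long hit  = (diff - LOW) & ~diff & HIGH;  // high bit set in each zero lane
        return hit == 0 ? -1 : Long.numberOfTrailingZeros(hit) >>> 3;
    }
}
```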
Leaderboard

| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|----------------|-----|-----------|-------|
| | 00:01.193 | link | 21.0.1-graal | Roy van Rijn | GraalVM native binary |
| | 00:01.412 | link | 21.0.1-graal | Thomas Wuerthinger | GraalVM native binary |
| | 00:02.011 | link | 21.0.1-open | Quan Anh Mai | |
Check List:
- [X] Tests pass (`./test.sh <username>` shows no differences between expected and actual outputs)
- [X] All formatting changes by the build are committed
- [X] Your launch script is named `calculate_average_<username>.sh` (make sure to match casing of your GH user name) and is executable
- [X] Output matches that of `calculate_average_baseline.sh`
- Execution time: 1.210 s
- Execution time of reference implementation: previous version ran in 1.260 s
Doesn't seem to move the needle on the eval machine; a tad slower, maybe, even. Here are the results from ten runs of the current Top 3 (yours as per this PR; added the other two for reference):
```
Benchmark 1: timeout -v 300 ./calculate_average_royvanrijn.sh 2>&1
  Time (mean ± σ):     2.873 s ±  0.073 s    [User: 19.457 s, System: 0.748 s]
  Range (min … max):   2.784 s …  2.972 s    10 runs

Summary
  merykittyunsafe: trimmed mean 2.570922883885, raw times 2.56644442376,2.54621641876,2.51756642776,2.60756992176,2.60652918376,2.57729567076,2.59114373176,2.51674740976,2.55461729276,2.60960316576
  thomaswue: trimmed mean 2.70209734543, raw times 2.69491738518,2.71528659318,2.70493814218,2.68981020918,2.6951709371800003,2.69859319518,2.72307721518,2.69905854718,2.71900375418,2.68676762818
  royvanrijn: trimmed mean 2.871354856865, raw times 2.9674076887400003,2.9721756577400003,2.90254687074,2.8090787627400005,2.9175244417400004,2.78433209274,2.9313450097400002,2.81469274174,2.8198787917400003,2.80836454774
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.570 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykittyunsafe.java)| 21.0.1-open | [merykittyunsafe](https://github.com/merykittyunsafe) | |
| | 00:02.702 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| | 00:02.871 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | GraalVM native binary |
That’s annoying haha, improving locally but getting slightly worse here.
Hehe, yeah. I'm wondering whether there's a way we could actually set up a GH Action runner on the machine (which could be enabled/disabled, so as not to interfere with official evaluation runs); that would allow interested folks to trigger runs themselves and e.g. display `perf stat` output too.
@gunnarmorling Tried a different approach, could you try this one? Also, I'm interested in seeing the `perf` stats if you have them.
Oh, and the option of having a GH Action would be awesome haha.
Same ballpark as before:
```
Benchmark 1: timeout -v 300 ./calculate_average_royvanrijn.sh 2>&1
  Time (mean ± σ):     2.872 s ±  0.067 s    [User: 19.563 s, System: 0.726 s]
  Range (min … max):   2.811 s …  2.949 s    5 runs

Summary
  royvanrijn: trimmed mean 2.8661954997466665, raw times 2.93671958308,2.81082788208,2.85103056908,2.9493011500799997,2.81083634708
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.866 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | GraalVM native binary |
That's so odd, curious.
I would not expect the "flattening" of the hash table entry array to make a difference, because the array is very sparsely populated. A sparse array of compressed pointers plus co-allocated objects should therefore have enough memory locality to make up for the additional pointer "hop". Does this change make a difference on your ARM machine?
It didn't make a huge difference, but it gave me a little more control; I was looking to pre-load the entries into memory to keep them close together.
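For readers following along, here is a hypothetical sketch of the two layouts being discussed; the field names and sizes are illustrative and not taken from either implementation:

```java
// Pointer-based layout: a sparse array of references. Reading a bucket costs one
// pointer "hop", but each Entry's fields sit together, so locality is still decent.
final class Entry {
    long nameAddress;
    int nameLength;
    int min, max, count;
    long sum;
}

final class PointerTable {
    final Entry[] buckets = new Entry[1 << 17]; // sparsely populated
}

// "Flattened" layout: the same fields spread across parallel primitive arrays,
// indexed by bucket. No pointer hop, and everything is pre-allocated up front,
// which is what gives the extra control over where the data lives.
final class FlatTable {
    static final int SIZE = 1 << 17;
    final long[] nameAddress = new long[SIZE];
    final int[] nameLength = new int[SIZE];
    final int[] min = new int[SIZE];
    final int[] max = new int[SIZE];
    final int[] count = new int[SIZE];
    final long[] sum = new long[SIZE];
}
```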