Various improvements (royvanrijn)
I've tried making a version that scans in both directions; one thread starts at the beginning of a segment, the other starts at the end and processes backwards. This has one big advantage: if one thread is faster than the other, you can move the pivot between them, so you get "free" work-sharing. But I couldn't get the backwards logic to work reliably, so I ditched the idea for now.
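For what it's worth, here is a minimal sketch of what such a shared segment might look like, assuming two atomic cursors rather than an explicit pivot; all names and the chunk size are hypothetical and not taken from the PR:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: two workers consume one segment from opposite ends, each
// claiming chunks via an atomic cursor. A faster worker simply claims more chunks
// before the cursors meet, which is where the "free" work-sharing comes from.
final class BidirectionalSegment {
    private static final long CHUNK = 1 << 20;   // arbitrary 1 MiB claim size
    private final AtomicLong lo;                 // next unclaimed byte from the front
    private final AtomicLong hi;                 // end of the unclaimed region (exclusive)

    BidirectionalSegment(long start, long end) {
        this.lo = new AtomicLong(start);
        this.hi = new AtomicLong(end);
    }

    // Forward worker: claim the next chunk from the front, or -1 if the segment is spent.
    long claimForward() {
        long from = lo.getAndAdd(CHUNK);
        return from < hi.get() ? from : -1;
    }

    // Backward worker: the mirror image, claiming from the back.
    long claimBackward() {
        long to = hi.getAndAdd(-CHUNK);
        return to > lo.get() ? to : -1;
    }

    // Note: near the meeting point the two cursors can cross and hand out overlapping
    // chunks; resolving that cleanly is exactly the kind of edge case that makes the
    // backwards variant hard to get right.
}
```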
Now I've instead implemented a kind of vectorized, unrolled SWAR technique; hopefully Graal is smart enough to figure it out haha. It works on my MacBook M2:
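(SWAR, "SIMD within a register", treats a plain 64-bit `long` as eight byte lanes.) As an illustration of the general technique, and only an assumption about what this PR's code does, here is the classic SWAR search for a delimiter byte such as the `;` separating station names from temperatures:

```java
// Classic SWAR byte search: a minimal sketch of the general technique, not the PR's code.
// Operates on a little-endian 8-byte word, as read from the memory-mapped input.
final class Swar {
    private static final long SEMI = 0x3B3B3B3B3B3B3B3BL;   // ';' replicated into all lanes
    private static final long LOW  = 0x0101010101010101L;
    private static final long HIGH = 0x8080808080808080L;

    // Returns the index (0-7) of the first ';' byte in the word, or -1 if absent.
    static int firstSemicolon(long word) {
        long diff = word ^ SEMI;                  // lanes equal to ';' become 0x00
        long hit  = (diff - LOW) & ~diff & HIGH;  // high bit set in each zero lane
        return hit == 0 ? -1 : Long.numberOfTrailingZeros(hit) >>> 3;
    }
}
```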
Leaderboard

| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|----------------|-----|-----------|-------|
| | 00:01.193 | link | 21.0.1-graal | Roy van Rijn | GraalVM native binary |
| | 00:01.412 | link | 21.0.1-graal | Thomas Wuerthinger | GraalVM native binary |
| | 00:02.011 | link | 21.0.1-open | Quan Anh Mai | |
Check List:
- [X] Tests pass (`./test.sh <username>` shows no differences between expected and actual outputs)
- [X] All formatting changes by the build are committed
- [X] Your launch script is named `calculate_average_<username>.sh` (make sure to match casing of your GH user name) and is executable
- [X] Output matches that of `calculate_average_baseline.sh`
- Execution time: 1.210 s
- Execution time of reference implementation: previous version ran in 1.260 s
Doesn't seem to move the needle on the eval machine; a tad slower, maybe, even. Here are the results from ten runs of the current Top 3 (yours as per this PR; added the other two for reference):
```
Benchmark 1: timeout -v 300 ./calculate_average_royvanrijn.sh 2>&1
  Time (mean ± σ):     2.873 s ±  0.073 s    [User: 19.457 s, System: 0.748 s]
  Range (min … max):   2.784 s …  2.972 s    10 runs

Summary
  merykittyunsafe: trimmed mean 2.570922883885, raw times 2.56644442376,2.54621641876,2.51756642776,2.60756992176,2.60652918376,2.57729567076,2.59114373176,2.51674740976,2.55461729276,2.60960316576
  thomaswue: trimmed mean 2.70209734543, raw times 2.69491738518,2.71528659318,2.70493814218,2.68981020918,2.6951709371800003,2.69859319518,2.72307721518,2.69905854718,2.71900375418,2.68676762818
  royvanrijn: trimmed mean 2.871354856865, raw times 2.9674076887400003,2.9721756577400003,2.90254687074,2.8090787627400005,2.9175244417400004,2.78433209274,2.9313450097400002,2.81469274174,2.8198787917400003,2.80836454774
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.570 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykittyunsafe.java)| 21.0.1-open | [merykittyunsafe](https://github.com/merykittyunsafe) | |
| | 00:02.702 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.1-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary |
| | 00:02.871 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | GraalVM native binary |
That’s annoying haha, improving locally but getting slightly worse here.
Hehe, yeah. I'm wondering whether there's a way we could actually set up a GH Action runner on the machine (which could be enabled/disabled, so as not to interfere with official evaluation runs); that would allow interested folks to trigger runs themselves and e.g. display `perf stat` output too.
@gunnarmorling Tried a different approach, could you try this one? Also, I'm interested in seeing the `perf` stats if you have them.
Oh, and the option of having a GH Action would be awesome haha.
Same ballpark as before:
```
Benchmark 1: timeout -v 300 ./calculate_average_royvanrijn.sh 2>&1
  Time (mean ± σ):     2.872 s ±  0.067 s    [User: 19.563 s, System: 0.726 s]
  Range (min … max):   2.811 s …  2.949 s    5 runs

Summary
  royvanrijn: trimmed mean 2.8661954997466665, raw times 2.93671958308,2.81082788208,2.85103056908,2.9493011500799997,2.81083634708
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.866 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | GraalVM native binary |
That's so odd, curious.
I would not expect the "flattening" of the hash table entry array to make a difference, because the array is very sparsely populated. A sparse array of compressed pointers plus co-allocated objects should therefore have enough memory locality to make up for the additional pointer "hop". Does this change make a difference on your ARM machine?
It didn't make a huge difference, but it gave me a little more control; I was looking to pre-load the entries into memory to keep them close together.
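For readers following along, here is a hypothetical sketch of the two layouts being discussed; the field names and sizes are illustrative and not taken from either implementation:

```java
// Pointer-based layout: a sparse array of references. Reading a bucket costs one
// pointer "hop", but each Entry's fields sit together, so locality is still decent.
final class Entry {
    long nameAddress;
    int nameLength;
    int min, max, count;
    long sum;
}

final class PointerTable {
    final Entry[] buckets = new Entry[1 << 17]; // sparsely populated
}

// "Flattened" layout: the same fields spread across parallel primitive arrays,
// indexed by bucket. No pointer hop, and everything is pre-allocated up front,
// which is what gives the extra control over where the data lives.
final class FlatTable {
    static final int SIZE = 1 << 17;
    final long[] nameAddress = new long[SIZE];
    final int[] nameLength = new int[SIZE];
    final int[] min = new int[SIZE];
    final int[] max = new int[SIZE];
    final int[] count = new int[SIZE];
    final long[] sum = new long[SIZE];
}
```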