
One last improvement for thomaswue

Open thomaswue opened this issue 1 year ago • 2 comments

Check List:

  • [x] You have run ./mvnw verify and the project builds successfully
  • [x] Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • [x] All formatting changes by the build are committed
  • [x] Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • [x] Output matches that of calculate_average_baseline.sh
  • [x] For new entries, or after substantial changes: When implementing custom hash structures, please point to where you deal with hash collisions (line number)
  • Execution time: 0.31
  • Execution time of reference implementation: 120.37

OK, so can't let @jerrinot have all the fun alone ;-). I adopted the approach that combines the <8 and 1-16 cases to avoid branch misprediction, which was the biggest improvement, and then also the masking instead of bit shifting, which added another few percent. Added @jerrinot and @abeobk to the credits section. This should now get into the ballpark of @jerrinot's submission, if the results from my machine don't lie. The major differences from his solution are that I am using Result objects for the hash table instead of unsafe access and that I am unrolling 3 times. It also avoids the name-length check by having the delimiter in the word mask.
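For readers following along, the core SWAR trick behind "the delimiter in the word mask" can be sketched as follows. This is a minimal illustration of the widely used zero-byte-detection pattern, not the exact code from the submission:

```java
public class SwarDelimiter {
    private static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' replicated in every byte

    // Returns a mask with the high bit (0x80) set in each byte lane of `word`
    // that equals ';'. For ASCII input this is exact; no branches involved.
    static long semicolonMatches(long word) {
        long diff = word ^ SEMI_PATTERN;               // zero byte where word has ';'
        return (diff - 0x0101010101010101L) & ~diff & 0x8080808080808080L;
    }

    public static void main(String[] args) {
        // "ab;cdefg" packed little-endian: ';' sits at byte offset 2
        long word = 0x67666564633B6261L;
        long match = semicolonMatches(word);
        int pos = Long.numberOfTrailingZeros(match) >>> 3; // byte index of the delimiter
        System.out.println(pos); // prints 2
    }
}
```

Once the match bits are known, the word can be masked so that only the name bytes before the delimiter contribute to the hash, which is what makes a separate name-length check unnecessary.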

thomaswue avatar Jan 31 '24 21:01 thomaswue

@gunnarmorling Added one last code layout change that improves by 1-2%, but that is really it now ;-).

thomaswue avatar Jan 31 '24 22:01 thomaswue

@thomaswue challenge accepted! also: can we both agree that if you put null bytes into a string then you deserve to suffer? :)

jerrinot avatar Jan 31 '24 23:01 jerrinot

Wait, now everybody switches to processing 16 bytes? I was experimenting with that more than a week ago haha. 🤦🏻‍♂️

royvanrijn avatar Feb 01 '24 06:02 royvanrijn

@royvanrijn :)))) I have no idea how you managed to do so well while benchmarking on M2! 🙇 🪄 I found performance results on ARM to not be representative at all.

I included the 16-byte processing in this PR. @mtopolnik had been playing with a similar thing even before. It performed quite well from the start, but 2 major things were holding it back:

  1. Bit-twiddling increased the total instruction count significantly (as visible in the perf stat output attached to the PR). This was eventually solved by using @abeobk's lookup tables.
  2. The state was too large, so the compiler resorted to stack spilling. @franz1981 and @thomaswue helped me identify it.
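The lookup-table idea can be illustrated with a simplified variant (names hypothetical; the actual submissions index their tables by patterns derived from the SWAR match bits, but indexing by delimiter position shows the principle): instead of recomputing a keep-mask with variable shifts on every row, precompute all nine possible masks once.

```java
public class MaskTable {
    // KEEP[p] keeps the p bytes below the delimiter; KEEP[8] keeps the whole word.
    // (1L << 64 wraps around to 1 in Java, hence the special case for p == 8.)
    static final long[] KEEP = new long[9];
    static {
        for (int p = 0; p < 8; p++) KEEP[p] = (1L << (p * 8)) - 1;
        KEEP[8] = -1L;
    }

    public static void main(String[] args) {
        long word = 0x67666564633B6261L; // "ab;cdefg" little-endian, ';' at byte 2
        long nameBits = word & KEEP[2];  // keep only the two name bytes "ab"
        System.out.printf("0x%X%n", nameBits); // prints 0x6261
    }
}
```

A table load replaces a data-dependent shift chain, which is one way the total instruction count came back down.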

jerrinot avatar Feb 01 '24 08:02 jerrinot

Based on the various experiments around 16-byte processing, I also thought it would not help that much. But it seems that for my solution it was the last missing piece. It is amazing how, once other bottlenecks are solved, a certain optimization can suddenly become super critical.

It completely changes the bottlenecks identified by perf stat from a lot of bad speculation:

Performance counter stats for './target/CalculateAverage_thomaswue_image':
      12,061.39 msec task-clock                       #   30.313 CPUs utilized             
              681      context-switches                 #   56.461 /sec                      
               99      cpu-migrations                   #    8.208 /sec                      
          220,203      page-faults                      #   18.257 K/sec                     
   25,250,300,213      cpu_core/cycles/                 #    2.093 GHz                         (93.88%)
   26,877,576,823      cpu_atom/cycles/                 #    2.228 GHz                         (46.36%)
   42,740,848,935      cpu_core/instructions/           #    1.69  insn per cycle              (93.88%)
   57,425,448,002      cpu_atom/instructions/           #    2.27  insn per cycle              (55.29%)
    3,938,912,502      cpu_core/branches/               #  326.572 M/sec                       (93.88%)
    5,271,137,390      cpu_atom/branches/               #  437.026 M/sec                       (55.61%)
      246,089,883      cpu_core/branch-misses/          #    6.25% of all branches             (93.88%)
      329,280,193      cpu_atom/branch-misses/          #    8.36% of all branches             (56.92%)
            TopdownL1 (cpu_core)                 #     12.8 %  tma_backend_bound      
                                                 #     24.9 %  tma_bad_speculation    
                                                 #      7.9 %  tma_frontend_bound     
                                                 #     54.4 %  tma_retiring             (93.88%)
            TopdownL1 (cpu_atom)                 #     39.8 %  tma_bad_speculation      (57.63%)
                                                 #      6.6 %  tma_frontend_bound       (57.99%)
                                                 #     12.9 %  tma_backend_bound      
                                                 #     12.9 %  tma_backend_bound_aux    (58.20%)
                                                 #     39.6 %  tma_retiring             (57.62%)

      0.397893788 seconds time elapsed

      0.000000000 seconds user
      0.006337000 seconds sys

To almost no bad speculation and much higher instructions per cycle and almost no branch misses anymore:

 Performance counter stats for './target/CalculateAverage_thomaswue_image':
             9,778.04 msec task-clock                       #   30.927 CPUs utilized             
               811      context-switches                 #   82.941 /sec                      
                81      cpu-migrations                   #    8.284 /sec                      
           218,792      page-faults                      #   22.376 K/sec                     
    21,654,496,541      cpu_core/cycles/                 #    2.215 GHz                         (88.73%)
    19,474,462,268      cpu_atom/cycles/                 #    1.992 GHz                         (54.36%)
    49,328,285,361      cpu_core/instructions/           #    2.28  insn per cycle              (88.73%)
    68,071,427,732      cpu_atom/instructions/           #    3.14  insn per cycle              (59.45%)
     3,690,652,954      cpu_core/branches/               #  377.443 M/sec                       (88.73%)
     5,032,375,583      cpu_atom/branches/               #  514.661 M/sec                       (59.45%)
        14,111,445      cpu_core/branch-misses/          #    0.38% of all branches             (88.73%)
        19,354,582      cpu_atom/branch-misses/          #    0.52% of all branches             (59.46%)
             TopdownL1 (cpu_core)                 #     18.7 %  tma_backend_bound      
                                                  #      2.2 %  tma_bad_speculation    
                                                  #      5.8 %  tma_frontend_bound     
                                                  #     73.3 %  tma_retiring             (88.73%)
             TopdownL1 (cpu_atom)                 #      4.5 %  tma_bad_speculation      (59.45%)
                                                  #      1.8 %  tma_frontend_bound       (59.45%)
                                                  #     28.7 %  tma_backend_bound      
                                                  #     28.7 %  tma_backend_bound_aux    (59.48%)
                                                  #     66.7 %  tma_retiring             (60.12%)

thomaswue avatar Feb 01 '24 09:02 thomaswue

Obviously, missed branches are bad. But what I did not expect is just how bad they are. About 500 million missed branches are removed (which makes sense, because in the version with separate 8-byte and 16-byte cases basically every 2nd row is a branch miss). The number of instructions needed to get rid of them goes up by about 17 billion (!), and the new version still takes more than 20% fewer cycles.
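A rough sanity check on the P-core numbers above. The ~15-cycle misprediction penalty is an assumed figure for illustration, not something measured here, and this ignores that IPC on the remaining instructions also improved, so it only shows the direction of the tradeoff:

```java
public class TradeoffMath {
    public static void main(String[] args) {
        // cpu_core figures taken from the two perf stat runs above
        double missesRemoved = 246e6 - 14e6;      // ~232M fewer branch misses
        double extraInsns    = 49.33e9 - 42.74e9; // ~6.6B more instructions
        double ASSUMED_MISS_PENALTY = 15;         // cycles per miss; hypothetical, CPU-dependent
        double cyclesSaved = missesRemoved * ASSUMED_MISS_PENALTY;
        double cyclesSpent = extraInsns / 2.28;   // at the measured 2.28 insn/cycle
        System.out.printf("saved ~%.1fB cycles, spent ~%.1fB cycles -> net win%n",
                cyclesSaved / 1e9, cyclesSpent / 1e9);
    }
}
```

Even with a conservative penalty estimate the extra instructions pay for themselves, consistent with the measured drop from 25.25B to 21.65B core cycles.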

thomaswue avatar Feb 01 '24 09:02 thomaswue

Nice! Amazing to see this level of improvement at this point. Thanks a lot for participating in 1BRC, congrats on implementing one of the very top entries, and for pushing updates throughout the entire month, @thomaswue!

Benchmark 1: timeout -v 300 ./calculate_average_thomaswue.sh 2>&1
  Time (mean ± σ):      1.532 s ±  0.014 s    [User: 0.002 s, System: 0.003 s]
  Range (min … max):    1.495 s …  1.545 s    10 runs

Summary
  thomaswue: trimmed mean 1.535478878755, raw times 1.49517163788,1.53168578688,1.5320476678800001,1.53248459588,1.53386057488,1.54497609988,1.53989302288,1.53518686588,1.5412673778800001,1.53740513788

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:01.535 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_thomaswue.java)| 21.0.2-graal | [Thomas Wuerthinger](https://github.com/thomaswue) | GraalVM native binary, uses Unsafe |

gunnarmorling avatar Feb 01 '24 09:02 gunnarmorling

@thomaswue congratulations!

jerrinot avatar Feb 01 '24 09:02 jerrinot

Wow, cool, thank you, that is even faster on the evaluation server than expected! Nice to see how the solution can very much compete while keeping the core Result data structure in the style of a Java class.

As referenced in the credit and co-author section, I got a lot of inspiration and borrowed ideas from @merykitty, @mukel, @artsiomkorzun, @jerrinot, and @abeobk. Thank you guys, it was a blast, I enjoyed it a lot (as you could see)! And I also learned a lot. Some of those learnings will go into future Graal compiler optimizations and tunings for sure!

thomaswue avatar Feb 01 '24 10:02 thomaswue

Which translates to: know the pattern of your data and your domain, and you can get an immensely faster program. Now that the right tradeoffs have been applied, it would be nice to see how a high-level Java program could perform without relying on Unsafe,

i.e. with the right stride to parse, branchless parsing, and decent locality (without Unsafe).

franz1981 avatar Feb 01 '24 10:02 franz1981

It should be possible to go without unsafe and without performance loss, but it will require some really careful crafting to make sure all bounds checks are combined/moved to mitigate their effect. The memory API also has a few more checks than just the bounds (thread owner, scope, temporal check). I tried to use it and looked at the compiler graph.

We will see whether the Graal compiler can perhaps optimize those specially, or whether some API additions might be necessary.

The other aspect we will add to Graal for sure is auto-vectorization of the search loop. That would avoid all that SWAR/mask complexity and produce decent vector code for the search automatically. We do that for the String indexOf methods already, but not for general user code.
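For context, the "search loop" in its plain scalar form, the shape a compiler would need to auto-vectorize for the manual SWAR and masking to become unnecessary, is just a byte-wise linear scan (sketch, not code from any submission):

```java
public class ScalarSearch {
    // Byte-wise linear search for a delimiter. String.indexOf-style intrinsics
    // already turn exactly this shape into vector compares; the point above is
    // doing the same transformation for general user code.
    static int indexOf(byte[] data, int from, byte target) {
        for (int i = from; i < data.length; i++) {
            if (data[i] == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] row = "Hamburg;12.0".getBytes();
        System.out.println(indexOf(row, 0, (byte) ';')); // prints 7
    }
}
```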

thomaswue avatar Feb 01 '24 10:02 thomaswue

> The other aspect we will add to Graal for sure is auto-vectorization of the search loop

That is indeed a very nice thing which C2 can achieve, sometimes, under certain conditions (which is an interesting combination of words :P). But I still believe that having an intrinsic for a declarative MemorySegment::indexOf(byte) API would be a good idea as well, unless auto-vectorization can always make it happen regardless of conditions like inlining level, etc. There are many cases where such things are needed (I've manually implemented SWAR on Netty's off-heap/on-heap buffers exactly because I couldn't rely on any compiler to always do it right), and I think others in the Java middleware and data-processing landscape would benefit from it (Elasticsearch?).

franz1981 avatar Feb 01 '24 10:02 franz1981

Yes, I think the best approach is to implement it as a general compiler transform and offer optional utility methods that if used are guaranteed to trigger the optimization.

thomaswue avatar Feb 01 '24 10:02 thomaswue

> Obviously, missed branches are bad. But what I did not expect is just how bad they are. About 500 million missed branches are removed (which makes sense, because in the version with separate 8-byte and 16-byte cases basically every 2nd row is a branch miss). The number of instructions needed to get rid of them goes up by about 17 billion (!), and the new version still takes more than 20% fewer cycles.

That's very interesting!!!

The instructions-vs-branch-misses tradeoff is probably one of the reasons why some results differ so much depending on the hardware.

Also, OS configuration might play a role, because of the patches against the Meltdown and Spectre vulnerabilities.

tivrfoa avatar Feb 01 '24 12:02 tivrfoa

Hey @thomaswue, @merykitty, and @mukel!

Congrats again on being in the Top 3 of the One Billion Row Challenge!

To celebrate this amazing achievement, I would like to send you a 1BRC t-shirt and coffee mug. To claim your prize, fill out this form by Feb 18. After submitting the form, please provide a comment with the random value you've specified in the form, so that I know it is you who submitted it.

All data entered will solely be used in relation to processing this shipment. Shipments can be sent to any country listed here. A big thank you to Decodable for sponsoring these prizes!

Thanks a lot for participating in 1BRC,

--Gunnar

gunnarmorling avatar Feb 11 '24 11:02 gunnarmorling

Hey @mtopolnik, thx for replying. I've tagged you on your own PR with the right link to the Top 20 form. Nice try to get that Top 3 t-shirt though ;)

gunnarmorling avatar Feb 11 '24 12:02 gunnarmorling

Oops... I got notifications for a bunch of them and had all of them open in the browser.

mtopolnik avatar Feb 11 '24 12:02 mtopolnik

Thank you so much @gunnarmorling! Looking forward to proudly wearing that #1brc shirt in some of my next talks ;-).

Number is 9597 9145

thomaswue avatar Feb 11 '24 22:02 thomaswue

Hi @gunnarmorling, my random tokens: 3728 28670

mukel avatar Feb 13 '24 05:02 mukel