
More UNSAFE, flyweight finally faster

Open royvanrijn opened this issue 1 year ago • 5 comments

For some reason, suddenly, this flyweight-with-byte[] version is now faster on my machine...

I've had a local branch for a couple of days where I no longer have an Entry class, just flyweight byte[]s. I'm basically writing C now. Until the small tweaks this morning it was slower, but after picking up the other minor tweaks it now runs faster.
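For illustration, here is a minimal sketch of what a flyweight byte[] table can look like (hypothetical names, not the code in this branch): each hash-table slot stores the raw station-name bytes, and lookups compare incoming bytes in place, so no per-row objects are allocated.

```java
class FlyweightTable {
    // One byte[] per slot holding a station name; min/max/sum/count would
    // live in parallel arrays in a real implementation.
    private final byte[][] names = new byte[1 << 14][];

    /** Returns the slot index for the name at data[off..off+len), inserting on first sight. */
    int slotFor(byte[] data, int off, int len) {
        int hash = 1;
        for (int i = off; i < off + len; i++) hash = 31 * hash + data[i];
        int slot = hash & (names.length - 1);
        while (true) {
            byte[] stored = names[slot];
            if (stored == null) {              // empty slot: claim it, copy the name once
                names[slot] = java.util.Arrays.copyOfRange(data, off, off + len);
                return slot;
            }
            if (matches(stored, data, off, len)) return slot;   // existing entry
            slot = (slot + 1) & (names.length - 1);             // linear probing
        }
    }

    private static boolean matches(byte[] stored, byte[] data, int off, int len) {
        if (stored.length != len) return false;
        for (int i = 0; i < len; i++) if (stored[i] != data[off + i]) return false;
        return true;
    }
}
```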

Also: The scores seem to vary a lot from machine to machine, so now I'm just guessing what will work on the target machine.

If I run the top three locally on my MacBook Pro M2 with my latest code, I get the following reversed order:

| # | Result (m:s.ms) | Implementation | JDK          | Submitter          | Notes                 |
|---|-----------------|----------------|--------------|--------------------|-----------------------|
|   | 00:01.276       | link           | 21.0.1-graal | Roy van Rijn       | GraalVM native binary |
|   | 00:01.391       | link           | 21.0.1-graal | Thomas Wuerthinger | GraalVM native binary |
|   | 00:02.016       | link           | 21.0.1-open  | Quan Anh Mai       |                       |

So improving is a bit hard because I'm basically shooting in the dark and hoping things improve on the target machine as well...

Check List:

  • [X] Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • [X] All formatting changes by the build are committed
  • [X] Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • [X] Output matches that of calculate_average_baseline.sh
  • Execution time: 00:01.260
  • Execution time of reference implementation: -

royvanrijn avatar Jan 12 '24 22:01 royvanrijn

> Also: The scores seem to vary a lot from machine to machine, now I'm just guessing what will work on the target machine.
>
> If I run the top three locally on my MacBook Pro M2 with my latest code, I get the following reversed order: (...) So improving is a bit hard because I'm basically shooting in the dark and hoping things improve on the target machine as well...

Apple Silicon has 128-bit SIMD (ARM NEON), while Zen 2 has 256-bit SIMD (AVX2), so that can affect performance a lot if some implementations assume that values fit into SIMD register almost always and have slow fallback routines.

tarsa avatar Jan 13 '24 00:01 tarsa

It seems to be a tiny bit faster, but variance is higher than before:

Benchmark 1: timeout -v 300 ./calculate_average_royvanrijn.sh 2>&1
  Time (mean ± σ):      2.935 s ±  0.312 s    [User: 19.625 s, System: 0.692 s]
  Range (min … max):    2.778 s …  3.579 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  royvanrijn: trimmed mean 2.873705197275, raw times 2.7779565444,3.4680600193999997,2.8028195914,2.7799835364,2.7803648753999997,2.7854204734,2.7788515914,2.8128272034,2.7813142874,3.5793960354
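The trimmed mean reported above matches dropping the single fastest and slowest run and averaging the remaining eight; a small Java check of that definition (the method name is mine):

```java
import java.util.Arrays;

class TrimmedMean {
    /** Drop the single fastest and slowest run, average the remaining times. */
    static double trimmedMean(double[] times) {
        double[] sorted = times.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) sum += sorted[i];
        return sum / (sorted.length - 2);
    }
}
```

Feeding in the ten raw times above reproduces the reported 2.873705197275.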

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:02.873 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | GraalVM native binary |

I also ran perf stat. Here's what I get for your entry as per current main (note this is with 32 cores):

 Performance counter stats for './calculate_average_royvanrijn.sh':

         21,074.38 msec task-clock:u                     #   20.841 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
           224,783      page-faults:u                    #   10.666 K/sec
    49,430,931,263      cycles:u                         #    2.346 GHz                         (83.32%)
       545,057,937      stalled-cycles-frontend:u        #    1.10% frontend cycles idle        (83.37%)
    21,012,068,419      stalled-cycles-backend:u         #   42.51% backend cycles idle         (83.33%)
   130,952,310,122      instructions:u                   #    2.65  insn per cycle
                                                  #    0.16  stalled cycles per insn     (83.33%)
    13,220,466,714      branches:u                       #  627.324 M/sec                       (83.33%)
       531,237,874      branch-misses:u                  #    4.02% of all branches             (83.32%)

       1.011222453 seconds time elapsed

      19.826243000 seconds user
       0.804037000 seconds sys

And that's it for this PR:

 Performance counter stats for './calculate_average_royvanrijn.sh':

         20,716.55 msec task-clock:u                     #   21.461 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
           225,154      page-faults:u                    #   10.868 K/sec
    48,455,951,511      cycles:u                         #    2.339 GHz                         (83.37%)
       538,780,303      stalled-cycles-frontend:u        #    1.11% frontend cycles idle        (83.35%)
    24,334,557,609      stalled-cycles-backend:u         #   50.22% backend cycles idle         (83.30%)
   132,469,719,208      instructions:u                   #    2.73  insn per cycle
                                                  #    0.18  stalled cycles per insn     (83.34%)
    10,193,140,558      branches:u                       #  492.029 M/sec                       (83.33%)
       530,907,870      branch-misses:u                  #    5.21% of all branches             (83.37%)

       0.965325877 seconds time elapsed

      19.405281000 seconds user
       0.860136000 seconds sys

gunnarmorling avatar Jan 13 '24 16:01 gunnarmorling

I've noticed the same: running with more threads seems to be a tiny bit better, so let's try that. With more work to go around there is less of an issue of everybody waiting for the last chunk to finish... but it also uses more memory.

I'd love to implement something like work-stealing, but I fail to see how...
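(One idea, sketched here purely for illustration and not code from this PR: cut the input into many more chunks than threads and let threads claim chunks from a shared atomic cursor. Fast threads naturally claim more chunks, which approximates work-stealing without a deque.)

```java
import java.util.concurrent.atomic.AtomicInteger;

class ChunkCursor {
    private final AtomicInteger next = new AtomicInteger();
    private final int chunkCount;

    ChunkCursor(int chunkCount) { this.chunkCount = chunkCount; }

    /** Returns the next unclaimed chunk index, or -1 when all chunks are taken. */
    int claim() {
        int i = next.getAndIncrement();
        return i < chunkCount ? i : -1;
    }
}
```

Each worker thread would loop `for (int c; (c = cursor.claim()) != -1; ) process(c);` and merge its partial results at the end.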

@gunnarmorling I've pushed a different version, does that help or hurt?

royvanrijn avatar Jan 13 '24 17:01 royvanrijn

@royvanrijn One thing I found is that in order for the UNSAFE calls to be at maximum efficiency, you should initialize the class with the UNSAFE field at build time. So in your case it should be "--initialize-at-build-time=dev.morling.onebrc.CalculateAverage_royvanrijn". The problem is that otherwise there is a check of this field against null in the compiled code (to adhere to Java semantics). It isn't making that much of a difference for my solution, but given your high instruction per cycle count, it could be more noticeable for yours. Probably also relevant for @artsiomkorzun.
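For concreteness, the usual idiom for obtaining the Unsafe instance looks like this (a generic sketch of the common reflective pattern, not the PR's exact code). With `--initialize-at-build-time`, this static initializer runs during the native-image build, so the compiled code can treat UNSAFE as a non-null constant instead of re-checking it:

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

class UnsafeHolder {
    static final Unsafe UNSAFE;
    static {
        try {
            // The JDK hides the singleton behind a private static field.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
}
```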

thomaswue avatar Jan 14 '24 18:01 thomaswue

Very nice tip @thomaswue; these things are obvious once you know them, but I'm mostly oblivious at first.

I'm really curious about the code I'm currently working on; it's probably not worth it and not any faster, but it will be novel at least. I'm playing around with work-sharing, which would really help on my laptop; it probably won't have any positive impact on the server with its extra cores.

But we haven't seen any real changes in the algorithms; everybody just splits the input into equal chunks and hopes for the best.
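The equal-chunks approach, sketched for illustration (a hypothetical helper operating on a byte[] for simplicity; real entries split a memory-mapped file the same way): divide the input into N equal byte ranges, then push each boundary forward to the next newline so no line is cut in half.

```java
class ChunkSplitter {
    /** Returns chunks+1 offsets; chunk i is data[starts[i] .. starts[i+1]). */
    static int[] chunkStarts(byte[] data, int chunks) {
        int[] starts = new int[chunks + 1];
        starts[chunks] = data.length;
        for (int i = 1; i < chunks; i++) {
            int pos = (int) ((long) data.length * i / chunks); // ideal split point
            while (pos < data.length && data[pos] != '\n') pos++; // advance to end of line
            starts[i] = Math.min(pos + 1, data.length);           // start of next line
        }
        return starts;
    }
}
```

Every chunk then begins exactly at the start of a line, so each thread can parse its range independently.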

royvanrijn avatar Jan 14 '24 20:01 royvanrijn