1brc
More UNSAFE, flyweight finally faster
For some reason this flyweight-with-byte[] version is now suddenly faster on my machine...
I've had a local branch for a couple of days where I no longer have an Entry class, just flyweight byte[]s. I'm basically writing C now. Until this morning's small tweaks it was slower, but combined with the other minor tweaks it now runs faster.
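The flyweight idea can be sketched roughly like this (an illustrative sketch, not the actual PR code — the table size, the trivial hash, and all names here are hypothetical): instead of allocating an Entry object per station, the stats live in flat primitive arrays and an "entry" is just a slot index.

```java
import java.util.Arrays;

// Sketch of a flyweight-style table: per-station stats live in flat primitive
// arrays and the "entry" is just an index; no per-station objects are allocated.
public class FlyweightTable {
    static final int SLOTS = 1 << 14;                // power of two, for mask indexing
    static final byte[][] names = new byte[SLOTS][]; // station name per occupied slot
    static final long[] stats = new long[SLOTS * 4]; // min, max, sum, count per slot

    static void add(byte[] name, int hash, int temperature) {
        int slot = hash & (SLOTS - 1);
        // Linear probing on collision; compare raw name bytes, no String allocation.
        while (names[slot] != null && !Arrays.equals(names[slot], name)) {
            slot = (slot + 1) & (SLOTS - 1);
        }
        int base = slot * 4;
        if (names[slot] == null) {
            names[slot] = name;
            stats[base] = Integer.MAX_VALUE;     // min sentinel
            stats[base + 1] = Integer.MIN_VALUE; // max sentinel
        }
        stats[base] = Math.min(stats[base], temperature);
        stats[base + 1] = Math.max(stats[base + 1], temperature);
        stats[base + 2] += temperature;          // sum
        stats[base + 3]++;                       // count
    }

    public static void main(String[] args) {
        byte[] ams = "Amsterdam".getBytes();
        // Toy hash (name length) just for this demo; temperatures are tenths of a degree.
        add(ams, ams.length, 105);
        add(ams, ams.length, -23);
        int base = (ams.length & (SLOTS - 1)) * 4;
        System.out.println(stats[base] + " " + stats[base + 1] + " "
                + stats[base + 2] + " " + stats[base + 3]);
    }
}
```

The point is that lookup and update touch only primitive arrays, so there is no object-header overhead and no pointer chasing per row.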
Also: the scores seem to vary a lot from machine to machine, so now I'm just guessing what will work on the target machine.
If I run the top three locally on my MacBook Pro M2 with my latest code, I get the following reversed order:
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|---|---|---|---|---|
| 1 | 00:01.276 | link | 21.0.1-graal | Roy van Rijn | GraalVM native binary |
| 2 | 00:01.391 | link | 21.0.1-graal | Thomas Wuerthinger | GraalVM native binary |
| 3 | 00:02.016 | link | 21.0.1-open | Quan Anh Mai | |
So improving is a bit hard because I'm basically shooting in the dark and hoping things improve on the target machine as well...
Check List:
- [X] Tests pass (`./test.sh <username>` shows no differences between expected and actual outputs)
- [X] All formatting changes by the build are committed
- [X] Your launch script is named `calculate_average_<username>.sh` (make sure to match casing of your GH user name) and is executable
- [X] Output matches that of `calculate_average_baseline.sh`
- Execution time: 00:01.260
- Execution time of reference implementation: -
Apple Silicon has 128-bit SIMD (ARM NEON), while Zen 2 has 256-bit SIMD (AVX2), so that can affect performance a lot if an implementation assumes that values almost always fit into a SIMD register and has slow fallback routines.
It seems to be a tiny bit faster, but variance is higher than before:
```
Benchmark 1: timeout -v 300 ./calculate_average_royvanrijn.sh 2>&1
  Time (mean ± σ):      2.935 s ±  0.312 s    [User: 19.625 s, System: 0.692 s]
  Range (min … max):    2.778 s …  3.579 s    10 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  royvanrijn: trimmed mean 2.873705197275, raw times 2.7779565444,3.4680600193999997,2.8028195914,2.7799835364,2.7803648753999997,2.7854204734,2.7788515914,2.8128272034,2.7813142874,3.5793960354
```
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.873 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java)| 21.0.1-graal | [Roy van Rijn](https://github.com/royvanrijn) | GraalVM native binary |
I also ran `perf stat`. Here's what I get for your entry as per current main (note this is with 32 cores):
```
 Performance counter stats for './calculate_average_royvanrijn.sh':

         21,074.38 msec task-clock:u              #   20.841 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
           224,783      page-faults:u             #   10.666 K/sec
    49,430,931,263      cycles:u                  #    2.346 GHz                      (83.32%)
       545,057,937      stalled-cycles-frontend:u #    1.10% frontend cycles idle     (83.37%)
    21,012,068,419      stalled-cycles-backend:u  #   42.51% backend cycles idle      (83.33%)
   130,952,310,122      instructions:u            #    2.65  insn per cycle
                                                  #    0.16  stalled cycles per insn  (83.33%)
    13,220,466,714      branches:u                #  627.324 M/sec                    (83.33%)
       531,237,874      branch-misses:u           #    4.02% of all branches          (83.32%)

       1.011222453 seconds time elapsed

      19.826243000 seconds user
       0.804037000 seconds sys
```
And here's the output for this PR:
```
 Performance counter stats for './calculate_average_royvanrijn.sh':

         20,716.55 msec task-clock:u              #   21.461 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
           225,154      page-faults:u             #   10.868 K/sec
    48,455,951,511      cycles:u                  #    2.339 GHz                      (83.37%)
       538,780,303      stalled-cycles-frontend:u #    1.11% frontend cycles idle     (83.35%)
    24,334,557,609      stalled-cycles-backend:u  #   50.22% backend cycles idle      (83.30%)
   132,469,719,208      instructions:u            #    2.73  insn per cycle
                                                  #    0.18  stalled cycles per insn  (83.34%)
    10,193,140,558      branches:u                #  492.029 M/sec                    (83.33%)
       530,907,870      branch-misses:u           #    5.21% of all branches          (83.37%)

       0.965325877 seconds time elapsed

      19.405281000 seconds user
       0.860136000 seconds sys
```
I've noticed the same: when I run with more threads it seems to be a tiny bit better, so let's try that. With more work items there's less of an issue of everybody waiting for the last thread to finish... but it also uses more memory.
I'd love to implement something like work-stealing, but I fail to see how...
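One common pattern that gets most of the benefit of work-stealing without a deque per thread is a shared atomic cursor over fixed-size chunks: every thread grabs the next unclaimed chunk, so fast threads simply claim more chunks and nobody idles waiting on a slow peer. A minimal sketch (the file size and chunk size here are made up for illustration, not from the PR):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: threads claim chunks from a shared atomic cursor instead of
// getting one fixed pre-assigned slice each.
public class ChunkStealing {
    static final long FILE_SIZE = 1_000_000;  // hypothetical input size
    static final long CHUNK = 64 * 1024;      // hypothetical chunk size
    static final AtomicLong cursor = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        int threads = Runtime.getRuntime().availableProcessors();
        AtomicLong totalBytes = new AtomicLong();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                long start;
                // getAndAdd hands out disjoint [start, start+CHUNK) ranges;
                // a fast thread just loops around and claims another one.
                while ((start = cursor.getAndAdd(CHUNK)) < FILE_SIZE) {
                    long end = Math.min(start + CHUNK, FILE_SIZE);
                    totalBytes.addAndGet(end - start); // stand-in for real parsing
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        // Every byte was claimed exactly once across all threads.
        System.out.println(totalBytes.get() == FILE_SIZE);
    }
}
```

With memory-mapped input the claimed ranges would be offsets into the mapping (adjusted to the next newline boundary), so the only shared state is the single `AtomicLong`.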
@gunnarmorling I've pushed a different version, does that help or hurt?
@royvanrijn One thing I found is that in order for the UNSAFE calls to be at maximum efficiency, you should initialize the class with the UNSAFE field at build time. So in your case it should be "--initialize-at-build-time=dev.morling.onebrc.CalculateAverage_royvanrijn". The problem is that otherwise there is a check of this field against null in the compiled code (to adhere to Java semantics). It isn't making that much of a difference for my solution, but given your high instruction per cycle count, it could be more noticeable for yours. Probably also relevant for @artsiomkorzun.
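The usual shape of such an UNSAFE holder — a sketch, not the actual PR code — is a `static final` field populated in a static initializer via reflection. When the class is initialized at image build time (`--initialize-at-build-time=...`, as suggested above), the compiler can treat the field as a non-null constant and drop the null check on every access:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeHolder {
    // Initializing this class at native-image build time lets the compiler
    // constant-fold UNSAFE instead of re-checking it against null at runtime.
    static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        // Smoke test: round-trip a long through off-heap memory.
        long addr = UNSAFE.allocateMemory(8);
        UNSAFE.putLong(addr, 42L);
        System.out.println(UNSAFE.getLong(addr));
        UNSAFE.freeMemory(addr);
    }
}
```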
Very nice tip @thomaswue; these things are obvious when you know them but I’m mostly oblivious initially.
I’m really curious about the code I’m currently working on; it’s probably not worth it, not faster, but it’ll be novel at least. I’m playing around with work-sharing, which would really help on my laptop; probably won’t make any positive impact on the server with extra cores.
But we haven’t seen any real changes in the algorithms; everybody just splits the input into equal chunks and hopes for the best.