1brc
1brc copied to clipboard
merykitty gave in to Unsafe
Check List:
- [x] Tests pass (
./test.sh <username>
shows no differences between expected and actual outputs) - [x] All formatting changes by the build are committed
- [ ] Your launch script is named
calculate_average_<username>.sh
(make sure to match casing of your GH user name) and is executable - [x] Output matches that of
calculate_average_baseline.sh
- Execution time of
merykittyunsafe
:
Performance counter stats for 'sh calculate_average_merykittyunsafe.sh':
13497.35 msec task-clock:u # 10.750 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
227100 page-faults:u # 16.826 K/sec
60988772290 cycles:u # 4.519 GHz
289560740 stalled-cycles-frontend:u # 0.47% frontend cycles idle
114022556 stalled-cycles-backend:u # 0.19% backend cycles idle
114074745602 instructions:u # 1.87 insn per cycle
# 0.00 stalled cycles per insn
12762459789 branches:u # 945.553 M/sec
53292216 branch-misses:u # 0.42% of all branches
1.255587212 seconds time elapsed
12.568676000 seconds user
0.870567000 seconds sys
- Execution time of
merykitty
:
Performance counter stats for 'sh calculate_average_merykitty.sh':
15482.62 msec task-clock:u # 11.502 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
222547 page-faults:u # 14.374 K/sec
71001812851 cycles:u # 4.586 GHz
362493890 stalled-cycles-frontend:u # 0.51% frontend cycles idle
130886807 stalled-cycles-backend:u # 0.18% backend cycles idle
143777012786 instructions:u # 2.02 insn per cycle
# 0.00 stalled cycles per insn
23320869203 branches:u # 1.506 G/sec
45042342 branch-misses:u # 0.19% of all branches
1.346125882 seconds time elapsed
14.487117000 seconds user
0.916624000 seconds sys
Not sure if this is allowed, I have consciously avoided using Unsafe
, and this is my attempt to truly push the challenge to its limit.
The code consistently outpaces the solutions of Thomas and Roy in C2, but I don't think it will do so after taking into consideration AOT vs JIT and after staring at the assembly for 3 hours I have not come up with any ideas.
after staring at the assembly for 3 hours
I'm starting to feel guilty about all this ;)
@merykitty How much slower does it perform with aot?
@artsiomkorzun It is about 8 times slower with Graal, I do not have a Graal built with hsdis at hand but looking at the code comments:
0x00007fadf73244c8: ; ImmutableOopMap {rax=Oop r10=Oop r11=Oop [16]=Oop [24]=Oop }
;*iinc {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) jdk.incubator.vector.ByteVector::ldLongOp@35 (line 368)
; - jdk.incubator.vector.ByteVector$ByteSpecies::ldLongOp@8 (line 4262)
; - jdk.incubator.vector.ByteVector::lambda$fromMemorySegment0Template$105@8 (line 3811)
; - jdk.incubator.vector.ByteVector$$Lambda/0x00007fad9c05e900::load@10
; - jdk.internal.vm.vector.VectorSupport::load@32 (line 428)
; - jdk.internal.misc.ScopedMemoryAccess::loadFromMemorySegmentScopedInternal@28 (line 361)
; - jdk.internal.misc.ScopedMemoryAccess::loadFromMemorySegment@31 (line 338)
; - jdk.incubator.vector.ByteVector::fromMemorySegment0Template@33 (line 3807)
; - jdk.incubator.vector.Byte256Vector::fromMemorySegment0@3 (line 938)
; - jdk.incubator.vector.ByteVector::fromMemorySegment@31 (line 3297)
; - dev.morling.onebrc.CalculateAverage_merykittyunsafe::iterate@37 (line 256)
I assume that jdk.internal.vm.vector.VectorSupport::load
is not intrinsified properly, some more other vector intrinsics seem to express the same behaviour.
@merykitty Yes, we are aware of this issue with the Vector API. Unfortunately it still changes quite often when in incubator phase and it is difficult for us to keep up with the intrinsifications, which for this kind of low-level compiler API are essential.
It doesn't seems to be a clear improvement on the eval machine for some reason. Here's the results from 10 runs:
Benchmark 1: timeout -v 300 ./calculate_average_merykitty.sh 2>&1
Time (mean ± σ): 3.268 s ± 0.127 s [User: 21.568 s, System: 0.741 s]
Range (min … max): 2.953 s … 3.430 s 10 runs
Summary
merykitty: trimmed mean 3.286730414775, raw times 3.1829055004,2.9527902024,3.3237153764,3.3429483424,3.2571899574,3.4296059994,3.2995724264,3.2947934074000003,3.3001821494000003,3.2925361584000004
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:03.286 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) | |
@gunnarmorling It is a separate entry merykittyunsafe
instead of merykitty
. I don't really want to add Unsafe
to my original submission so can this go as a separate entry? Thanks.
@gunnarmorling It is a separate entry merykittyunsafe instead of merykitty.
Ah, sorry, had missed that. Oh, boy 🤯 :
Benchmark 1: timeout -v 300 ./calculate_average_merykittyunsafe.sh 2>&1
Time (mean ± σ): 2.573 s ± 0.024 s [User: 15.982 s, System: 0.749 s]
Range (min … max): 2.521 s … 2.605 s 10 runs
Summary
merykittyunsafe: trimmed mean 2.575439155045, raw times 2.57165876642,2.57405251042,2.55458195942,2.5576553414200003,2.52133204842,2.5802883164200003,2.59440042542,2.58838683042,2.60452826242,2.58248909042
Leaderboard
| # | Result (m:s.ms) | Implementation | JDK | Submitter | Notes |
|---|-----------------|--------------------|-----|---------------|-----------|
| | 00:02.575 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykittyunsafe.java)| 21.0.1-open | [merykittyunsafe](https://github.com/merykittyunsafe) | |
I don't really want to add Unsafe to my original submission so can this go as a separate entry? Thanks.
Yes, we can do that. While there should be only one entry per participant by default, I am happy to make an exception for this case.
@gunnarmorling Wow that is really impressive, I guess the test machine is much more capable at utilising the reduction in instruction count than mine. Thanks a lot for your help.
@merykitty Really cool vectorized solution!
On my machine utilizing all 32 CPU cores the solution becomes memory-bound, but I think given that the evaluation is done only on 8 cores out of the 64, the reduction in instructions is more important and giving the larger speed-up.
Strangely enough, both merykitty and merykittyunsafe are very slow on the old Hetzner CCX33. 45 and 40 seconds, respectively (on the official dataset). On the same instance, royvanrijn is at 4.7 seconds. I know the new instance doesn't have AVX-512 support, either, so I wonder what should explain this.
@mtopolnik that's why, in some extent I personally prefer SWAR, when applicable, cause it works the same across archs, although it add some register pressure. Peak performance is obvious not comparable...
@mtopolnik Are you running with C2 or with Graal? Because the latter has not caught up with the development of the Vector API yet.
Yes, @mtopolnik the numbers there seem to indicate you were running with Graal JIT with the missing intrinsifications for the incubator Vector API.
Didn't btw know about "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" yet. Is this planned to stay and be the new unsafe ;-) ?
@mtopolnik Are you running with C2 or with Graal? Because the latter has not caught up with the development of the Vector API yet.
Oops, I have a script that calls prepare_author.sh
and then calculate_average_author.sh
. But now I see there's no prepare_merykitty*.sh
.