1brc icon indicating copy to clipboard operation
1brc copied to clipboard

merykitty gave in to Unsafe

Open merykitty opened this issue 1 year ago • 4 comments

Check List:

  • [x] Tests pass (./test.sh <username> shows no differences between expected and actual outputs)
  • [x] All formatting changes by the build are committed
  • [ ] Your launch script is named calculate_average_<username>.sh (make sure to match casing of your GH user name) and is executable
  • [x] Output matches that of calculate_average_baseline.sh
  • Execution time of merykittyunsafe:
 Performance counter stats for 'sh calculate_average_merykittyunsafe.sh':

          13497.35 msec task-clock:u              #   10.750 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
            227100      page-faults:u             #   16.826 K/sec
       60988772290      cycles:u                  #    4.519 GHz
         289560740      stalled-cycles-frontend:u #    0.47% frontend cycles idle
         114022556      stalled-cycles-backend:u  #    0.19% backend cycles idle
      114074745602      instructions:u            #    1.87  insn per cycle
                                                  #    0.00  stalled cycles per insn
       12762459789      branches:u                #  945.553 M/sec
          53292216      branch-misses:u           #    0.42% of all branches

       1.255587212 seconds time elapsed

      12.568676000 seconds user
       0.870567000 seconds sys
  • Execution time of merykitty:
 Performance counter stats for 'sh calculate_average_merykitty.sh':

          15482.62 msec task-clock:u              #   11.502 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
            222547      page-faults:u             #   14.374 K/sec
       71001812851      cycles:u                  #    4.586 GHz
         362493890      stalled-cycles-frontend:u #    0.51% frontend cycles idle
         130886807      stalled-cycles-backend:u  #    0.18% backend cycles idle
      143777012786      instructions:u            #    2.02  insn per cycle
                                                  #    0.00  stalled cycles per insn
       23320869203      branches:u                #    1.506 G/sec
          45042342      branch-misses:u           #    0.19% of all branches

       1.346125882 seconds time elapsed

      14.487117000 seconds user
       0.916624000 seconds sys

Not sure if this is allowed, I have consciously avoided using Unsafe, and this is my attempt to truly push the challenge to its limit.

merykitty avatar Jan 11 '24 21:01 merykitty

The code consistently outpaces the solutions of Thomas and Roy in C2, but I don't think it will do so after taking into consideration AOT vs JIT and after staring at the assembly for 3 hours I have not come up with any ideas.

merykitty avatar Jan 11 '24 21:01 merykitty

after staring at the assembly for 3 hours

I'm starting to feel guilty about all this ;)

gunnarmorling avatar Jan 11 '24 21:01 gunnarmorling

@merykitty How much slower does it perform with aot?

artsiomkorzun avatar Jan 11 '24 21:01 artsiomkorzun

@artsiomkorzun It is about 8 times slower with Graal, I do not have a Graal built with hsdis at hand but looking at the code comments:

  0x00007fadf73244c8: ; ImmutableOopMap {rax=Oop r10=Oop r11=Oop [16]=Oop [24]=Oop }
                      ;*iinc {reexecute=1 rethrow=0 return_oop=0}
                      ; - (reexecute) jdk.incubator.vector.ByteVector::ldLongOp@35 (line 368)
                      ; - jdk.incubator.vector.ByteVector$ByteSpecies::ldLongOp@8 (line 4262)
                      ; - jdk.incubator.vector.ByteVector::lambda$fromMemorySegment0Template$105@8 (line 3811)
                      ; - jdk.incubator.vector.ByteVector$$Lambda/0x00007fad9c05e900::load@10
                      ; - jdk.internal.vm.vector.VectorSupport::load@32 (line 428)
                      ; - jdk.internal.misc.ScopedMemoryAccess::loadFromMemorySegmentScopedInternal@28 (line 361)
                      ; - jdk.internal.misc.ScopedMemoryAccess::loadFromMemorySegment@31 (line 338)
                      ; - jdk.incubator.vector.ByteVector::fromMemorySegment0Template@33 (line 3807)
                      ; - jdk.incubator.vector.Byte256Vector::fromMemorySegment0@3 (line 938)
                      ; - jdk.incubator.vector.ByteVector::fromMemorySegment@31 (line 3297)
                      ; - dev.morling.onebrc.CalculateAverage_merykittyunsafe::iterate@37 (line 256)

I assume that jdk.internal.vm.vector.VectorSupport::load is not intrinsified properly, some more other vector intrinsics seem to express the same behaviour.

merykitty avatar Jan 11 '24 21:01 merykitty

@merykitty Yes, we are aware of this issue with the Vector API. Unfortunately it still changes quite often when in incubator phase and it is difficult for us to keep up with the intrinsifications, which for this kind of low-level compiler API are essential.

thomaswue avatar Jan 12 '24 02:01 thomaswue

It doesn't seems to be a clear improvement on the eval machine for some reason. Here's the results from 10 runs:

Benchmark 1: timeout -v 300 ./calculate_average_merykitty.sh 2>&1
  Time (mean ± σ):      3.268 s ±  0.127 s    [User: 21.568 s, System: 0.741 s]
  Range (min … max):    2.953 s …  3.430 s    10 runs

Summary
  merykitty: trimmed mean 3.286730414775, raw times 3.1829055004,2.9527902024,3.3237153764,3.3429483424,3.2571899574,3.4296059994,3.2995724264,3.2947934074000003,3.3001821494000003,3.2925361584000004

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:03.286 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykitty.java)| 21.0.1-open | [Quan Anh Mai](https://github.com/merykitty) |  |

gunnarmorling avatar Jan 12 '24 09:01 gunnarmorling

@gunnarmorling It is a separate entry merykittyunsafe instead of merykitty. I don't really want to add Unsafe to my original submission so can this go as a separate entry? Thanks.

merykitty avatar Jan 12 '24 10:01 merykitty

@gunnarmorling It is a separate entry merykittyunsafe instead of merykitty.

Ah, sorry, had missed that. Oh, boy 🤯 :

Benchmark 1: timeout -v 300 ./calculate_average_merykittyunsafe.sh 2>&1
  Time (mean ± σ):      2.573 s ±  0.024 s    [User: 15.982 s, System: 0.749 s]
  Range (min … max):    2.521 s …  2.605 s    10 runs

Summary
  merykittyunsafe: trimmed mean 2.575439155045, raw times 2.57165876642,2.57405251042,2.55458195942,2.5576553414200003,2.52133204842,2.5802883164200003,2.59440042542,2.58838683042,2.60452826242,2.58248909042

Leaderboard

| # | Result (m:s.ms) | Implementation     | JDK | Submitter     | Notes     |
|---|-----------------|--------------------|-----|---------------|-----------|
|   | 00:02.575 | [link](https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_merykittyunsafe.java)| 21.0.1-open | [merykittyunsafe](https://github.com/merykittyunsafe) |  |

I don't really want to add Unsafe to my original submission so can this go as a separate entry? Thanks.

Yes, we can do that. While there should be only one entry per participant by default, I am happy to make an exception for this case.

gunnarmorling avatar Jan 12 '24 11:01 gunnarmorling

@gunnarmorling Wow that is really impressive, I guess the test machine is much more capable at utilising the reduction in instruction count than mine. Thanks a lot for your help.

merykitty avatar Jan 12 '24 11:01 merykitty

@merykitty Really cool vectorized solution!

On my machine utilizing all 32 CPU cores the solution becomes memory-bound, but I think given that the evaluation is done only on 8 cores out of the 64, the reduction in instructions is more important and giving the larger speed-up.

thomaswue avatar Jan 12 '24 18:01 thomaswue

Strangely enough, both merykitty and merykittyunsafe are very slow on the old Hetzner CCX33. 45 and 40 seconds, respectively (on the official dataset). On the same instance, royvanrijn is at 4.7 seconds. I know the new instance doesn't have AVX-512 support, either, so I wonder what should explain this.

mtopolnik avatar Jan 12 '24 18:01 mtopolnik

@mtopolnik that's why, in some extent I personally prefer SWAR, when applicable, cause it works the same across archs, although it add some register pressure. Peak performance is obvious not comparable...

franz1981 avatar Jan 12 '24 18:01 franz1981

@mtopolnik Are you running with C2 or with Graal? Because the latter has not caught up with the development of the Vector API yet.

merykitty avatar Jan 12 '24 18:01 merykitty

Yes, @mtopolnik the numbers there seem to indicate you were running with Graal JIT with the missing intrinsifications for the incubator Vector API.

thomaswue avatar Jan 12 '24 19:01 thomaswue

Didn't btw know about "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" yet. Is this planned to stay and be the new unsafe ;-) ?

thomaswue avatar Jan 12 '24 19:01 thomaswue

@mtopolnik Are you running with C2 or with Graal? Because the latter has not caught up with the development of the Vector API yet.

Oops, I have a script that calls prepare_author.sh and then calculate_average_author.sh. But now I see there's no prepare_merykitty*.sh.

mtopolnik avatar Jan 13 '24 08:01 mtopolnik