1brc
1brc copied to clipboard
First attempt from hundredwatt
Check List:
- [x] Tests pass (
./test.sh <username>
shows no differences between expected and actual outputs) - [x] All formatting changes by the build are committed
- [x] Your launch script is named
calculate_average_<username>.sh
(make sure to match casing of your GH user name) and is executable - [x] Output matches that of
calculate_average_baseline.sh
- Execution time:
00:05.318
(8 Core Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz) - Execution time of reference implementation:
3:18.572
Was distracted by making the eval script perfect 😂, so didn't spend as much time on this as I wanted, but excited to join the mix!
Main thing:
- Gunnar said we couldn't use perfect hash tables for city names early on, but no one said anything about using them for temperature values 😉 . With only 1,999 possible values, we can construct a naive perfect hash table (which has terrible construction time, but the best possible lookup time) for temperature values that's only 5KB.
Also:
- Bit twiddling to find separators and mask words
- Read from file only using longs (
MBB.getLong()
) - Reading the file in stripes instead of partitions - implemented this, but didn't analyze yet... planning to play with this more
- Using
21.0.1-graal
because it's all over the top of the leaderboard 😄
Re-ran evaluation on an 8-core cloud VM:
- Execution time: 00:05.318 (8 Core Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz)
- Execution time of reference implementation: 3:18.572
It's getting late here and I'm tired, so sorry for the questions: how/where does it deal with collisions for the name hash, and what is the purpose of that value hash table? Thx for clarifying.
@gunnarmorling
For city names, I use linear probing similar to other entires:
https://github.com/gunnarmorling/1brc/blob/8bb24ec4f631a635f11f91f26c63246e14bb79f9/src/main/java/dev/morling/onebrc/CalculateAverage_hundredwatt.java#L163-L168
For temperature values, I realized there's only 1,999 possible values based on the rules, so I created a lookup table that uses a perfect hash function to map the raw bytes from the file to the corresponding temperature value integer.
I used a naive perfect hash function of the form temperature_long * SEED % slots
with SEED chosen offline via brute force such that there are no collisions amongst the possible temperature values.
Here are some examples of parsing temperature values using the lookup table:
File Contents | mbb.getLong() | Bit Twiddling to remove '\n' and next line characters | perfect hash function to calculate lookup table location Math.abs(temperature_long * seed) % slots |
Read value at hash index from TEMPERATURES array |
---|---|---|---|---|
"-67.8\nHa" | 0x2d36372e380a4861 | 0x2d36372e38000000 | 2415 | -678 |
"31.4\nCha" | 0x33312e340a436861 | 0x33312e3400000000 | 3823 | 314 |
This method parses the temperature value with fewer instructions than other methods and no branching. It does require a memory lookup, but since the lookup table is 10KB, it should always reside in L1 or L2 cache.
Here are the important code snippets:
Compute all possible temperature strings and encode them as long words
https://github.com/gunnarmorling/1brc/blob/8bb24ec4f631a635f11f91f26c63246e14bb79f9/src/main/java/dev/morling/onebrc/CalculateAverage_hundredwatt.java#L44-L75
Construct the lookup table mapping all temperature strings as input to the parsed integer value
https://github.com/gunnarmorling/1brc/blob/8bb24ec4f631a635f11f91f26c63246e14bb79f9/src/main/java/dev/morling/onebrc/CalculateAverage_hundredwatt.java#L102-L109
Read temperature value long word from file, bit twiddle to remove \n
and trailing characters, then use lookup table to find integer value
https://github.com/gunnarmorling/1brc/blob/8bb24ec4f631a635f11f91f26c63246e14bb79f9/src/main/java/dev/morling/onebrc/CalculateAverage_hundredwatt.java#L264-L277
@hundredwatt, thanks a lot for this extensive explanation 🙏! Really, really cool stuff. Comes in at 00:04.418 on the new machine, i.e. very close to the top. Nice one!