1brc
1brc copied to clipboard
1BRC in .NET among fastest on Linux
1️⃣🐝🏎️ The One Billion Row Challenge
.NET implementation of https://github.com/gunnarmorling/1brc
This implementation supports \r\n
line endings. The numbers in the Evolution section below are from Windows, where the input file is 13.7GB vs 12.8GB. It's generated by the upstream Java code on Windows so this is an implicit requirement for x-plat already.
Results
The performance is measured on an idle 6C/12T Alder Lake CPU fixed at 2.5 GHz (no turbo), 32GB DDR4 3200, Debian 12 in LXC. At least 5 runs, showing the best result.
Note: results are very stable, usually only the 2nd decimal changes between the runs.
As of Jan 7, 20:30 UTC
.NET
№ | JIT | AOT | Code | Runtime | Submitter |
---|---|---|---|---|---|
1. | 00:03.971 | 00:03.725 | THIS REPO | linux-x64 | Victor Baybekov |
2. | 00:04.524 | 00:04.668 | link | linux-x64 | Frank A. Krueger |
3. | 00:05.979 | 00:06.657 | link | linux-x64 | Pedro Travi |
4. | 00:08.079 | 00:08.589 | link | linux-x64 | Fabien Barbier |
Java
№ | JIT | AOT | Code | Runtime | Submitter |
---|---|---|---|---|---|
1. | 00:03.108 | ✖️ | link | 21.0.1-graal | Roy van Rijn |
2. | ✖️ | 00:03.558 | link | 21.0.1-graal | Thomas Wuerthinger |
3. | 00:04.128 | ✖️ | link | 21.0.1-open | Quan Anh Mai |
~ | 03:28.764 | ✖️ | link (base) | 21.0.1-open | Gunnar Morling |
Native
№ | AOT | Code | Language | Submitter | Details |
---|---|---|---|---|---|
?. | 00:02.644 | link | C++ | lehuyduc | |
?.* | 00:03.902 | link | C | Danny van Kooten | blog |
* "under 2 seconds on my not-so-fast AMD Ryzen 4800U laptop CPU" sounds cool, but the laptop has 8C/16T with 4.2 GHz frequency. Let's say it's throttled a little and is at 4 GHz effectively. My benchmark machine is 6 x 2.5 = 15 GHz, that CPU is 8 * 4 = 32 GHz equivalent per die, or 2x times faster. The math checks out.
Evolution
Below is the evolution of results with each commit. The time shown here is measured inside the app, on Windows. Not comparable with time
command (startup/shutdown time is not measured).
First attempt
Mmap + paralell using Span API and some unsafe tricks to avoid Utf8 parsing until the very end.
Processed in 00:00:10.6978618
Processed in 00:00:10.8473143
Processed in 00:00:10.9107262
Processed in 00:00:10.9733218
Processed in 00:00:10.5854176
Some micro optimizations
Processed in 00:00:09.7093471
Float parsing is ~57%, dictionary lookup is ~24%. Optimizing further is about those two things. We may use csFastFloat
library and a specialized dictionary such as DictionarySlim
. However the goal is to avoid dependencies even if they are pure .NET.
It's near-perfectly parallelizable though. On 8 cores it should be 33% faster than on 6 that I have. With 32GB RAM the file should be cached by an OS after the first read. The first read may be very slow in the cloud VM, but then the cache should eliminate the difference between drive speeds.
Use naive double parsing
If we can assume that the float values are well formed then the speed almost doubles.
Processed in 00:00:05.5519479
Optimized double parsing with fallback
No assumptions are required if we fallback to the full .NET parsing implementation on any irregularity.
Processed in 00:00:05.2944041
Processed in 00:00:05.3489315
Cache powers of 10, inline summary.init
Processed in 00:00:04.7363095
Processed in 00:00:04.8472097
Processed in 00:00:04.8235814
Processed in 00:00:04.7163938
Microoptimize float parsing, but keep it general purpose
Processed in 00:00:04.4547973
Processed in 00:00:04.5303938
Processed in 00:00:04.5125394
Optimize hash function
See comments in Utf8Span.GetHashCode
Processed in 00:00:04.2237865
Processed in 00:00:04.2524434
Processed in 00:00:04.2688423
Use specification to use int parsing and branchless min/max
Processed in 00:00:03.9916535 Processed in 00:00:03.9897462 Processed in 00:00:03.9810353
Set dictionary capacity
Set dictionary capacity to 10k.
`\n` only
Processed in 00:00:03.2530659
Processed in 00:00:03.1561451
`\r\n`
Processed in 00:00:03.3463769
Processed in 00:00:03.3641962
Processed in 00:00:03.3762491
Manual SIMD: find boundaries and simplify parsing
Use AVX2-optimized IndexOf
.
Find boundaries for ;
, .
and \n
.
Optimize/simplify int parsing between ;
and .
and use a single digit that is always after .
.
Processed in 00:00:03.0687066
Processed in 00:00:03.0819551
Use int16 for min/max in Summary
Struct size is 16 👌
Avoid zero extension, optimize locals & loop
Avoid possible zero extension overhead in ParseInt and Hash, optimize ParseInt locals & loop
Extending short => int
is more expensive than short => ushort => uint => int
if all numbers are positive.
ParseInt loop was not good
Optimize IndexOf
The initial implementation was lazy with more operations than needed
This includes 3 last changes
Processed in 00:00:03.0355594
Processed in 00:00:02.9863322
Processed in 00:00:03.0102503