Optimize lines_crlf
This change is not ready to be merged because it does not yet support x86. I don't know how to do the operation we need with SSE intrinsics (see #18).
This new approach adds a another method to ByteChunk that lets you shift a value across two vectors. This let's us turn a 2-pass counting algorithm into a single pass. Making this change alone resulted in a 40% speedup.
There is also some loop unrolling, but there was not a large difference in unrolling over a factor of 2.
benchmarks
aarch64 benchmarks
lines_crlf::count_breaks/lines_100_LF
time: [6.5590 ns 6.5608 ns 6.5632 ns]
thrpt: [16.035 GiB/s 16.041 GiB/s 16.045 GiB/s]
change:
time: [-29.627% -29.504% -29.396%] (p = 0.00 < 0.05)
thrpt: [+41.634% +41.851% +42.100%]
Performance has improved.
lines_crlf::count_breaks/lines_100_CR
time: [6.5589 ns 6.5658 ns 6.5773 ns]
thrpt: [16.001 GiB/s 16.028 GiB/s 16.045 GiB/s]
change:
time: [-29.667% -29.542% -29.417%] (p = 0.00 < 0.05)
thrpt: [+41.677% +41.928% +42.181%]
Performance has improved.
lines_crlf::count_breaks/lines_100_CRLF
time: [10.238 ns 10.287 ns 10.389 ns]
thrpt: [10.578 GiB/s 10.683 GiB/s 10.734 GiB/s]
change:
time: [-17.826% -17.581% -17.271%] (p = 0.00 < 0.05)
thrpt: [+20.877% +21.332% +21.692%]
Performance has improved.
lines_crlf::count_breaks/lines_1000_LF
time: [55.172 ns 55.218 ns 55.277 ns]
thrpt: [19.039 GiB/s 19.059 GiB/s 19.075 GiB/s]
change:
time: [-58.229% -58.153% -58.074%] (p = 0.00 < 0.05)
thrpt: [+138.51% +138.97% +139.40%]
Performance has improved.
lines_crlf::count_breaks/lines_1000_CR
time: [55.204 ns 55.294 ns 55.397 ns]
thrpt: [18.997 GiB/s 19.033 GiB/s 19.064 GiB/s]
change:
time: [-57.971% -57.894% -57.818%] (p = 0.00 < 0.05)
thrpt: [+137.07% +137.50% +137.93%]
Performance has improved.
lines_crlf::count_breaks/lines_1000_CRLF
time: [58.350 ns 58.377 ns 58.411 ns]
thrpt: [18.814 GiB/s 18.825 GiB/s 18.834 GiB/s]
change:
time: [-57.676% -57.594% -57.517%] (p = 0.00 < 0.05)
thrpt: [+135.39% +135.82% +136.27%]
Performance has improved.
lines_crlf::count_breaks/lines_10000_LF
time: [474.85 ns 475.57 ns 476.47 ns]
thrpt: [22.087 GiB/s 22.129 GiB/s 22.163 GiB/s]
change:
time: [-60.927% -60.850% -60.763%] (p = 0.00 < 0.05)
thrpt: [+154.86% +155.43% +155.93%]
Performance has improved.
lines_crlf::count_breaks/lines_10000_CR
time: [474.14 ns 474.34 ns 474.60 ns]
thrpt: [22.174 GiB/s 22.187 GiB/s 22.196 GiB/s]
change:
time: [-60.732% -60.666% -60.600%] (p = 0.00 < 0.05)
thrpt: [+153.81% +154.23% +154.66%]
Performance has improved.
lines_crlf::count_breaks/lines_10000_CRLF
time: [496.71 ns 496.91 ns 497.17 ns]
thrpt: [22.104 GiB/s 22.116 GiB/s 22.125 GiB/s]
change:
time: [-60.879% -60.813% -60.736%] (p = 0.00 < 0.05)
thrpt: [+154.68% +155.19% +155.62%]
Performance has improved.
lines_crlf::from_byte_idx/lines_100_LF
time: [6.6764 ns 6.6804 ns 6.6849 ns]
thrpt: [15.743 GiB/s 15.754 GiB/s 15.763 GiB/s]
change:
time: [-28.329% -28.193% -28.053%] (p = 0.00 < 0.05)
thrpt: [+38.991% +39.263% +39.527%]
Performance has improved.
lines_crlf::from_byte_idx/lines_100_CR
time: [6.6766 ns 6.6823 ns 6.6894 ns]
thrpt: [15.732 GiB/s 15.749 GiB/s 15.762 GiB/s]
change:
time: [-28.438% -28.289% -28.135%] (p = 0.00 < 0.05)
thrpt: [+39.150% +39.448% +39.739%]
Performance has improved.
lines_crlf::from_byte_idx/lines_100_CRLF
time: [10.398 ns 10.403 ns 10.408 ns]
thrpt: [10.559 GiB/s 10.564 GiB/s 10.569 GiB/s]
change:
time: [-16.452% -16.284% -16.133%] (p = 0.00 < 0.05)
thrpt: [+19.237% +19.451% +19.692%]
Performance has improved.
lines_crlf::from_byte_idx/lines_1000_LF
time: [54.964 ns 55.003 ns 55.054 ns]
thrpt: [19.116 GiB/s 19.133 GiB/s 19.147 GiB/s]
change:
time: [-58.536% -58.452% -58.373%] (p = 0.00 < 0.05)
thrpt: [+140.23% +140.68% +141.17%]
Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CR
time: [54.934 ns 54.948 ns 54.966 ns]
thrpt: [19.146 GiB/s 19.153 GiB/s 19.157 GiB/s]
change:
time: [-58.211% -58.139% -58.062%] (p = 0.00 < 0.05)
thrpt: [+138.45% +138.89% +139.30%]
Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CRLF
time: [58.496 ns 58.511 ns 58.531 ns]
thrpt: [18.776 GiB/s 18.782 GiB/s 18.787 GiB/s]
change:
time: [-57.525% -57.453% -57.373%] (p = 0.00 < 0.05)
thrpt: [+134.59% +135.04% +135.43%]
Performance has improved.
lines_crlf::from_byte_idx/lines_10000_LF
time: [474.25 ns 474.40 ns 474.63 ns]
thrpt: [22.173 GiB/s 22.184 GiB/s 22.191 GiB/s]
change:
time: [-60.981% -60.917% -60.851%] (p = 0.00 < 0.05)
thrpt: [+155.43% +155.86% +156.29%]
Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CR
time: [474.25 ns 474.41 ns 474.67 ns]
thrpt: [22.171 GiB/s 22.183 GiB/s 22.190 GiB/s]
change:
time: [-60.614% -60.546% -60.479%] (p = 0.00 < 0.05)
thrpt: [+153.03% +153.46% +153.90%]
Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CRLF
time: [498.51 ns 499.28 ns 500.21 ns]
thrpt: [21.970 GiB/s 22.011 GiB/s 22.045 GiB/s]
change:
time: [-60.672% -60.594% -60.520%] (p = 0.00 < 0.05)
thrpt: [+153.29% +153.77% +154.27%]
Performance has improved.
lines_crlf::to_byte_idx/lines_100_LF
time: [6.3709 ns 6.3828 ns 6.4011 ns]
thrpt: [16.441 GiB/s 16.488 GiB/s 16.519 GiB/s]
change:
time: [-33.821% -33.678% -33.516%] (p = 0.00 < 0.05)
thrpt: [+50.412% +50.779% +51.106%]
Performance has improved.
lines_crlf::to_byte_idx/lines_100_CR
time: [6.3680 ns 6.3697 ns 6.3720 ns]
thrpt: [16.516 GiB/s 16.522 GiB/s 16.526 GiB/s]
change:
time: [-35.982% -35.870% -35.759%] (p = 0.00 < 0.05)
thrpt: [+55.664% +55.933% +56.205%]
Performance has improved.
lines_crlf::to_byte_idx/lines_100_CRLF
time: [9.9426 ns 9.9496 ns 9.9579 ns]
thrpt: [11.036 GiB/s 11.045 GiB/s 11.053 GiB/s]
change:
time: [-23.824% -23.695% -23.559%] (p = 0.00 < 0.05)
thrpt: [+30.820% +31.054% +31.275%]
Performance has improved.
lines_crlf::to_byte_idx/lines_1000_LF
time: [54.693 ns 54.740 ns 54.800 ns]
thrpt: [19.204 GiB/s 19.225 GiB/s 19.242 GiB/s]
change:
time: [-34.536% -34.394% -34.244%] (p = 0.00 < 0.05)
thrpt: [+52.077% +52.424% +52.755%]
Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CR
time: [54.769 ns 54.804 ns 54.846 ns]
thrpt: [19.188 GiB/s 19.203 GiB/s 19.215 GiB/s]
change:
time: [-36.013% -35.896% -35.787%] (p = 0.00 < 0.05)
thrpt: [+55.731% +55.997% +56.283%]
Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CRLF
time: [58.142 ns 58.164 ns 58.191 ns]
thrpt: [18.885 GiB/s 18.894 GiB/s 18.901 GiB/s]
change:
time: [-35.565% -35.455% -35.343%] (p = 0.00 < 0.05)
thrpt: [+54.662% +54.930% +55.194%]
Performance has improved.
lines_crlf::to_byte_idx/lines_10000_LF
time: [473.48 ns 473.64 ns 473.92 ns]
thrpt: [22.206 GiB/s 22.219 GiB/s 22.227 GiB/s]
change:
time: [-39.204% -39.100% -38.995%] (p = 0.00 < 0.05)
thrpt: [+63.922% +64.203% +64.484%]
Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CR
time: [473.50 ns 473.67 ns 473.95 ns]
thrpt: [22.205 GiB/s 22.218 GiB/s 22.226 GiB/s]
change:
time: [-39.481% -39.384% -39.290%] (p = 0.00 < 0.05)
thrpt: [+64.718% +64.974% +65.238%]
Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CRLF
time: [497.40 ns 497.47 ns 497.55 ns]
thrpt: [22.087 GiB/s 22.091 GiB/s 22.094 GiB/s]
change:
time: [-39.299% -39.184% -39.061%] (p = 0.00 < 0.05)
thrpt: [+64.100% +64.430% +64.743%]
Performance has improved.
Thanks! The general idea seems good (though I haven't reviewed the code closely). I do want to keep the algorithm the same across architectures, however, just to keep the maintenance burden reasonable. So if we can't find a performant way to implement shift_across() on x86_64, we might need to leave this on the table.
I tried a basic approach to define it like this
#[inline(always)]
fn shift_across(&self, n: Self) -> Self {
unsafe {
let n = x86_64::_mm_slli_si128(n, 1);
let upper = core::mem::transmute::<Self, [u8; Self::SIZE]>(*self);
let mut lower = core::mem::transmute::<Self, [u8; Self::SIZE]>(n);
lower[0] = upper[Self::SIZE - 1];
core::mem::transmute::<[u8; Self::SIZE], Self>(lower)
}
}
but that resulted in a major 4x slowdown across all benchmarks. So it is going to need to be a different approach.
I no longer have time to work on this and I won't be submitting any more perf PR's for the immediate future. Feel free to close this if you don't want try and make this approach work.
@cessen I happened to come across an efficient way to do this on x86. I implemented it and it works well. Benchmarks are below. Throughput has increase 20-80% depending on the benchmark with this single pass approach.
x86_64 benchmarks
lines_crlf::count_breaks/lines_100_LF
time: [23.226 ns 23.232 ns 23.239 ns]
thrpt: [4.5286 GiB/s 4.5299 GiB/s 4.5311 GiB/s]
change:
time: [-16.945% -16.888% -16.841%] (p = 0.00 < 0.05)
thrpt: [+20.252% +20.319% +20.401%]
Performance has improved.
lines_crlf::count_breaks/lines_100_CR
time: [23.226 ns 23.231 ns 23.236 ns]
thrpt: [4.5291 GiB/s 4.5301 GiB/s 4.5311 GiB/s]
change:
time: [-16.881% -16.828% -16.744%] (p = 0.00 < 0.05)
thrpt: [+20.111% +20.233% +20.310%]
Performance has improved.
lines_crlf::count_breaks/lines_100_CRLF
time: [30.983 ns 30.990 ns 30.996 ns]
thrpt: [3.5455 GiB/s 3.5462 GiB/s 3.5469 GiB/s]
change:
time: [-9.7291% -9.6790% -9.6373%] (p = 0.00 < 0.05)
thrpt: [+10.665% +10.716% +10.778%]
Performance has improved.
lines_crlf::count_breaks/lines_1000_LF
time: [177.77 ns 177.79 ns 177.81 ns]
thrpt: [5.9187 GiB/s 5.9194 GiB/s 5.9201 GiB/s]
change:
time: [-32.601% -32.577% -32.560%] (p = 0.00 < 0.05)
thrpt: [+48.281% +48.318% +48.370%]
Performance has improved.
lines_crlf::count_breaks/lines_1000_CR
time: [177.83 ns 177.86 ns 177.89 ns]
thrpt: [5.9160 GiB/s 5.9170 GiB/s 5.9179 GiB/s]
change:
time: [-24.170% -24.130% -24.097%] (p = 0.00 < 0.05)
thrpt: [+31.747% +31.805% +31.874%]
Performance has improved.
lines_crlf::count_breaks/lines_1000_CRLF
time: [187.96 ns 187.99 ns 188.02 ns]
thrpt: [5.8450 GiB/s 5.8459 GiB/s 5.8469 GiB/s]
change:
time: [-25.359% -25.333% -25.313%] (p = 0.00 < 0.05)
thrpt: [+33.892% +33.929% +33.974%]
Performance has improved.
lines_crlf::count_breaks/lines_10000_LF
time: [1.6073 µs 1.6073 µs 1.6073 µs]
thrpt: [6.5475 GiB/s 6.5477 GiB/s 6.5478 GiB/s]
change:
time: [-32.045% -31.967% -31.889%] (p = 0.00 < 0.05)
thrpt: [+46.819% +46.987% +47.156%]
Performance has improved.
lines_crlf::count_breaks/lines_10000_CR
time: [1.6073 µs 1.6075 µs 1.6077 µs]
thrpt: [6.5460 GiB/s 6.5469 GiB/s 6.5474 GiB/s]
change:
time: [-22.743% -22.714% -22.677%] (p = 0.00 < 0.05)
thrpt: [+29.327% +29.389% +29.438%]
Performance has improved.
lines_crlf::count_breaks/lines_10000_CRLF
time: [1.6836 µs 1.6837 µs 1.6839 µs]
thrpt: [6.5264 GiB/s 6.5270 GiB/s 6.5274 GiB/s]
change:
time: [-23.364% -23.330% -23.293%] (p = 0.00 < 0.05)
thrpt: [+30.367% +30.430% +30.486%]
Performance has improved.
lines_crlf::from_byte_idx/lines_100_LF
time: [23.461 ns 23.465 ns 23.468 ns]
thrpt: [4.4843 GiB/s 4.4850 GiB/s 4.4856 GiB/s]
change:
time: [-17.056% -17.001% -16.947%] (p = 0.00 < 0.05)
thrpt: [+20.406% +20.484% +20.564%]
Performance has improved.
lines_crlf::from_byte_idx/lines_100_CR
time: [23.460 ns 23.463 ns 23.466 ns]
thrpt: [4.4848 GiB/s 4.4854 GiB/s 4.4860 GiB/s]
change:
time: [-17.053% -16.992% -16.926%] (p = 0.00 < 0.05)
thrpt: [+20.375% +20.470% +20.559%]
Performance has improved.
lines_crlf::from_byte_idx/lines_100_CRLF
time: [31.122 ns 31.128 ns 31.135 ns]
thrpt: [3.5297 GiB/s 3.5305 GiB/s 3.5312 GiB/s]
change:
time: [-9.9060% -9.7928% -9.7091%] (p = 0.00 < 0.05)
thrpt: [+10.753% +10.856% +10.995%]
Performance has improved.
lines_crlf::from_byte_idx/lines_1000_LF
time: [178.06 ns 178.09 ns 178.13 ns]
thrpt: [5.9081 GiB/s 5.9093 GiB/s 5.9104 GiB/s]
change:
time: [-32.661% -32.630% -32.605%] (p = 0.00 < 0.05)
thrpt: [+48.379% +48.433% +48.503%]
Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CR
time: [178.09 ns 178.13 ns 178.16 ns]
thrpt: [5.9069 GiB/s 5.9081 GiB/s 5.9092 GiB/s]
change:
time: [-24.232% -24.208% -24.189%] (p = 0.00 < 0.05)
thrpt: [+31.907% +31.940% +31.981%]
Performance has improved.
lines_crlf::from_byte_idx/lines_1000_CRLF
time: [188.19 ns 188.24 ns 188.30 ns]
thrpt: [5.8363 GiB/s 5.8382 GiB/s 5.8395 GiB/s]
change:
time: [-25.091% -25.054% -24.999%] (p = 0.00 < 0.05)
thrpt: [+33.331% +33.429% +33.496%]
Performance has improved.
lines_crlf::from_byte_idx/lines_10000_LF
time: [1.6065 µs 1.6066 µs 1.6069 µs]
thrpt: [6.5494 GiB/s 6.5503 GiB/s 6.5509 GiB/s]
change:
time: [-32.121% -32.038% -31.949%] (p = 0.00 < 0.05)
thrpt: [+46.948% +47.141% +47.322%]
Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CR
time: [1.6066 µs 1.6068 µs 1.6071 µs]
thrpt: [6.5482 GiB/s 6.5497 GiB/s 6.5506 GiB/s]
change:
time: [-22.507% -22.465% -22.422%] (p = 0.00 < 0.05)
thrpt: [+28.902% +28.974% +29.044%]
Performance has improved.
lines_crlf::from_byte_idx/lines_10000_CRLF
time: [1.6709 µs 1.6710 µs 1.6711 µs]
thrpt: [6.5761 GiB/s 6.5767 GiB/s 6.5770 GiB/s]
change:
time: [-24.019% -23.972% -23.934%] (p = 0.00 < 0.05)
thrpt: [+31.464% +31.530% +31.611%]
Performance has improved.
lines_crlf::to_byte_idx/lines_100_LF
time: [27.695 ns 27.703 ns 27.712 ns]
thrpt: [3.7977 GiB/s 3.7989 GiB/s 3.8000 GiB/s]
change:
time: [-22.503% -22.464% -22.430%] (p = 0.00 < 0.05)
thrpt: [+28.915% +28.972% +29.038%]
Performance has improved.
lines_crlf::to_byte_idx/lines_100_CR
time: [27.703 ns 27.714 ns 27.728 ns]
thrpt: [3.7954 GiB/s 3.7973 GiB/s 3.7988 GiB/s]
change:
time: [-24.568% -24.497% -24.424%] (p = 0.00 < 0.05)
thrpt: [+32.316% +32.445% +32.570%]
Performance has improved.
lines_crlf::to_byte_idx/lines_100_CRLF
time: [38.603 ns 38.621 ns 38.639 ns]
thrpt: [2.8442 GiB/s 2.8455 GiB/s 2.8468 GiB/s]
change:
time: [-28.157% -28.091% -28.024%] (p = 0.00 < 0.05)
thrpt: [+38.934% +39.065% +39.192%]
Performance has improved.
lines_crlf::to_byte_idx/lines_1000_LF
time: [220.19 ns 220.27 ns 220.37 ns]
thrpt: [4.7757 GiB/s 4.7777 GiB/s 4.7794 GiB/s]
change:
time: [-37.749% -37.677% -37.608%] (p = 0.00 < 0.05)
thrpt: [+60.276% +60.454% +60.640%]
Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CR
time: [220.16 ns 220.21 ns 220.26 ns]
thrpt: [4.7779 GiB/s 4.7790 GiB/s 4.7800 GiB/s]
change:
time: [-41.670% -41.648% -41.626%] (p = 0.00 < 0.05)
thrpt: [+71.310% +71.375% +71.437%]
Performance has improved.
lines_crlf::to_byte_idx/lines_1000_CRLF
time: [231.20 ns 231.32 ns 231.46 ns]
thrpt: [4.7480 GiB/s 4.7507 GiB/s 4.7534 GiB/s]
change:
time: [-39.793% -39.715% -39.640%] (p = 0.00 < 0.05)
thrpt: [+65.672% +65.877% +66.094%]
Performance has improved.
lines_crlf::to_byte_idx/lines_10000_LF
time: [1.9426 µs 1.9427 µs 1.9427 µs]
thrpt: [5.4171 GiB/s 5.4173 GiB/s 5.4174 GiB/s]
change:
time: [-38.454% -38.407% -38.375%] (p = 0.00 < 0.05)
thrpt: [+62.272% +62.357% +62.480%]
Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CR
time: [1.9427 µs 1.9428 µs 1.9430 µs]
thrpt: [5.4164 GiB/s 5.4169 GiB/s 5.4172 GiB/s]
change:
time: [-44.405% -44.392% -44.372%] (p = 0.00 < 0.05)
thrpt: [+79.766% +79.829% +79.872%]
Performance has improved.
lines_crlf::to_byte_idx/lines_10000_CRLF
time: [2.0376 µs 2.0381 µs 2.0386 µs]
thrpt: [5.3907 GiB/s 5.3922 GiB/s 5.3933 GiB/s]
change:
time: [-39.446% -39.427% -39.385%] (p = 0.00 < 0.05)
thrpt: [+64.977% +65.091% +65.141%]
Performance has improved.
Awesome, thanks! I'm a little busy with other things right now, but I'll try to get this reviewed and merged soon.
I'd like to see the numbers when processing just one chunk at a time, to make a determination on whether the added code complexity is worth it.
After writing that, I realized I could at least test on x86_64 to see what the impact is there. So to answer my own question: on x86_64 the impact is roughly a 20% performance hit across the board. And that does seem pretty significant. So let's leave the loop-unrolling in.
I have applied your feedback.
Awesome, thanks!
The CI failure is unrelated to this PR, and is rather because a transitive dev-dependency used in test running requires > Rust 1.65 (which we still support and therefore have in the CI). I'll need to address that separately. This PR is good to land.
Also just cut a release (v0.4.4) with the new optimized code.
Thanks a bunch for putting in the time to figure this out and implement it!