expression: optimize branching in arithmetic operations
Currently, we have many branches when fetching items from an array:
https://github.com/risinglightdb/risinglight/blob/7b5bc7a1a258afd64a483990ca95f4c799482791/src/array/iterator.rs#L30-L36
https://github.com/risinglightdb/risinglight/blob/7b5bc7a1a258afd64a483990ca95f4c799482791/src/array/primitive_array.rs#L45-L47
This hurts the performance of arithmetic operations. Most arithmetic operations can safely be performed even on null items (currently stored as 0), so we can run the operation over all items first and apply the null bitmap afterwards.
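For illustration, here is roughly what the branchy per-item path looks like, on a simplified `Option`-based layout (a hypothetical sketch, not the actual iterator code linked above):

```rust
/// Branchy multiply: validity is checked while fetching each element, so
/// the hot loop contains a data-dependent branch that blocks
/// auto-vectorization.
fn mul_branchy(lhs: &[Option<i32>], rhs: &[Option<i32>]) -> Vec<Option<i32>> {
    lhs.iter()
        .zip(rhs)
        .map(|(a, b)| match (a, b) {
            (Some(a), Some(b)) => Some(a * b),
            _ => None,
        })
        .collect()
}
```

The proposal removes that branch from the hot loop: compute `a * b` for every slot unconditionally (null slots hold 0, so the wasted work is harmless) and merge the null bitmaps in a separate pass.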
From the discussion in Slack: https://risingwave-community.slack.com/archives/C02UZDEE4AC/p1645885910667749
```
array arrow mul/4096      time: [4.2330 ms 4.2809 ms 4.3336 ms]
array arrow mul/65536     time: [4.2511 ms 4.3411 ms 4.4390 ms]
array arrow mul/1048576   time: [4.4309 ms 4.5505 ms 4.6773 ms]
array mul simd_op/4096    time: [7.0365 ms 7.1359 ms 7.2627 ms]
array mul simd_op/65536   time: [7.0968 ms 7.1104 ms 7.1234 ms]
array mul simd_op/1048576 time: [7.1510 ms 7.1582 ms 7.1662 ms]
```
Our implementation is far worse than arrow, which doesn't even use explicit SIMD :(
Is there a link to the benchmark code? It looks like the overhead is so large that the number of operations barely affects the total timing, which is abnormal.
https://github.com/risinglightdb/risinglight/blob/main/benches/array.rs
This is our microbenchmark code. The arrow numbers come from the arrow2 repo.
I tried the benchmark below.
```rust
use criterion::Criterion;
use risinglight::array::I32Array;
use risinglight::executor::evaluator;

fn bench_mul(c: &mut Criterion) {
    let a1: I32Array = (0..4096).collect();
    let a2: I32Array = (0..4096).collect();
    let p1: Vec<i32> = (0..4096).collect();
    let p2: Vec<i32> = (0..4096).collect();
    // Scalar path: validity is checked for every item.
    c.bench_function("array_mul", |b| b.iter(|| {
        let _: I32Array = evaluator::binary_op(&a1, &a2, |a, b| a * b);
    }));
    // SIMD path: processes 32 lanes per batch.
    c.bench_function("array_mul_simd", |b| b.iter(|| {
        let _: I32Array = evaluator::simd_op::<_, _, _, 32>(&a1, &a2, |a, b| a * b);
    }));
    // Plain Vec baseline: no validity bitmap at all.
    c.bench_function("array_mul_vec", |b| b.iter(|| {
        let mut res = vec![0; 4096];
        for ((a, b), c) in p1.iter().zip(p2.iter()).zip(res.iter_mut()) {
            *c = *a * *b;
        }
    }));
}
```
The result:
```
array_mul      time: [33.171 us 33.526 us 34.001 us]
array_mul_simd time: [16.958 us 17.224 us 17.523 us]
array_mul_vec  time: [1.8846 us 1.9008 us 1.9197 us]
```
It's intuitive that the Vec version is fastest, because it performs no validity checks and LLVM auto-vectorizes the multiplication over the slices. It's odd that the SIMD version is only about twice as fast as the non-SIMD version.
My opinion is that the design of `simd_op` is not very performant:

- `BatchIter` returns an owned `BatchItem` with data and a validity bitmap, so two memcpy operations are needed to produce each `BatchItem`. This could be avoided by keeping only a view (reference) into the input.
- It couples the validity-bitmap merge and the binary operation in a single loop. The bitmap and the data live in separate memory areas, so cache misses occur when operating on them back and forth. Instead, we should split the work into two loops: one that only merges the validity bitmaps, and another that only performs the binary operation (see the sketch below). This is actually what arrow and TiDB do.
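A minimal sketch of that two-loop split, assuming a simplified array layout (plain `data`/`valid` vectors rather than the real risinglight types; all names here are illustrative):

```rust
struct SimpleArray {
    data: Vec<i32>,   // null slots hold 0
    valid: Vec<bool>, // stand-in for the packed validity bitmap
}

fn mul_two_loops(lhs: &SimpleArray, rhs: &SimpleArray) -> SimpleArray {
    // Loop 1: merge the validity bitmaps. Streams over bitmap memory only.
    let valid: Vec<bool> = lhs
        .valid
        .iter()
        .zip(&rhs.valid)
        .map(|(a, b)| *a && *b)
        .collect();
    // Loop 2: multiply every slot, nulls included (they hold 0, and the
    // merged bitmap marks their results as invalid anyway). Streams over
    // data memory only; the branch-free body lets LLVM auto-vectorize it.
    let data: Vec<i32> = lhs
        .data
        .iter()
        .zip(&rhs.data)
        .map(|(a, b)| a * b)
        .collect();
    SimpleArray { data, valid }
}
```

Each loop streams over one contiguous region, so the CPU never alternates between bitmap and data memory within an iteration, avoiding the cache misses of the fused loop.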
Resolved by #554 and the subsequent #741.