
expression: optimize branching in arithmetic operations

Open skyzh opened this issue 3 years ago • 4 comments

Currently, we have many branches when fetching items from an array...

https://github.com/risinglightdb/risinglight/blob/7b5bc7a1a258afd64a483990ca95f4c799482791/src/array/iterator.rs#L30-L36

https://github.com/risinglightdb/risinglight/blob/7b5bc7a1a258afd64a483990ca95f4c799482791/src/array/primitive_array.rs#L45-L47

This hurts the performance of arithmetic operations. Most arithmetic operations can safely be applied to null items (currently stored as 0), so we can run the operation over all items first and apply the null bitmap afterward.
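A minimal sketch of the idea (the `I32ArraySketch` type, its `data`/`valid` fields, and `mul_branchless` are hypothetical stand-ins, not risinglight's actual layout): run the operation over every slot, nulls included, then merge the validity in a second pass.

struct I32ArraySketch {
    data: Vec<i32>,   // null slots hold 0, so arithmetic on them is harmless
    valid: Vec<bool>, // stand-in for a real packed null bitmap
}

fn mul_branchless(a: &I32ArraySketch, b: &I32ArraySketch) -> I32ArraySketch {
    // Pass 1: multiply every slot with no per-item branch, so LLVM can
    // auto-vectorize the loop. wrapping_mul avoids debug-mode overflow
    // panics on values in null slots.
    let data: Vec<i32> = a.data.iter().zip(&b.data)
        .map(|(x, y)| x.wrapping_mul(*y))
        .collect();
    // Pass 2: a result slot is valid only if both input slots are valid.
    let valid: Vec<bool> = a.valid.iter().zip(&b.valid)
        .map(|(x, y)| *x && *y)
        .collect();
    I32ArraySketch { data, valid }
}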

skyzh avatar Feb 27 '22 06:02 skyzh

From the discussion in Slack... https://risingwave-community.slack.com/archives/C02UZDEE4AC/p1645885910667749

array arrow mul/4096       time:   [4.2330 ms 4.2809 ms 4.3336 ms]
array arrow mul/65536      time:   [4.2511 ms 4.3411 ms 4.4390 ms]
array arrow mul/1048576    time:   [4.4309 ms 4.5505 ms 4.6773 ms]
array mul simd_op/4096     time:   [7.0365 ms 7.1359 ms 7.2627 ms]
array mul simd_op/65536    time:   [7.0968 ms 7.1104 ms 7.1234 ms]
array mul simd_op/1048576  time:   [7.1510 ms 7.1582 ms 7.1662 ms]

Our implementation is far worse than arrow running without SIMD :(

skyzh avatar Feb 27 '22 06:02 skyzh

Is there a link to the benchmark code? It looks like the overhead is so large that the number of operations barely affects the total timing, which is abnormal.

jiangzhe avatar Mar 04 '22 13:03 jiangzhe

https://github.com/risinglightdb/risinglight/blob/main/benches/array.rs

This is our micro benchmark code. The arrow benchmarks are in the arrow2 repo.

skyzh avatar Mar 04 '22 14:03 skyzh

I tried the benchmark below.

use criterion::Criterion;
use risinglight::array::I32Array; // import path assumed from the repo layout
use risinglight::executor::evaluator;

// `c: &mut Criterion` is provided by the enclosing criterion benchmark fn.
let a1: I32Array = (0..4096).collect();
let a2: I32Array = (0..4096).collect();

let p1: Vec<i32> = (0..4096).collect();
let p2: Vec<i32> = (0..4096).collect();

// Scalar path: per-item binary_op with validity handling.
c.bench_function("array_mul", |b| b.iter(|| {
    let _: I32Array = evaluator::binary_op(&a1, &a2, |a, b| a * b);
}));
// SIMD path: simd_op with 32 lanes per batch.
c.bench_function("array_mul_simd", |b| b.iter(|| {
    let _: I32Array = evaluator::simd_op::<_, _, _, 32>(&a1, &a2, |a, b| a * b);
}));
// Baseline: plain Vec<i32> multiplication with no validity handling.
c.bench_function("array_mul_vec", |b| b.iter(|| {
    let mut res = vec![0; 4096];
    for ((a, b), c) in p1.iter().zip(p2.iter()).zip(res.iter_mut()) {
        *c = *a * *b;
    }
}));

The result:

array_mul               time:   [33.171 us 33.526 us 34.001 us]
array_mul_simd          time:   [16.958 us 17.224 us 17.523 us]
array_mul_vec           time:   [1.8846 us 1.9008 us 1.9197 us]

It's intuitive that the Vec version is the fastest, because it performs no validity checks and LLVM auto-vectorizes the multiplication over the slices. It's odd that the SIMD version is only about twice as fast as the non-SIMD version. My opinion is that the design of simd_op is not very performant:

  1. BatchIter returns an owned BatchItem containing data and a validity bitmap, which costs two memcpy operations per BatchItem. This could be avoided by returning a view (reference) into the input instead.
  2. It couples the validity-bitmap merge and the binary operation in a single loop. The bitmap and the data live in separate memory regions, so jumping back and forth between them causes cache misses. Instead, we should split the work into two loops: one that only merges the validity bitmaps, and another that only performs the binary operation. This is what arrow and TiDB actually do; see the sketch below.
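A sketch of that two-loop layout, assuming validity is stored as a packed bitmap (one bit per slot, in u64 words); the function names are made up for illustration:

fn merge_validity(a: &[u64], b: &[u64]) -> Vec<u64> {
    // Loop 1: touch only the bitmaps. One AND handles 64 slots at a time,
    // and the loop vectorizes trivially.
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}

fn mul_data(a: &[i32], b: &[i32]) -> Vec<i32> {
    // Loop 2: touch only the data. No validity reads here, so there is no
    // cache ping-pong between the bitmap and data regions, and LLVM
    // auto-vectorizes the multiplication.
    a.iter().zip(b).map(|(x, y)| x.wrapping_mul(*y)).collect()
}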

jiangzhe avatar Mar 05 '22 01:03 jiangzhe

Resolved by #554 and the subsequent #741.

wangrunji0408 avatar Jan 05 '23 05:01 wangrunji0408