
ARROW-18185: [C++][Compute] Support KEEP_NULL option for compute::Filter

Open js8544 opened this issue 2 years ago • 3 comments

The current Filter implementation always drops the filtered values. In some use cases, it's desirable for the output array to have the same size as the input array, so I added a new option, FilterOptions::KEEP_NULL, where the filtered values are kept as nulls.

For example, with input [1, 2, 3] and filter [true, false, true], the current implementation outputs [1, 3], and with the new option it outputs [1, null, 3].
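To make the two behaviors concrete, here is a minimal standalone sketch of the semantics described above. It does not use the Arrow API: `FilteredValueBehavior`, `NullableInts` and `FilterSketch` are hypothetical names chosen for illustration, and nulls are modeled with `std::optional` (C++17).

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for the proposed option; not the Arrow API.
enum class FilteredValueBehavior { DROP, KEEP_NULL };

// A "nullable int array" modeled as a vector of optionals.
using NullableInts = std::vector<std::optional<int>>;

// Element-wise model of the two behaviors:
//   DROP      -> unselected slots are removed from the output
//   KEEP_NULL -> unselected slots stay in place as nulls
NullableInts FilterSketch(const std::vector<int>& values,
                          const std::vector<bool>& mask,
                          FilteredValueBehavior behavior) {
  assert(values.size() == mask.size());
  NullableInts out;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (mask[i]) {
      out.emplace_back(values[i]);
    } else if (behavior == FilteredValueBehavior::KEEP_NULL) {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

Calling `FilterSketch({1, 2, 3}, {true, false, true}, FilteredValueBehavior::DROP)` yields two elements (1 and 3), while the same call with `KEEP_NULL` yields three elements with an empty optional in the middle.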

This option is simpler to implement, since we only need to construct a new validity bitmap and can reuse the input buffers and child arrays. The exception is dense union arrays, which don't have validity bitmaps.
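The validity-bitmap idea can be sketched roughly as follows. This is a simplified model, not the actual patch: `SketchArray`, `GetBit` and `FilterKeepNull` are hypothetical names, the array is assumed to always carry a validity bitmap, and the filter is assumed to be a null-free, bit-packed boolean mask of the same length (least-significant bit first).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified array: a shared data buffer plus a bit-packed
// validity bitmap (bit i set means slot i is non-null).
struct SketchArray {
  std::shared_ptr<const std::vector<std::int32_t>> data;
  std::vector<std::uint8_t> validity;
  std::size_t length;
};

// Reads bit i from an LSB-first packed bitmap.
inline bool GetBit(const std::vector<std::uint8_t>& bits, std::size_t i) {
  return ((bits[i / 8] >> (i % 8)) & 1) != 0;
}

// KEEP_NULL-style filter: the output validity is the input validity ANDed
// with the filter mask. The data buffer is shared with the input, not copied.
SketchArray FilterKeepNull(const SketchArray& input,
                           const std::vector<std::uint8_t>& mask) {
  assert(input.validity.size() == mask.size());
  SketchArray out{input.data, std::vector<std::uint8_t>(mask.size()),
                  input.length};
  for (std::size_t byte = 0; byte < mask.size(); ++byte) {
    out.validity[byte] =
        static_cast<std::uint8_t>(input.validity[byte] & mask[byte]);
  }
  return out;
}
```

With data [1, 2, 3], an all-valid bitmap (0b111) and the mask 0b101, the output keeps length 3, points at the same data buffer, and has validity 0b101, which reads as [1, null, 3]. No values are moved, which is consistent with the performance argument, though the real implementation has to handle more cases (offsets, nested types, nulls in the filter).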

According to the benchmark results, filtering with FilterOptions::KEEP_NULL is also faster in most cases, so users can choose this option for better performance when dropping filtered values is not required.

js8544 avatar Oct 28 '22 03:10 js8544

Benchmark result on my machine: https://gist.github.com/js8544/7a1a1e798e41b42f51ccb4112bd2a2c2 Benchmarks whose names end in 2 (every third benchmark) use filtered_value_behavior = KEEP_NULL. This option is faster than the other two in most cases. The exceptions are cases where the input data is very large and the selection percentage is extremely small, so that it's cheaper to copy over the selected values (the select%=1 FilterRecordBatch cases).

js8544 avatar Oct 28 '22 04:10 js8544

https://issues.apache.org/jira/browse/ARROW-18185

github-actions[bot] avatar Oct 28 '22 04:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Oct 28 '22 04:10 github-actions[bot]

CI failures are unrelated

js8544 avatar Oct 31 '22 03:10 js8544