Geoffrey Claude comments

Results: 10 comments by Geoffrey Claude

add specialized InList implementations for common scalar types

General comment on the benchmark but... am I reading them wrong, or is the `null_percent` input logic inverted?

```rust
fn do_benches(
    c: &mut Criterion,
    array_length: usize,
    in_list_length: usize,
    null_percent: f64,
    ...
```

add specialized InList implementations for common scalar types

Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists. @adriangb have you considered sorting the `InList` and doing a...

add specialized InList implementations for common scalar types

> > Another general comment, on the implementation this time: hashing seems overkill and probably overly expensive for small simple type lists. > > @adriangb have you considered sorting the...

add specialized InList implementations for common scalar types

> @Dandandan is already on the list https://github.com/alamb/datafusion-benchmarking/blob/4fb120785fa66ecbf40a45d8a5d0d5f4be17266a/scripts/scrape_comments.py#L41
>
> I can add @geoffreyclaude if he would like

@alamb yes please :) Especially for when/if I follow through with the...

[Docs] Library User Guide page for adding your own custom SQL syntax

take

[Docs] Library User Guide page for adding your own custom SQL syntax

@alamb I opened https://github.com/apache/datafusion/pull/19265 to close this issue. I kept it relatively concise, in line with the other docs. Let me know if you think it needs to be more...

Further improve performance of IN list evaluation

> What about `slice::contains`? Seems like it should be somewhere between the const-sized approach and binary search in terms of threshold window.

It loses all the time against the branchless...
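
For readers following along, the three candidates named in this exchange could be sketched roughly as below for an `i32` list. This is only an illustration and not the DataFusion implementation: the function names are hypothetical, and whether the last variant actually compiles to branch-free or vectorized code depends on the compiler and target.

```rust
/// Linear scan via the standard library; may return early on a match.
fn contains_linear(list: &[i32], needle: i32) -> bool {
    list.contains(&needle)
}

/// Binary search; `sorted` must already be sorted in ascending order.
fn contains_binary(sorted: &[i32], needle: i32) -> bool {
    sorted.binary_search(&needle).is_ok()
}

/// Const-sized scan with no early exit: every element is compared and the
/// results are OR-ed together, which leaves the compiler free to unroll it.
fn contains_branchless<const N: usize>(list: &[i32; N], needle: i32) -> bool {
    list.iter().fold(false, |acc, &v| acc | (v == needle))
}
```

All three return the same answer for the same (sorted) input; they differ only in how much work they do per lookup, which is what the threshold discussion is about.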

Further improve performance of IN list evaluation

@Dandandan See https://github.com/geoffreyclaude/datafusion/pull/14 for an in-depth micro benchmark and analysis of the different search algorithms. TL;DR: It's always branchless up to the SIMD limit, then hashset.
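
A minimal sketch of what "branchless up to a limit, then hashset" could look like for `i32` values. The type name `InListLookup` and the cut-over constant `BRANCHLESS_LIMIT` are hypothetical; the actual limit is not stated here and would depend on the element type and target.

```rust
use std::collections::HashSet;

/// Hypothetical cut-over point, chosen only for illustration.
const BRANCHLESS_LIMIT: usize = 16;

/// Membership structure picked once, when the IN list is built.
enum InListLookup {
    Small(Vec<i32>),
    Large(HashSet<i32>),
}

impl InListLookup {
    fn new(values: &[i32]) -> Self {
        if values.len() <= BRANCHLESS_LIMIT {
            InListLookup::Small(values.to_vec())
        } else {
            InListLookup::Large(values.iter().copied().collect())
        }
    }

    fn contains(&self, needle: i32) -> bool {
        match self {
            // Compare against every element with no early exit.
            InListLookup::Small(list) => {
                list.iter().fold(false, |acc, &v| acc | (v == needle))
            }
            // Fall back to hashing for longer lists.
            InListLookup::Large(set) => set.contains(&needle),
        }
    }
}
```

The strategy is chosen once per list rather than per probe, so the per-row cost is a single `match` plus the chosen lookup.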

Further improve performance of IN list evaluation

I've opened https://github.com/apache/datafusion/pull/19376 as a preliminary PR to extend the benchmarks.

Add Local Scripts to Reproduce Full CI and Perform Auto-Fixes

`./auto-fix.sh` would definitely be super useful! I'd wire it up to be a `git commit` hook.