GH-46777: [C++] Use SimplifyIsIn only when the value_set of the expression is lower than a threshold
Rationale for this change
Applying SimplifyIsIn to an expression with a large value_set incurs a substantial performance penalty.
What changes are included in this PR?
Skip the simplification when the expression's value_set is larger than a threshold (50).
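As a rough illustration of the guard described above, here is a minimal, self-contained sketch. It is not the code in this PR and does not use Arrow's API: `IsInExpr`, `Range`, `ShouldSimplifyIsIn`, `MaybeSimplifyIsIn` and `kIsInSimplificationThreshold` are hypothetical names, the "simplification" is a toy filter against a range guarantee, and whether the real comparison is strict or inclusive at exactly 50 elements is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cutoff mirroring the value (50) mentioned in this PR.
constexpr std::size_t kIsInSimplificationThreshold = 50;

// Hypothetical stand-in for an is_in expression: only its value_set.
struct IsInExpr {
  std::vector<std::int64_t> value_set;
};

// Hypothetical inclusive range guarantee on the column being filtered.
struct Range {
  std::int64_t min;
  std::int64_t max;
};

// Returns true when the value_set is small enough for the simplification
// to be attempted. Assumption: a value_set of exactly 50 still qualifies.
bool ShouldSimplifyIsIn(const IsInExpr& expr) {
  return expr.value_set.size() <= kIsInSimplificationThreshold;
}

// Toy simplification: drop values that cannot match under the guarantee.
// Large value_sets are returned unchanged, so the per-value work is skipped.
IsInExpr MaybeSimplifyIsIn(const IsInExpr& expr, const Range& guarantee) {
  if (!ShouldSimplifyIsIn(expr)) {
    return expr;
  }
  IsInExpr simplified;
  for (std::int64_t v : expr.value_set) {
    if (v >= guarantee.min && v <= guarantee.max) {
      simplified.value_set.push_back(v);
    }
  }
  return simplified;
}
```

In this sketch the check runs before any per-value work, so an expression with a large value_set pays only for a size comparison and is otherwise left untouched.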
Are these changes tested?
I verified locally that the reproducer's runtime returns to pre-change levels.
$ python read.py
=== PYARROW VERSION 20 ===
Retrieved 10,000,000 rows in 3.08 seconds.
Are there any user-facing changes?
No
- GitHub Issue: #46777
:warning: GitHub issue #46777 has been automatically assigned in GitHub to PR creator.
Do we have any benchmarks for expression simplification already? Otherwise, we shouldn't bother adding any.
It would be nice to have this in 21.0. Do you want to update this PR, @raulcd?
Sure, I'm working on it at the moment and will try to push soon.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0b34e6bed40d48ae44a137afd196af94d9117e3b.
There were 9 benchmark results indicating a performance regression:
- Commit Run on test-mac-arm at 2025-07-07 11:41:06Z
  - and 7 more (see the report linked below)
The full Conbench report has more details. It also includes information about 53 possible false positives for unstable benchmarks that are known to sometimes produce them.