arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-46777: [C++] Use SimplifyIsIn only when the value_set of the expression is lower than a threshold

Open raulcd opened this issue 6 months ago • 2 comments

Rationale for this change

Using SimplifyIsIn when the value set is large has a substantial performance penalty.

What changes are included in this PR?

Ensure we do not use the simplification when the value_set on the expression is higher than a threshold (50).

Are these changes tested?

I've tested locally that the reproducer goes back to pre change levels.

$ python read.py 
=== PYARROW VERSION 20 ===
Retrieved 10,000,000 rows in 3.08 seconds.

Are there any user-facing changes?

No

  • GitHub Issue: #46777

raulcd avatar Jun 19 '25 09:06 raulcd

:warning: GitHub issue #46777 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar Jun 19 '25 09:06 github-actions[bot]

Do we have any benchmarks for expression simplification already? Otherwise, we shouldn't bother adding any.

pitrou avatar Jun 19 '25 15:06 pitrou

It would be nice to have this in 21.0. Do you want to update this PR @raulcd ?

pitrou avatar Jul 02 '25 08:07 pitrou

Sure, I am working on it at the moment, will try to push soon

raulcd avatar Jul 02 '25 11:07 raulcd

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0b34e6bed40d48ae44a137afd196af94d9117e3b.

There were 9 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 53 possible false positives for unstable benchmarks that are known to sometimes produce them.