arrow ARROW-15538: [C++] Expanding coverage of math functions from Substrait to Acero

This PR adds the following functionality to Acero/Substrait consumer

Support for logarithmic functions.
Support for power, sqrt, exp, abs, sign.
Implements exp kernel.

Oct 17 '22 05:10 drabastomek

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Oct 17 '22 05:10 github-actions[bot]

General note that applies more broadly than just this PR: I don't think DecodeOptionlessOverflowableArithmetic is suitable for these functions, nor for division or for any of the floating-point versions of the functions it's already used for. The problem is that these functions all have different sets of options (integer overflow, floating point rounding, and/or domain error handling), while DecodeOptionlessOverflowableArithmetic only seems to handle the integer overflow option. In addition, it expects <name>_checked kernels to exist in addition to the normal kernels to handle the "throw an error on integer overflow" option. Also, Substrait doesn't even define functions that are generally only applied to floating point numbers (like exp) for integers at all. That's not necessarily a problem for consumption, though.

That being said, the way options are passed in Substrait is currently being revised, so I don't know how much value there is in fixing this short-term/maybe it's fine for now as a placeholder. I'm not at all up-to-date on planning for Acero-Substrait.

Oct 17 '22 13:10 jvanstraten

https://issues.apache.org/jira/browse/ARROW-15538

Oct 17 '22 17:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

Oct 17 '22 17:10 github-actions[bot]

Thanks @jvanstraten. You're right about the exp not being defined for ints - I changed the calls and added _checked method too.

I agree that currently all the _checked methods only check for int overflows and it might be something we might want to address at some point. Not sure if this should be a part of this PR tho.

Oct 17 '22 19:10 drabastomek

I don't think DecodeOptionlessOverflowableArithmetic is suitable for these functions, nor for division or for any of the floating-point versions of the functions it's already used for.

At the moment we should focus on exposing the capability that we have, and describing it accurately. For example, in Arrow we have "_checked" but "what exactly is checked?" For example, in Arrow we have divide and divide_checked. They seem to behave as follows (through trial and error):

Method	Data Type	Problem	Behavior
divide	integral	overflow	silent
divide	integral	divide by zero	error
divide	integral	domain error	error
divide	floating	divide by zero	limit
divide	floating	domain error	nan
divide_checked	integral	overflow	error
divide_checked	integral	divide by zero	error
divide_checked	integral	domain error	error
divide_checked	floating	divide by zero	error
divide_checked	floating	domain error	error

Thus we should have the following behavior for substrait calls

Call	Data Type	Options	Arrow Result
divide	integral	overflow=SILENT	divide
divide	integral	overflow=SATURATE	reject plan
divide	integral	overflow=ERROR	divide_checked
divide	floating	rounding=*	reject plan[1]
divide	floating	on_domain_error=NAN	divide
divide	floating	on_domain_error=ERROR	divide_checked
divide	floating	on_division_by_zero=LIMIT	divide
divide	floating	on_division_by_zero=NAN	reject plan
divide	floating	on_division_by_zero=ERROR	divide_checked
divide	floating	on_domain_error=NAN on_division_by_zero=ERROR	reject plan[2]
divide	floating	on_domain_error=ERROR on_division_by_zero=LIMIT	reject plan[2]

I'm not sure what rounding behavior we have implemented in Acero and therefore the safest thing to do would be to reject all plans that are asking for a specific rounding behavior.
We cannot satisfy these specific combinations between divide and divide_checked.

Caveat / Needs clarification:

IMHO, It is not clear, in Substrait, if division by zero for integral arguments counts as overflow or not. Also, it is possible to have 0/0 in integer arithmetic, which I have been interpreting as "domain error" (technically division by zero requires non-zero/0) but have no idea what the integral behavior should be.

Oct 17 '22 22:10 westonpace

I'm not sure what rounding behavior we have implemented in Acero and therefore the safest thing to do would be to reject all plans that are asking for a specific rounding behavior.

Assuming that the arithmetic operations are all ultimately implemented using standard library operations, you can query the current mode with fegetround() (it's a global processor state thing). But unless an Arrow user goes out of their way to change the mode and this somehow propagates to all threads correctly, it's just going to be FE_TONEAREST, which corresponds with Substrait TIE_TO_EVEN (TIE_AWAY_FROM_ZERO does not seem to be implemented by the standard library, the rest of the mapping should be fairly straightforward).

ETA: at least... that's how it works for the GNU C library. Looking at the C++ documentation better it might only affect some functions explicitly related to rounding. But in any case, any sane processor architecture and floating-point library implementation will be using TIE_TO_EVEN unless otherwise specified, because it is the rounding mode that is least likely to be statistically biased for practical algorithms. It seems like a safe bet to me to accept both unspecified and TIE_TO_EVEN.

Oct 18 '22 15:10 jvanstraten

It seems like a safe bet to me to accept both unspecified and TIE_TO_EVEN.

Worst case we find out when we get around to implementing Substrait function testing for this specific option :laughing: Nightmare fuel for @richtia

Oct 18 '22 21:10 westonpace

Pinged @lidavidm / @pitrou for a brief review of the Exp kernel. You can probably ignore the Substrait part if you want.

Oct 21 '22 18:10 westonpace

@drabastomek can I help move this forward?

Nov 15 '22 13:11 rok

@rok Yes, I think that would be helpful. I had agreed with @drabastomek offline to finish this up for him but hadn't gotten around to it yet. If you have a few cycles to address the review comments that would be appreciated!

Nov 15 '22 17:11 westonpace

@westonpace @rok Thanks guys! I had this on my radar for this week. I addressed the comments but would be great if @rok could have another look -- much appreciated!

Nov 15 '22 17:11 drabastomek

@pitrou I've agreed with @drabastomek to implement the review feedback. Could you please take another look?

Nov 16 '22 11:11 rok

@pitrou do you think this needs another pass?

Nov 17 '22 14:11 rok

Nice! Thanks @drabastomek & all!

Nov 17 '22 15:11 rok

Benchmark runs are scheduled for baseline = 0f87e6bf89211a989e4b77990b35d157655ea1a8 and contender = b5b0282516e04fd27cf59f3c05e849967d997cb4. b5b0282516e04fd27cf59f3c05e849967d997cb4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.74% :arrow_up:0.0%] test-mac-arm [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Failed :arrow_down:0.07% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] b5b02825 ec2-t3-xlarge-us-east-2 [Failed] b5b02825 test-mac-arm [Failed] b5b02825 ursa-i9-9960x [Failed] b5b02825 ursa-thinkcentre-m75q [Failed] 0f87e6bf ec2-t3-xlarge-us-east-2 [Finished] 0f87e6bf test-mac-arm [Finished] 0f87e6bf ursa-i9-9960x [Finished] 0f87e6bf ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Nov 17 '22 21:11 ursabot

Thank you all for the support with this!

Nov 18 '22 19:11 drabastomek

arrow arrow copied to clipboard

ARROW-15538: [C++] Expanding coverage of math functions from Substrait to Acero

arrow
arrow copied to clipboard