arrow icon indicating copy to clipboard operation
arrow copied to clipboard

ARROW-15538: [C++] Expanding coverage of math functions from Substrait to Acero

Open drabastomek opened this issue 2 years ago • 9 comments

This PR adds the following functionality to Acero/Substrait consumer

  1. Support for logarithmic functions.
  2. Support for power, sqrt, exp, abs, sign.
  3. Implements exp kernel.

drabastomek avatar Oct 17 '22 05:10 drabastomek

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Oct 17 '22 05:10 github-actions[bot]

General note that applies more broadly than just this PR: I don't think DecodeOptionlessOverflowableArithmetic is suitable for these functions, nor for division or for any of the floating-point versions of the functions it's already used for. The problem is that these functions all have different sets of options (integer overflow, floating point rounding, and/or domain error handling), while DecodeOptionlessOverflowableArithmetic only seems to handle the integer overflow option. In addition, it expects <name>_checked kernels to exist in addition to the normal kernels to handle the "throw an error on integer overflow" option. Also, Substrait doesn't even define functions that are generally only applied to floating point numbers (like exp) for integers at all. That's not necessarily a problem for consumption, though.

That being said, the way options are passed in Substrait is currently being revised, so I don't know how much value there is in fixing this short-term/maybe it's fine for now as a placeholder. I'm not at all up-to-date on planning for Acero-Substrait.

jvanstraten avatar Oct 17 '22 13:10 jvanstraten

https://issues.apache.org/jira/browse/ARROW-15538

github-actions[bot] avatar Oct 17 '22 17:10 github-actions[bot]

:warning: Ticket has not been started in JIRA, please click 'Start Progress'.

github-actions[bot] avatar Oct 17 '22 17:10 github-actions[bot]

Thanks @jvanstraten. You're right about the exp not being defined for ints - I changed the calls and added _checked method too.

I agree that currently all the _checked methods only check for int overflows and it might be something we might want to address at some point. Not sure if this should be a part of this PR tho.

drabastomek avatar Oct 17 '22 19:10 drabastomek

I don't think DecodeOptionlessOverflowableArithmetic is suitable for these functions, nor for division or for any of the floating-point versions of the functions it's already used for.

At the moment we should focus on exposing the capability that we have, and describing it accurately. For example, in Arrow we have "_checked" but "what exactly is checked?" For example, in Arrow we have divide and divide_checked. They seem to behave as follows (through trial and error):

Method Data Type Problem Behavior
divide integral overflow silent
divide integral divide by zero error
divide integral domain error error
divide floating divide by zero limit
divide floating domain error nan
divide_checked integral overflow error
divide_checked integral divide by zero error
divide_checked integral domain error error
divide_checked floating divide by zero error
divide_checked floating domain error error

Thus we should have the following behavior for substrait calls

Call Data Type Options Arrow Result
divide integral overflow=SILENT divide
divide integral overflow=SATURATE reject plan
divide integral overflow=ERROR divide_checked
divide floating rounding=* reject plan[1]
divide floating on_domain_error=NAN divide
divide floating on_domain_error=ERROR divide_checked
divide floating on_division_by_zero=LIMIT divide
divide floating on_division_by_zero=NAN reject plan
divide floating on_division_by_zero=ERROR divide_checked
divide floating on_domain_error=NAN on_division_by_zero=ERROR reject plan[2]
divide floating on_domain_error=ERROR on_division_by_zero=LIMIT reject plan[2]
  1. I'm not sure what rounding behavior we have implemented in Acero and therefore the safest thing to do would be to reject all plans that are asking for a specific rounding behavior.
  2. We cannot satisfy these specific combinations between divide and divide_checked.

Caveat / Needs clarification:

IMHO, It is not clear, in Substrait, if division by zero for integral arguments counts as overflow or not. Also, it is possible to have 0/0 in integer arithmetic, which I have been interpreting as "domain error" (technically division by zero requires non-zero/0) but have no idea what the integral behavior should be.

westonpace avatar Oct 17 '22 22:10 westonpace

I'm not sure what rounding behavior we have implemented in Acero and therefore the safest thing to do would be to reject all plans that are asking for a specific rounding behavior.

Assuming that the arithmetic operations are all ultimately implemented using standard library operations, you can query the current mode with fegetround() (it's a global processor state thing). But unless an Arrow user goes out of their way to change the mode and this somehow propagates to all threads correctly, it's just going to be FE_TONEAREST, which corresponds with Substrait TIE_TO_EVEN (TIE_AWAY_FROM_ZERO does not seem to be implemented by the standard library, the rest of the mapping should be fairly straightforward).

ETA: at least... that's how it works for the GNU C library. Looking at the C++ documentation better it might only affect some functions explicitly related to rounding. But in any case, any sane processor architecture and floating-point library implementation will be using TIE_TO_EVEN unless otherwise specified, because it is the rounding mode that is least likely to be statistically biased for practical algorithms. It seems like a safe bet to me to accept both unspecified and TIE_TO_EVEN.

jvanstraten avatar Oct 18 '22 15:10 jvanstraten

It seems like a safe bet to me to accept both unspecified and TIE_TO_EVEN.

Worst case we find out when we get around to implementing Substrait function testing for this specific option :laughing: Nightmare fuel for @richtia

westonpace avatar Oct 18 '22 21:10 westonpace

Pinged @lidavidm / @pitrou for a brief review of the Exp kernel. You can probably ignore the Substrait part if you want.

westonpace avatar Oct 21 '22 18:10 westonpace

@drabastomek can I help move this forward?

rok avatar Nov 15 '22 13:11 rok

@rok Yes, I think that would be helpful. I had agreed with @drabastomek offline to finish this up for him but hadn't gotten around to it yet. If you have a few cycles to address the review comments that would be appreciated!

westonpace avatar Nov 15 '22 17:11 westonpace

@westonpace @rok Thanks guys! I had this on my radar for this week. I addressed the comments but would be great if @rok could have another look -- much appreciated!

drabastomek avatar Nov 15 '22 17:11 drabastomek

@pitrou I've agreed with @drabastomek to implement the review feedback. Could you please take another look?

rok avatar Nov 16 '22 11:11 rok

@pitrou do you think this needs another pass?

rok avatar Nov 17 '22 14:11 rok

Nice! Thanks @drabastomek & all!

rok avatar Nov 17 '22 15:11 rok

Benchmark runs are scheduled for baseline = 0f87e6bf89211a989e4b77990b35d157655ea1a8 and contender = b5b0282516e04fd27cf59f3c05e849967d997cb4. b5b0282516e04fd27cf59f3c05e849967d997cb4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. Conbench compare runs links: [Failed :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2 [Failed :arrow_down:0.74% :arrow_up:0.0%] test-mac-arm [Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x [Failed :arrow_down:0.07% :arrow_up:0.0%] ursa-thinkcentre-m75q Buildkite builds: [Finished] b5b02825 ec2-t3-xlarge-us-east-2 [Failed] b5b02825 test-mac-arm [Failed] b5b02825 ursa-i9-9960x [Failed] b5b02825 ursa-thinkcentre-m75q [Failed] 0f87e6bf ec2-t3-xlarge-us-east-2 [Finished] 0f87e6bf test-mac-arm [Finished] 0f87e6bf ursa-i9-9960x [Finished] 0f87e6bf ursa-thinkcentre-m75q Supported benchmarks: ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True test-mac-arm: Supported benchmark langs: C++, Python, R ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot avatar Nov 17 '22 21:11 ursabot

Thank you all for the support with this!

drabastomek avatar Nov 18 '22 19:11 drabastomek