arrow
arrow copied to clipboard
ARROW-15538: [C++] Expanding coverage of math functions from Substrait to Acero
This PR adds the following functionality to Acero/Substrait consumer
- Support for logarithmic functions.
- Support for power, sqrt, exp, abs, sign.
- Implements exp kernel.
Thanks for opening a pull request!
If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.
Then could you also rename pull request title in the following format?
ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
See also:
General note that applies more broadly than just this PR: I don't think DecodeOptionlessOverflowableArithmetic
is suitable for these functions, nor for division or for any of the floating-point versions of the functions it's already used for. The problem is that these functions all have different sets of options (integer overflow, floating point rounding, and/or domain error handling), while DecodeOptionlessOverflowableArithmetic
only seems to handle the integer overflow option. In addition, it expects <name>_checked
kernels to exist in addition to the normal kernels to handle the "throw an error on integer overflow" option. Also, Substrait doesn't even define functions that are generally only applied to floating point numbers (like exp
) for integers at all. That's not necessarily a problem for consumption, though.
That being said, the way options are passed in Substrait is currently being revised, so I don't know how much value there is in fixing this short-term/maybe it's fine for now as a placeholder. I'm not at all up-to-date on planning for Acero-Substrait.
https://issues.apache.org/jira/browse/ARROW-15538
:warning: Ticket has not been started in JIRA, please click 'Start Progress'.
Thanks @jvanstraten. You're right about the exp
not being defined for ints - I changed the calls and added _checked
method too.
I agree that currently all the _checked
methods only check for int overflows and it might be something we might want to address at some point. Not sure if this should be a part of this PR tho.
I don't think DecodeOptionlessOverflowableArithmetic is suitable for these functions, nor for division or for any of the floating-point versions of the functions it's already used for.
At the moment we should focus on exposing the capability that we have, and describing it accurately. For example, in Arrow we have "_checked" but "what exactly is checked?" For example, in Arrow we have divide
and divide_checked
. They seem to behave as follows (through trial and error):
Method | Data Type | Problem | Behavior |
---|---|---|---|
divide | integral | overflow | silent |
divide | integral | divide by zero | error |
divide | integral | domain error | error |
divide | floating | divide by zero | limit |
divide | floating | domain error | nan |
divide_checked | integral | overflow | error |
divide_checked | integral | divide by zero | error |
divide_checked | integral | domain error | error |
divide_checked | floating | divide by zero | error |
divide_checked | floating | domain error | error |
Thus we should have the following behavior for substrait calls
Call | Data Type | Options | Arrow Result |
---|---|---|---|
divide | integral | overflow=SILENT | divide |
divide | integral | overflow=SATURATE | reject plan |
divide | integral | overflow=ERROR | divide_checked |
divide | floating | rounding=* | reject plan[1] |
divide | floating | on_domain_error=NAN | divide |
divide | floating | on_domain_error=ERROR | divide_checked |
divide | floating | on_division_by_zero=LIMIT | divide |
divide | floating | on_division_by_zero=NAN | reject plan |
divide | floating | on_division_by_zero=ERROR | divide_checked |
divide | floating | on_domain_error=NAN on_division_by_zero=ERROR | reject plan[2] |
divide | floating | on_domain_error=ERROR on_division_by_zero=LIMIT | reject plan[2] |
- I'm not sure what rounding behavior we have implemented in Acero and therefore the safest thing to do would be to reject all plans that are asking for a specific rounding behavior.
- We cannot satisfy these specific combinations between divide and divide_checked.
Caveat / Needs clarification:
IMHO, It is not clear, in Substrait, if division by zero for integral arguments counts as overflow or not. Also, it is possible to have 0/0 in integer arithmetic, which I have been interpreting as "domain error" (technically division by zero requires non-zero/0) but have no idea what the integral behavior should be.
I'm not sure what rounding behavior we have implemented in Acero and therefore the safest thing to do would be to reject all plans that are asking for a specific rounding behavior.
Assuming that the arithmetic operations are all ultimately implemented using standard library operations, you can query the current mode with fegetround()
(it's a global processor state thing). But unless an Arrow user goes out of their way to change the mode and this somehow propagates to all threads correctly, it's just going to be FE_TONEAREST
, which corresponds with Substrait TIE_TO_EVEN
(TIE_AWAY_FROM_ZERO
does not seem to be implemented by the standard library, the rest of the mapping should be fairly straightforward).
ETA: at least... that's how it works for the GNU C library. Looking at the C++ documentation better it might only affect some functions explicitly related to rounding. But in any case, any sane processor architecture and floating-point library implementation will be using TIE_TO_EVEN
unless otherwise specified, because it is the rounding mode that is least likely to be statistically biased for practical algorithms. It seems like a safe bet to me to accept both unspecified and TIE_TO_EVEN
.
It seems like a safe bet to me to accept both unspecified and TIE_TO_EVEN.
Worst case we find out when we get around to implementing Substrait function testing for this specific option :laughing: Nightmare fuel for @richtia
Pinged @lidavidm / @pitrou for a brief review of the Exp kernel. You can probably ignore the Substrait part if you want.
@drabastomek can I help move this forward?
@rok Yes, I think that would be helpful. I had agreed with @drabastomek offline to finish this up for him but hadn't gotten around to it yet. If you have a few cycles to address the review comments that would be appreciated!
@westonpace @rok Thanks guys! I had this on my radar for this week. I addressed the comments but would be great if @rok could have another look -- much appreciated!
@pitrou I've agreed with @drabastomek to implement the review feedback. Could you please take another look?
@pitrou do you think this needs another pass?
Nice! Thanks @drabastomek & all!
Benchmark runs are scheduled for baseline = 0f87e6bf89211a989e4b77990b35d157655ea1a8 and contender = b5b0282516e04fd27cf59f3c05e849967d997cb4. b5b0282516e04fd27cf59f3c05e849967d997cb4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed :arrow_down:0.0% :arrow_up:0.0%] ec2-t3-xlarge-us-east-2
[Failed :arrow_down:0.74% :arrow_up:0.0%] test-mac-arm
[Failed :arrow_down:0.0% :arrow_up:0.0%] ursa-i9-9960x
[Failed :arrow_down:0.07% :arrow_up:0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] b5b02825
ec2-t3-xlarge-us-east-2
[Failed] b5b02825
test-mac-arm
[Failed] b5b02825
ursa-i9-9960x
[Failed] b5b02825
ursa-thinkcentre-m75q
[Failed] 0f87e6bf
ec2-t3-xlarge-us-east-2
[Finished] 0f87e6bf
test-mac-arm
[Finished] 0f87e6bf
ursa-i9-9960x
[Finished] 0f87e6bf
ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
Thank you all for the support with this!