feat: Support log for Decimal32 and Decimal64
Which issue does this PR close?
- Part of #17555 .
Rationale for this change
Analysis
Other engines:
- Clickhouse seems to only consider
"(U)Int*", "Float*", "Decimal*"as arguments for log https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/log.cpp#L47-L63
Libraries
- There a C++ library libdecimal which internally uses Intel Decimal Floating Point Library for it's decimal32 operations. Intel's library itself converts the decimal32 to double and calls
log. https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c - There was another C++ library based on IBM's decimal decNumber library https://github.com/semihc/CppDecimal . This one's implementation of
logis fully using decimal, but I don't think this would be very performant way to do this
I'm going to go with an approach similar to the one inside Intel's decimal library. To begin with the decimal32 -> double is done by a simple scaling
What changes are included in this PR?
- Support Decimal32 for log
Are these changes tested?
Yes, unit tests have been added, and I've tested this from the datafusion cli for Decimal32
> select log(2.0, arrow_cast(12345.67, 'Decimal32(9, 2)'));
+-----------------------------------------------------------------------+
| log(Float64(2),arrow_cast(Float64(12345.67),Utf8("Decimal32(9, 2)"))) |
+-----------------------------------------------------------------------+
| 13.591717513271785 |
+-----------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.021 seconds.
Are there any user-facing changes?
- The precision of the result for Decimal32 will change, the precision loss in #18524 does not occur in this PR
I'm still working on the Decimal64, but early feedback on the PR is much appreciated
Interesting, I didn't realise negative scales aren't allowed. I assumed they were as arrow allows negative scales in decimal. https://github.com/apache/arrow-rs/blob/main/arrow-schema/src/datatype.rs#L359-L372
Interesting, I didn't realise negative scales aren't allowed. I assumed they were as arrow allows negative scales in decimal. https://github.com/apache/arrow-rs/blob/main/arrow-schema/src/datatype.rs#L359-L372
Negative scales are allowed; I believe any places in our codebase they are disallowed is mainly due to implementation limitation (i.e. not yet supported) rather than inherently not being possible.
(Haven't had a chance to review this PR yet, hopefully soon)
Now that I think about it, how would this implementation be different from having our coercion/casting logic convert the input decimal arrays to floats before applying the log
My take here is that it would depend on the function. Let's say something round, ceil, floor a coercion is unacceptable, where as with things like log, sin the results would not be precise and so a coercion is acceptable
We can doing it entirely in decimal like this one based on IBM's version https://github.com/semihc/CppDecimal/blob/main/src/decNumber.c#L1384-L1518
But I think this would be a rather expensive version. Also from https://github.com/apache/datafusion/issues/18524 if we want
select log(2.0, 100000000000000000000000000000000000::decimal(38,0)); to be 116.267483321058 we need the result of log to be float (which it currently is), otherwise I think the result has to be 116 (since it's (38, 0))
Another thing is Intel's decimal library does convert decimal -> binary64 then back to decimal. The conversion itself is a bit more sophisticated than a simple (N / 10*scale) https://github.com/karlorz/IntelRDFPMathLib20U2/blob/main/LIBRARY/src/bid32_log.c
We can port their conversion logic here if needed, I wanted to get some feedback on this PR before that. Let me know your suggestion on this
In the original PR that kicked off this effort (#17023) it converts the decimal128 to the native i128 representation before doing an integer log, as converting to f64 apparently causes some precision loss. I think we should follow suit as otherwise there is little difference than previous behaviour of casting to float64 first before doing the log 🤔
It also makes me wonder if we should handle negative scale by just casting to float to do the log so we don't lose functionality.
Thanks for looking into the other solutions from IBM and Intel; I think we can avoid porting/copying their code unless there is a strong need for what they bring to the table for us.
Ok, instead of converting to float I'll keep it as integers and perform and integer log.
Just one thing though the log function in DuckDB and Clickhouse return float64/double so the behaviour might different (but I personally think that's fine, as the user will not have an implicit type conversion from decimal to float). And if we are returning as decimal then the following is expected right? across Decimal32, Decimal64, Decimal128, Decimal256
| Query | Res |
|---|---|
| select log(12345::decimal(38,0)) | 4.0 |
| select log(12345::decimal(38,2)) | 4.09 |
I went through the Intel and IBM solution to understand potential edge cases which could affect precision. I won't be porting anything from them unless it's truly needed, point noted
Ok, instead of converting to float I'll keep it as integers and perform and integer log.
Just one thing though the
logfunction in DuckDB and Clickhouse returnfloat64/doubleso the behaviour might different (but I personally think that's fine, as the user will not have an implicit type conversion from decimal to float). And if we are returning as decimal then the following is expected right? across Decimal32, Decimal64, Decimal128, Decimal256
Query Res select log(12345::decimal(38,0)) 4.0 select log(12345::decimal(38,2)) 4.09 I went through the Intel and IBM solution to understand potential edge cases which could affect precision. I won't be porting anything from them unless it's truly needed, point noted
I think we should follow what the decimal128 version is doing: it performs ilog on the scaled i128 and then converts it to f64 to return. We aren't returning log of decimal as log, we're still converting it to f64.
The idea is that doing the log as ilog on scaled decimal we get more accurate results than casting decimal to float before performing log on the float.
See the original decimal128 PR for reference: https://github.com/apache/datafusion/pull/17023