[FEA] Fully implement multiply and divide for decimal128
Is your feature request related to a problem? Please describe. This is a bit of a crazy idea. I already had a crazy idea about SUM (#3944). This is similar, but for multiply and divide.
There are algorithms for both multiply and divide that work at arbitrarily large precision.
https://en.wikipedia.org/wiki/Multiplication_algorithm
https://en.wikipedia.org/wiki/Division_algorithm
We would also need to write our own arbitrary-precision rounding algorithm, which we could base on what we and CUDF already do, or try to come up with something new and more efficient.
I think we can adapt these to work with the operations we already have access to on the GPU. If we had to, we could also write custom CUDA kernels that take the inputs, do the operations as 256-bit integer values, and then round the result back.
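To make the rounding piece concrete, here is a minimal CPU sketch of a half-up rounding step over an unscaled integer value, with BigInteger standing in for the fixed-width 256-bit math a kernel would do (the function name and shape are just for illustration):

```scala
import java.math.BigInteger

// Drop `digitsToDrop` decimal digits from an unscaled value, rounding
// half up (away from zero), matching Spark's default decimal rounding.
def roundAtPow10(unscaled: BigInteger, digitsToDrop: Int): BigInteger = {
  val pow = BigInteger.TEN.pow(digitsToDrop)
  // divideAndRemainder truncates toward zero; the remainder keeps the
  // sign of the dividend
  val Array(q, r) = unscaled.divideAndRemainder(pow)
  val roundUp = r.abs.multiply(BigInteger.valueOf(2)).compareTo(pow) >= 0
  if (!roundUp) q
  else if (unscaled.signum < 0) q.subtract(BigInteger.ONE)
  else q.add(BigInteger.ONE)
}

// roundAtPow10(new BigInteger("52377"), 3) => 52  (5.2377 -> 5.2 at scale 1)
```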
This is not as fully thought through as the SUM operation, because there are a number of possible algorithms. I am no expert in this, so I think we could start with something simpler that we know we can implement, even if it is not the most memory/compute efficient, and then refine it later if needed.
Let's take multiply as an example, and to keep things simple let's say we can only support decimal values up to a precision of 4. In this example we will do Decimal(4, 2) * Decimal(4, 2) and get an output of Decimal(4, 1) (trying to emulate what Spark does with overflow). We will do the math for 10.27 * 0.51.
The Spark way.
- Do the multiply with arbitrary precision. 10.27 * 0.51 => 5.2377
- Round the result to the desired scale 5.2377 => 5.2
- Check if the result fits in the desired precision: `if (999.9 >= 5.2 >= -999.9) 5.2 else null`
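For reference, a minimal sketch of those three steps on the CPU with java.math.BigDecimal (assuming Spark's default HALF_UP rounding; the function itself is hypothetical):

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Exact multiply, round to the output scale, then range check against the
// max/min that fit in the output precision, e.g. +/-999.9 for Decimal(4, 1).
def sparkWayMultiply(lhs: JBigDecimal, rhs: JBigDecimal,
    outPrecision: Int, outScale: Int): Option[JBigDecimal] = {
  val exact = lhs.multiply(rhs)                                // 10.27 * 0.51 => 5.2377
  val rounded = exact.setScale(outScale, RoundingMode.HALF_UP) // 5.2377 => 5.2
  val bound = JBigDecimal.TEN.pow(outPrecision - outScale)     // 10^3 = 1000
  if (rounded.abs.compareTo(bound) < 0) Some(rounded) else None // else overflow
}

// sparkWayMultiply(new JBigDecimal("10.27"), new JBigDecimal("0.51"), 4, 1)
// => Some(5.2)
```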
For us, we don't have arbitrary precision, so instead we can do a long multiplication with some adds and the multiply operations we already have (a sketch follows after this list).
- Convert the numbers to "integers" (scale 0) and split them up into values we know we can multiply without overflowing.
  10.27 => 1027 => 10 and 27
  0.51 => 0051 => 00 and 51
- Do the 4 multiplies with numbers we know we can multiply without overflow, and keep track of the power of ten each partial product carries. The 4 in the first exponent comes from the fact that we multiplied the two high-order chunks, each shifted up by two digits; the -4 is the scale that results from the original multiply.
  00 * 10 = 0000 * 10^(4 - 4)
  00 * 27 = 0000 * 10^(2 - 4)
  10 * 51 = 0510 * 10^(2 - 4)
  27 * 51 = 1377 * 10^(0 - 4)
- Do some initial overflow checks. Essentially do the same kind of overflow checks we do now for the max and min values that can fit in the output, but scaled to match the scale of each partial answer:
  999.9 >= 0000 >= -999.9
  999.9 >= 00.00 >= -999.9
  999.9 >= 05.10 >= -999.9
  999.9 >= .1377 >= -999.9
  If any of them would overflow, we know that the final result would also overflow.
- Add the results, rounding/truncating as we go. We start by adding together all of the values that have digits below the desired output scale. I think with how Spark works we can just truncate the lower number to the precision we need before rounding, but I am not 100% sure on that and will need to do some experimentation to see.
  .1377 truncated to scale 2 => .13
  .13 + 00.00 + 05.10 => 05.23
  rounded to the final scale => 05.2
- Move the remaining values to the final output scale and add them together. The rounding could have added another digit at the maximum precision of the result, but because we know that a decimal128 has enough space to hold this without really overflowing, we don't need to worry about it: we already checked the ranges ahead of time.
  0000.0 + 05.2 => 5.2
- Finally, check the range of the result again, because we did rounding.
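Here is a toy end-to-end version of the chunked multiply above for the 4-digit case, using plain Longs (the real thing would use 128-bit chunks and 256-bit partials in a kernel; the overflow and rounding steps are covered in the list, so this sketch only does the decomposition and reassembly):

```scala
// Split each unscaled 4-digit value into two 2-digit chunks, form the four
// partial products with their powers of ten, and add them back up.
def chunkedMultiplyUnscaled(lhs: Long, rhs: Long): Long = {
  def pow10(e: Int): Long = (0 until e).foldLeft(1L)((a, _) => a * 10)
  val (lHi, lLo) = (lhs / 100, lhs % 100)
  val (rHi, rLo) = (rhs / 100, rhs % 100)
  val partials = Seq(
    (rHi * lHi, 4), // 00 * 10 = 0000, shifted up 4 digits
    (rHi * lLo, 2), // 00 * 27 = 0000, shifted up 2 digits
    (lHi * rLo, 2), // 10 * 51 = 0510, shifted up 2 digits
    (lLo * rLo, 0)) // 27 * 51 = 1377, no shift
  // the per-partial overflow checks from the list above would go here
  partials.map { case (v, shift) => v * pow10(shift) }.sum
}

// chunkedMultiplyUnscaled(1027, 51) => 52377, i.e. 5.2377 at scale 4,
// which then rounds to 5.2 exactly as in the steps above
```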
This is already long enough, but I think we can do something similar for divide, and I think we can also reduce the complexity of multiply with a little work.
Removing from 22.02 until we have a clearer customer need.
This blocks issues https://github.com/NVIDIA/spark-rapids/issues/6142 and https://github.com/NVIDIA/spark-rapids/issues/6143
Multiply looks relatively simple to implement. Divide, however, is much more difficult.
Sadly, it looks like in the worst case we are going to need more than 256 bits.
- 192 bits can hold 57 signed digits (see below for average)
- 256 bits can hold 76 signed digits
- 384 bits can hold 115 signed digits (see below for divide in general)
- 512 bits can hold 153 signed digits
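Those digit counts are just floor(log10(2^(bits - 1))); a quick way to double-check them:

```scala
// decimal digits guaranteed to fit in a signed n-bit integer:
// one less than the digit count of 2^(n - 1)
def signedDecimalDigits(bits: Int): Int =
  java.math.BigInteger.ONE.shiftLeft(bits - 1).toString.length - 1

// Seq(192, 256, 384, 512).map(signedDecimalDigits) => List(57, 76, 115, 153)
```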
For divide we need the large values because we have to shift the dividend/lhs over to make enough room so the output gets the precision/scale it needs + 1. The + 1 is so we can round the result afterwards to match Spark. In the worst case the output is (38, 38), the dividend/lhs is (38, 0), and the divisor/rhs is (38, 38). This is highly contrived and unrealistic, but possible, so if we can support it we should try to. It would mean we have to shift the lhs to have a scale of 77 = 38 + 38 + 1 and a precision of 115 = 77 + 38.
(See https://github.com/NVIDIA/spark-rapids/blob/528c67b2215589e0dc0002b8975970a07db4706b/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/arithmetic.scala#L851-L863.)
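A minimal sketch of that shift-then-divide, with BigInteger standing in for the 384-bit math and positive inputs assumed (the helper is hypothetical):

```scala
import java.math.BigInteger

// Upscale the dividend so an integer divide yields outScale + 1 digits of
// scale, then strip the extra digit, rounding half up.
def shiftedDivide(lhsUnscaled: BigInteger, lhsScale: Int,
    rhsUnscaled: BigInteger, rhsScale: Int, outScale: Int): BigInteger = {
  // quotient scale = lhs scale - rhs scale, so target = outScale + rhsScale + 1
  val targetScale = outScale + rhsScale + 1
  val shifted = lhsUnscaled.multiply(BigInteger.TEN.pow(targetScale - lhsScale))
  val q = shifted.divide(rhsUnscaled) // at scale outScale + 1
  val Array(out, roundDigit) = q.divideAndRemainder(BigInteger.TEN)
  if (roundDigit.intValue >= 5) out.add(BigInteger.ONE) else out
}

// 1.00 / 3.00 at output scale 4:
// shiftedDivide(new BigInteger("100"), 2, new BigInteger("300"), 2, 4)
// => 3333, i.e. 0.3333
```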
We might be able to play some games if we can get the remainder out of the divide algorithm:

```
// round half up without computing the extra digit: comparing twice the
// remainder to the divisor avoids losing the half to integer division
if (2 * remainder >= divisor) {
  output += 1
}
```
With that we would only need a precision of 114 and a scale of 76. But 384 bits can support everything that we need, so there really isn't a lot of reason to do it unless it makes the rounding work much, much simpler. I am not sure that what I am proposing would actually work.
For average (#6142) we know that the divisor/rhs is a long, so it would end up being a (20, 0), and the output is the input to average with 4 added to the scale, but bounded to 38. So in the worst case we would end up with an output of (38, 38), a dividend/lhs of (38, 34), and a divisor/rhs of (20, 0). This means the lhs would need to be shifted to have a scale of 39 and a precision of 43.
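The same bookkeeping as a couple of helper functions (names made up for the example), which reproduce both the general worst case and the average case:

```scala
// scale the shifted lhs needs: output scale + rhs scale + 1 rounding digit
def shiftedScale(outScale: Int, rhsScale: Int): Int = outScale + rhsScale + 1

// precision after the shift: the original precision plus the added scale digits
def shiftedPrecision(lhsPrecision: Int, lhsScale: Int, targetScale: Int): Int =
  lhsPrecision + (targetScale - lhsScale)

// general worst case: shiftedScale(38, 38) == 77, shiftedPrecision(38, 0, 77) == 115
// average worst case: shiftedScale(38, 0) == 39, shiftedPrecision(38, 34, 39) == 43
```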
Not sure if we want to have multiple kernels for the different sizes or just one. We probably can start with the hardest one first and then see what kind of a performance boost we get for the smaller sizes.
The good news is that in either case the rhs stays a smaller value, 128 bits at most, though it could be 64 bits or even 32 bits. This potentially lets us save a lot of register space when doing the divide. I still need to think about how this might impact the choice of algorithm. 384 bits is still a lot.
The basics of multiply and divide are in place now.