Support float/double castings for ORC reading.
Closes #6291 (a sub-issue of https://github.com/NVIDIA/spark-rapids/issues/6149).
This PR implements casting float/double to {bool, integer types, float/double, string, timestamp} when reading ORC. Here double means float64, and the integer types are int8/16/32/64.
Implementation
| Casting | Implementation Description |
|---|---|
| float/double -> bool, int8/16/32/64 | 1. First replace rows that cannot fit in a long with nulls. 2. Cast the ColumnVector to LONG. 3. Down-cast the long column to the target integral type. |
| float <-> double | 1. Call ColumnView.castTo. 2. When casting double -> float, a double value greater than FLOAT_MAX becomes Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps 10 or more. 2. Added a config item spark.rapids.sql.format.orc.floatTypesToString.enable (default true) to control whether float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If ROUND(val * 1000) > LONG_MAX (e.g. val = 1e20), replace the value with null; otherwise keep it and convert the values into a milliseconds vector. 3. Multiply by 1000 to convert into a microseconds vector, taking care of long (INT64) overflow, since timestamps are stored as INT64. |
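The integer and timestamp rows above can be sketched row-wise in plain Scala. This is a minimal illustration of the intended semantics under the assumptions stated in the table, not the actual cuDF column kernels; `OrcCastSketch`, `doubleToInt`, and `doubleToTimestampMicros` are hypothetical names introduced here.

```scala
object OrcCastSketch {
  // float/double -> integral: values that cannot fit in a long become null,
  // then the long is narrowed to the target type (int here, as an example).
  def doubleToInt(v: Double): Option[Int] =
    if (v.isNaN || v < Long.MinValue.toDouble || v > Long.MaxValue.toDouble) {
      None                 // cannot fit in long -> null
    } else {
      Some(v.toLong.toInt) // long -> target integral type
    }

  // float/double -> timestamp: the value is seconds; if scaling to millis
  // would overflow long, the row becomes null, otherwise millis are scaled
  // to micros with a second overflow guard (timestamps are stored as INT64).
  def doubleToTimestampMicros(v: Double): Option[Long] =
    if (v.isNaN || math.abs(v) * 1000.0 > Long.MaxValue.toDouble) {
      None // e.g. v = 1e20
    } else {
      val millis = math.round(v * 1000.0)
      if (millis > Long.MaxValue / 1000 || millis < Long.MinValue / 1000) None
      else Some(millis * 1000L)
    }
}
```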
For the precision problem of float/double -> string, a similar operation already exists in the SQL cast path, e.g. `select cast(float_col as string) from table`. There it is controlled by a conf:
```scala
lazy val isCastFloatToStringEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_STRING)

val ENABLE_CAST_FLOAT_TO_STRING = conf("spark.rapids.sql.castFloatToString.enabled")
  .doc("Casting from floating point types to string on the GPU returns results that have " +
    "a different precision than the default results of Spark.")
  .booleanConf
  .createWithDefault(true)
```
In GpuCast.scala, the plan tree is checked recursively:
```scala
private def recursiveTagExprForGpuCheck(...) {
  ...
  case (_: FloatType | _: DoubleType, _: StringType) if !conf.isCastFloatToStringEnabled =>
    willNotWorkOnGpu("the GPU will use different precision than Java's toString method when " +
      "converting floating point data types to strings and this can produce results that " +
      "differ from the default behavior in Spark. To enable this operation on the GPU, set" +
      s" ${RapidsConf.ENABLE_CAST_FLOAT_TO_STRING} to true.")
}
```
I think we can handle the precision problem of float/double->string in a similar way here.
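As a minimal illustration of the precision gap itself (plain Scala, not spark-rapids code; `FloatToStringDemo` is a hypothetical name): Java's `Float.toString` emits the shortest decimal string that round-trips back to the same float, while rendering a fixed 9 fractional digits, similar in spirit to the cuDF behavior described above, exposes the extra bits of the binary value.

```scala
import java.util.Locale

object FloatToStringDemo {
  // CPU (Java/Scala) behavior: shortest decimal string that round-trips.
  def shortest(f: Float): String = f.toString

  // Fixed 9 digits after the decimal point (illustrative only); Locale.ROOT
  // keeps the decimal separator a '.' regardless of the JVM's default locale.
  def nineDigits(f: Float): String = "%.9f".formatLocal(Locale.ROOT, f)
}
```

For example, the float closest to 0.1 prints as `0.1` with `toString`, but its 9-digit rendering reveals that the stored binary value is not exactly 0.1.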
build