
Support float/double castings for ORC reading.

Open · sinkinben opened this issue 3 years ago · 1 comment

Closes #6291, which is a sub-issue of https://github.com/NVIDIA/spark-rapids/issues/6149.

This implements casting float/double to {bool, integer types, double/float, string, timestamp}.

double is also known as float64. Integer types include int8/16/32/64.

Implementation

| Casting | Implementation description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace rows that cannot fit in a long with nulls.<br>2. Convert the ColumnVector to the long type.<br>3. Down-cast the long to the target integral type. |
| float <-> double | 1. Call `ColumnView.castTo`.<br>2. When casting double -> float, if the double value is greater than `FLOAT_MAX`, mark the value as Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10.<br>2. Added a config item `spark.rapids.sql.format.orc.floatTypesToString.enable` (default value `true`) to control whether we can cast float/double -> string while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds.<br>2. If `ROUND(val * 1000) > LONG_MAX`, replace the value with null (e.g. `val = 1e20`); otherwise keep it and convert into a milliseconds vector.<br>3. Multiply by 1000 to convert into a microseconds vector. Pay attention to long (INT64) overflow here, since timestamps are stored in INT64. |
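The integral and timestamp rows of the table can be sketched in plain Scala (no cuDF involved; `doubleToInt` and `doubleToTimestampMicros` are hypothetical helper names that only illustrate the per-row logic described above):

```scala
object OrcFloatCastSketch {
  // float/double -> integral: replace values that cannot fit in a Long with
  // null (None), truncate to Long, then down-cast to the target width (Int here).
  def doubleToInt(v: Double): Option[Int] =
    if (v.isNaN || v < Long.MinValue.toDouble || v > Long.MaxValue.toDouble) None
    else Some(v.toLong.toInt)

  // float/double -> timestamp: ORC treats the value as seconds. If
  // round(v * 1000) would exceed the Long range (e.g. v = 1e20), the row
  // becomes null; otherwise round to milliseconds and multiply by 1000 to get
  // microseconds. Note the micros multiply can itself overflow INT64 for
  // values near the edge, so real code must guard that step as well.
  def doubleToTimestampMicros(v: Double): Option[Long] = {
    val millis = v * 1000.0
    if (millis.isNaN || millis > Long.MaxValue.toDouble || millis < Long.MinValue.toDouble) None
    else Some(math.round(millis) * 1000L)
  }
}
```

On the GPU these steps would operate on whole column vectors rather than single values, but the null-out-then-narrow shape is the same.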

sinkinben avatar Aug 15 '22 10:08 sinkinben

For the precision problem of float/double -> string, a similar operation already exists in SQL cast, e.g. `select cast(float_col as string) from table`.

In SQL cast, it is controlled by a conf:

lazy val isCastFloatToStringEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_STRING)
val ENABLE_CAST_FLOAT_TO_STRING = conf("spark.rapids.sql.castFloatToString.enabled")
    .doc("Casting from floating point types to string on the GPU returns results that have " +
      "a different precision than the default results of Spark.")
    .booleanConf
    .createWithDefault(true)
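For reference, a user hitting the CPU fallback could flip this conf at the session level (a hypothetical usage sketch; it assumes a `SparkSession` named `spark` with the RAPIDS plugin already loaded):

```scala
// Hypothetical usage: enable GPU float -> string casting despite the
// precision difference from Spark's default CPU results.
spark.conf.set("spark.rapids.sql.castFloatToString.enabled", "true")
```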

In GpuCast.scala, the plan tree is checked recursively:

private def recursiveTagExprForGpuCheck(...) {
  ...
  case (_: FloatType | _: DoubleType, _: StringType) if !conf.isCastFloatToStringEnabled =>
    willNotWorkOnGpu("the GPU will use different precision than Java's toString method when " +
        "converting floating point data types to strings and this can produce results that " +
        "differ from the default behavior in Spark.  To enable this operation on the GPU, set" +
        s" ${RapidsConf.ENABLE_CAST_FLOAT_TO_STRING} to true.")
}

I think we can handle the precision problem of float/double->string in a similar way here.
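As a concrete illustration of the precision mismatch (the fixed 9-digit format below only mimics the behavior described above, it is not cuDF's exact algorithm):

```scala
object FloatToStringPrecision {
  def main(args: Array[String]): Unit = {
    val f = 0.1f
    val cpuStyle = f.toString                 // Java's shortest round-trip form
    val gpuStyle = "%.9f".format(f.toDouble)  // fixed 9 digits after the point
    println(s"CPU-style: $cpuStyle")   // 0.1
    println(s"GPU-style: $gpuStyle")   // 0.100000001
  }
}
```

The two strings differ even though they encode the same float bits, which is why the existing conf gates the cast instead of silently changing results.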

sinkinben avatar Aug 16 '22 09:08 sinkinben

build

revans2 avatar Sep 06 '22 17:09 revans2