
Support float/double castings for ORC reading.

Open · sinkinben opened this issue 3 years ago · 1 comment

Closes #6291, which is a sub-issue of https://github.com/NVIDIA/spark-rapids/issues/6149.

This implements casting float/double to {bool, integer types, double/float, string, timestamp}.

double is also known as float64. Integer types include int8/16/32/64.

Implementation

| Casting | Implementation description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace rows that cannot fit in a long with nulls.<br>2. Convert the ColumnVector to the long type.<br>3. Down-cast the long to the target integral type. |
| float <-> double | 1. Call `ColumnView.castTo`.<br>2. When casting double -> float, if the double value is greater than `FLOAT_MAX`, mark the value as Infinity. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10.<br>2. Added a config item `spark.rapids.sql.format.orc.floatTypesToString.enable` (default value `true`) to control whether we can cast float/double -> string while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds.<br>2. If `ROUND(val * 1000) > LONG_MAX`, replace the value with null (e.g. `val = 1e20`); otherwise keep it and convert into a milliseconds vector.<br>3. Multiply by 1000 to convert into a microseconds vector. Pay attention to long (INT64) overflow here, since timestamps are stored in INT64. |
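The integral and timestamp rows of the table can be sketched in plain Scala (no cuDF involved; `doubleToInt` and `doubleToTimestampMicros` are hypothetical helper names that only illustrate the per-row logic described above):

```scala
object OrcFloatCastSketch {
  // float/double -> integral: replace values that cannot fit in a Long with
  // null (None), truncate to Long, then down-cast to the target width (Int here).
  def doubleToInt(v: Double): Option[Int] =
    if (v.isNaN || v < Long.MinValue.toDouble || v > Long.MaxValue.toDouble) None
    else Some(v.toLong.toInt)

  // float/double -> timestamp: ORC treats the value as seconds. If
  // round(v * 1000) would exceed the Long range (e.g. v = 1e20), the row
  // becomes null; otherwise round to milliseconds and multiply by 1000 to get
  // microseconds. Note the micros multiply can itself overflow INT64 for
  // values near the edge, so real code must guard that step as well.
  def doubleToTimestampMicros(v: Double): Option[Long] = {
    val millis = v * 1000.0
    if (millis.isNaN || millis > Long.MaxValue.toDouble || millis < Long.MinValue.toDouble) None
    else Some(math.round(millis) * 1000L)
  }
}
```

On the GPU these steps would operate on whole column vectors rather than single values, but the null-out-then-narrow shape is the same.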

sinkinben avatar Aug 15 '22 10:08 sinkinben

For the precision problem of float/double -> string, a similar operation already exists in SQL cast, e.g. `select cast(float_col as string) from table`.

In SQL cast, it is controlled by a conf:

lazy val isCastFloatToStringEnabled: Boolean = get(ENABLE_CAST_FLOAT_TO_STRING)
val ENABLE_CAST_FLOAT_TO_STRING = conf("spark.rapids.sql.castFloatToString.enabled")
    .doc("Casting from floating point types to string on the GPU returns results that have " +
      "a different precision than the default results of Spark.")
    .booleanConf
    .createWithDefault(true)
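For reference, a user hitting the CPU fallback could flip this conf at the session level (a hypothetical usage sketch; it assumes a `SparkSession` named `spark` with the RAPIDS plugin already loaded):

```scala
// Hypothetical usage: enable GPU float -> string casting despite the
// precision difference from Spark's default CPU results.
spark.conf.set("spark.rapids.sql.castFloatToString.enabled", "true")
```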

In GpuCast.scala, the plan tree is checked recursively:

private def recursiveTagExprForGpuCheck(...) {
  ...
  case (_: FloatType | _: DoubleType, _: StringType) if !conf.isCastFloatToStringEnabled =>
    willNotWorkOnGpu("the GPU will use different precision than Java's toString method when " +
        "converting floating point data types to strings and this can produce results that " +
        "differ from the default behavior in Spark.  To enable this operation on the GPU, set" +
        s" ${RapidsConf.ENABLE_CAST_FLOAT_TO_STRING} to true.")
}

I think we can handle the precision problem of float/double->string in a similar way here.
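As a concrete illustration of the precision mismatch (the fixed 9-digit format below only mimics the behavior described above, it is not cuDF's exact algorithm):

```scala
object FloatToStringPrecision {
  def main(args: Array[String]): Unit = {
    val f = 0.1f
    val cpuStyle = f.toString                 // Java's shortest round-trip form
    val gpuStyle = "%.9f".format(f.toDouble)  // fixed 9 digits after the point
    println(s"CPU-style: $cpuStyle")   // 0.1
    println(s"GPU-style: $gpuStyle")   // 0.100000001
  }
}
```

The two strings differ even though they encode the same float bits, which is why the existing conf gates the cast instead of silently changing results.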

sinkinben avatar Aug 16 '22 09:08 sinkinben

build

revans2 avatar Sep 06 '22 17:09 revans2