
Implement all the casting cases that the GPU can support for ORC reading.

firestarman opened this issue 2 years ago • 3 comments

There will be more than 100 cases, so we may need multiple sub-issues for this. See the full type-casting list that CPU ORC supports.

firestarman avatar Jul 29 '22 02:07 firestarman

Note that for the CHAR type, casting to a string requires stripping the trailing whitespace from the value to match the CPU behavior. See https://github.com/NVIDIA/spark-rapids/pull/6188#discussion_r935683053. It would be nice if we could ask libcudf to load the CHAR column with trailing whitespace stripped instead of added, so we don't have to perform a post-processing step on the CHAR columns.
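If libcudf cannot do that at load time, the post-processing step would look roughly like the sketch below. This is a minimal sketch assuming cuDF's Java binding exposes `ColumnView.rstrip(Scalar)`; `stripCharPadding` is a hypothetical helper name.

```scala
import ai.rapids.cudf.{ColumnVector, Scalar}

// Hypothetical post-processing for CHAR columns read from ORC: strip the
// space padding so the values match the CPU's string representation.
// Assumes cuDF's Java binding exposes ColumnView.rstrip(Scalar), which
// removes only trailing occurrences of the given characters.
def stripCharPadding(charCol: ColumnVector): ColumnVector = {
  var space: Scalar = null
  try {
    space = Scalar.fromString(" ")
    charCol.rstrip(space)
  } finally {
    if (space != null) space.close()
  }
}
```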

jlowe avatar Aug 02 '22 16:08 jlowe

I have divided these casts into the following subcategories, according to the source type.

  • [x] integer -> integer (done in #5960)
  • [x] bool/int8/int16/int32/int64 -> {string, float, double(float64), timestamp}.
    • Issue #6272
    • PR #6273

  • [x] float32 -> {bool, integer types, double, string, timestamp}
  • [x] double -> {bool, integer types, float32, string, timestamp}
    • Issue #6291
    • PR #6319


  • [ ] [Pending] string -> {bool, integer types, float32, double, timestamp, date}
    • Issue #6394
    • Draft PR #6411

  • [ ] timestamp -> {integer types, float32, double, string, date}

Two special cases:

  • [ ] decimal <-> {bool, integer types, float, double, string, timestamp}
  • [ ] binary <-> string

As mentioned above, attention should be paid to the trailing whitespace of char/varchar/string values.

sinkinben avatar Aug 10 '22 10:08 sinkinben

A Summary of Implementation Details

Casting from Integer Types

| Casting | Implementation Description |
| --- | --- |
| bool -> float/double | Based on `ColumnVector.castTo` in cuDF. |
| bool -> string | Call `castTo`, then convert the results to uppercase `TRUE`/`FALSE` (matching the CPU behavior). |
| int8/16/32/64 -> float/double/string | Call `castTo`. |
| bool/int8/16/32 -> timestamp | The original values are in seconds; convert them to microseconds. Since timestamps are stored as int64, there is no integer overflow. |
| int64 -> timestamp | 1. From spark311 up to (but not including) spark320 (i.e. the 311, 312, 313, and 314 shims), the integers are treated as milliseconds when casting to timestamp. 2. For spark320+ (including spark320), they are treated as seconds. 3. In both cases, convert them to microseconds (see the sketch after this table). |
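To make the int64 -> timestamp rule concrete, here is a minimal sketch assuming the cuDF Java API (`ColumnView.mul` and `castTo`); `longToTimestamp` and the `treatAsSeconds` flag are hypothetical names, and INT64 overflow checking is omitted for brevity.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Scale int64 values to microseconds, then reinterpret them as a
// microsecond timestamp. Spark 3.1.x inputs are milliseconds (x1000);
// Spark 3.2.0+ inputs are seconds (x1000000). Overflow checks omitted.
def longToTimestamp(longs: ColumnVector, treatAsSeconds: Boolean): ColumnVector = {
  val factor = if (treatAsSeconds) 1000000L else 1000L
  var scale: Scalar = null
  var micros: ColumnVector = null
  try {
    scale = Scalar.fromLong(factor)
    micros = longs.mul(scale)
    micros.castTo(DType.TIMESTAMP_MICROSECONDS)
  } finally {
    if (micros != null) micros.close()
    if (scale != null) scale.close()
  }
}
```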

Casting from Float Types

| Casting | Implementation Description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace the rows that cannot fit in a long with nulls. 2. Convert the ColumnVector to the long type. 3. Down-cast the longs to the target integral type (see the sketch after this table). |
| float <-> double | 1. Call `ColumnView.castTo`. 2. When casting double -> float, if the double value is greater than `FLOAT_MAX`, replace it with `Infinity`. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10. 2. Added a config item `spark.rapids.sql.format.orc.floatTypesToString.enable` (default `true`) to control whether casting float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If `ROUND(val * 1000) > LONG_MAX` (e.g. `val = 1e20`), replace the value with null; otherwise convert the values into a milliseconds vector. 3. Multiply by 1000 to get a microseconds vector, paying attention to long (INT64) overflow, since timestamps are stored as INT64. |
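A minimal sketch of the float/double -> integral steps, assuming the cuDF Java API (`greaterOrEqualTo`, `lessOrEqualTo`, `and`, `copyWithBooleanColumnAsValidity`, `castTo`); `doubleToInt` is a hypothetical helper, shown here for double -> int32.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Step 1: null out the rows that cannot fit in a long (NaN compares
// false on both bounds, so it is nulled as well).
// Step 2: cast double -> long. Step 3: down-cast long -> int32.
def doubleToInt(doubles: ColumnVector): ColumnVector = {
  val resources = new java.util.ArrayList[AutoCloseable]()
  def track[T <: AutoCloseable](r: T): T = { resources.add(r); r }
  try {
    val lo = track(Scalar.fromDouble(Long.MinValue.toDouble))
    val hi = track(Scalar.fromDouble(Long.MaxValue.toDouble))
    val inRange = track(track(doubles.greaterOrEqualTo(lo))
      .and(track(doubles.lessOrEqualTo(hi))))
    val nulled = track(doubles.copyWithBooleanColumnAsValidity(inRange))
    val asLongs = track(nulled.castTo(DType.INT64))
    asLongs.castTo(DType.INT32)
  } finally {
    resources.forEach(_.close())
  }
}
```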

Casting from String

| Casting | Implementation Description |
| --- | --- |
| string -> bool/int8/16/32/64 | 1. Check the pattern of the input strings with a regular expression and replace invalid strings with null. 2. Follow the CPU ORC conversion: first convert the strings to longs, then down-cast the longs to the target integral type. 3. For string -> bool, `"true"`/`"false"` are invalid; the valid values are `"0"` and `"1"`. |
| string -> float/double | 1. Check the pattern of the input strings with a regex, ignoring leading/trailing spaces, and replace invalid strings with null (see the sketch after this table). 2. Call `castTo(FLOAT)` or `castTo(DOUBLE)` in cuDF. |
| string -> date/timestamp | Working on it. |
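A minimal sketch of the string -> float/double path, assuming the cuDF Java API (`ColumnView.strip`, `matchesRe`, `copyWithBooleanColumnAsValidity`, `castTo`); the regex is illustrative, not the exact pattern that CPU ORC accepts.

```scala
import ai.rapids.cudf.{ColumnVector, DType}

// Strip surrounding spaces, null out the strings that do not look like
// a floating-point literal, then let cuDF parse the survivors.
def stringToDouble(strs: ColumnVector): ColumnVector = {
  // Illustrative pattern: optional sign, digits and/or a fraction,
  // optional exponent. Not the exact CPU ORC pattern.
  val floatRegex = "^[+-]?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([eE][+-]?[0-9]+)?$"
  var stripped: ColumnVector = null
  var valid: ColumnVector = null
  var nulled: ColumnVector = null
  try {
    stripped = strs.strip()
    valid = stripped.matchesRe(floatRegex)
    nulled = stripped.copyWithBooleanColumnAsValidity(valid)
    nulled.castTo(DType.FLOAT64)
  } finally {
    Seq(stripped, valid, nulled).foreach(c => if (c != null) c.close())
  }
}
```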

Casting from Date Types (TODO)

| Casting | Implementation Description |
| --- | --- |
| date -> string | Call `ColumnView.asStrings()`, which calls `asStrings("%Y-%m-%d")` internally. |
| date -> timestamp | 1. Convert the date column vector to the INT64 type. 2. Multiply it by `24 * 60 * 60 * 1e6` (microseconds per day), then convert it to `TIMESTAMP_MICROSECONDS` (see the sketch after this table). |
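A minimal sketch of date -> timestamp, assuming the cuDF Java API; a DATE column stores days since the epoch, so scaling by the microseconds per day yields the ticks a microsecond timestamp expects.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// days-since-epoch -> microseconds-since-epoch -> TIMESTAMP_MICROSECONDS
def dateToTimestamp(dates: ColumnVector): ColumnVector = {
  val microsPerDay = 24L * 60L * 60L * 1000000L
  var days: ColumnVector = null
  var scale: Scalar = null
  var micros: ColumnVector = null
  try {
    days = dates.castTo(DType.INT64)     // days since epoch as plain longs
    scale = Scalar.fromLong(microsPerDay)
    micros = days.mul(scale)
    micros.castTo(DType.TIMESTAMP_MICROSECONDS)
  } finally {
    Seq(days, micros).foreach(c => if (c != null) c.close())
    if (scale != null) scale.close()
  }
}
```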

However, there are still some issues. For more details, see the comments in #6357.

Here is the code branch.

sinkinben avatar Aug 17 '22 09:08 sinkinben

As mentioned in the discussion at https://github.com/apache/orc/issues/1237, both the Apache Spark and ORC communities recommend using an explicit SQL CAST instead of depending on the data source's schema evolution. That is, we can replace schema evolution with CAST in SQL.

For example, suppose we have an ORC file containing one column `date_str`, whose values are strings in the pattern `YYYY-mm-dd`.

```scala
// Read `date_str` as a string; do not use schema evolution.
scala> var df = spark.read.schema("date_str string").orc("/tmp/orc/data.orc");
scala> df.show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|2022-01-32|
|9808-02-30|
|2022-06-31|
+----------+

// Cast `date_str` to the `date` type using a SQL CAST.
scala> df.registerTempTable("table")
scala> df.sqlContext.sql("select CAST(date_str as date) from table").show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|      null|
|      null|
|      null|
+----------+
```

sinkinben avatar Aug 30 '22 06:08 sinkinben