
Implement all the casting cases that the GPU can support for ORC reading.

firestarman opened this issue 2 years ago • 3 comments

There will be more than 100 cases, so we may need multiple sub-issues for this. See the full type-casting list that CPU ORC supports.

firestarman avatar Jul 29 '22 02:07 firestarman

Note that for the CHAR type, casting to a string requires stripping the trailing whitespace from the value to match the CPU behavior. See https://github.com/NVIDIA/spark-rapids/pull/6188#discussion_r935683053. It would be nice if we could ask libcudf to load the CHAR column with trailing whitespace stripped instead of added, so we don't have to perform a post-processing step on the CHAR columns.
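If libcudf cannot do that at load time, the post-processing step would look roughly like the sketch below. This is a minimal sketch assuming cuDF's Java binding exposes `ColumnView.rstrip(Scalar)`; `stripCharPadding` is a hypothetical helper name.

```scala
import ai.rapids.cudf.{ColumnVector, Scalar}

// Hypothetical post-processing for CHAR columns read from ORC: strip the
// space padding so the values match the CPU's string representation.
// Assumes cuDF's Java binding exposes ColumnView.rstrip(Scalar), which
// removes only trailing occurrences of the given characters.
def stripCharPadding(charCol: ColumnVector): ColumnVector = {
  var space: Scalar = null
  try {
    space = Scalar.fromString(" ")
    charCol.rstrip(space)
  } finally {
    if (space != null) space.close()
  }
}
```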

jlowe avatar Aug 02 '22 16:08 jlowe

I have divided these casts into the following subcategories, according to the source type.

  • [x] integer -> integer (done in #5960)
  • [x] bool/int8/int16/int32/int64 -> {string, float, double(float64), timestamp}.
    • Issue #6272
    • PR #6273

  • [x] float32 -> {bool, integer types, double, string, timestamp}
  • [x] double -> {bool, integer types, float32, string, timestamp}
    • Issue #6291
    • PR #6319


  • [ ] [Pending] string -> {bool, integer types, float32, double, timestamp, date}
    • Issue #6394
    • Draft PR #6411

  • [ ] timestamp -> {integer types, float32, double, string, date}

Two special cases:

  • [ ] decimal <-> {bool, integer types, float, double, string, timestamp}
  • [ ] binary <-> string

As mentioned above, attention should be paid to the trailing whitespace of char/varchar/string values.

sinkinben avatar Aug 10 '22 10:08 sinkinben

A Summary of Implementation Details

Casting from Integer Types

| Casting | Implementation Description |
| --- | --- |
| bool -> float/double | Based on `ColumnVector.castTo` in cuDF. |
| bool -> string | Call `castTo`, then convert the results to uppercase `TRUE`/`FALSE` (matching the CPU behavior). |
| int8/16/32/64 -> float/double/string | Call `castTo`. |
| bool/int8/16/32 -> timestamp | The original values are in seconds; convert them to microseconds. Since timestamps are stored as int64, there is no integer overflow. |
| int64 -> timestamp | 1. From spark311 up to (but not including) spark320 (i.e. the 311, 312, 313, and 314 shims), the integers are treated as milliseconds when casting to timestamp. 2. For spark320+ (including spark320), they are treated as seconds. 3. In both cases, convert them to microseconds (see the sketch after this table). |
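To make the int64 -> timestamp rule concrete, here is a minimal sketch assuming the cuDF Java API (`ColumnView.mul` and `castTo`); `longToTimestamp` and the `treatAsSeconds` flag are hypothetical names, and INT64 overflow checking is omitted for brevity.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Scale int64 values to microseconds, then reinterpret them as a
// microsecond timestamp. Spark 3.1.x inputs are milliseconds (x1000);
// Spark 3.2.0+ inputs are seconds (x1000000). Overflow checks omitted.
def longToTimestamp(longs: ColumnVector, treatAsSeconds: Boolean): ColumnVector = {
  val factor = if (treatAsSeconds) 1000000L else 1000L
  var scale: Scalar = null
  var micros: ColumnVector = null
  try {
    scale = Scalar.fromLong(factor)
    micros = longs.mul(scale)
    micros.castTo(DType.TIMESTAMP_MICROSECONDS)
  } finally {
    if (micros != null) micros.close()
    if (scale != null) scale.close()
  }
}
```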

Casting from Float Types

| Casting | Implementation Description |
| --- | --- |
| float/double -> {bool, int8/16/32/64} | 1. First replace the rows that cannot fit in a long with nulls. 2. Convert the ColumnVector to the long type. 3. Down-cast the longs to the target integral type (see the sketch after this table). |
| float <-> double | 1. Call `ColumnView.castTo`. 2. When casting double -> float, if the double value is greater than `FLOAT_MAX`, replace it with `Infinity`. |
| float/double -> string | 1. cuDF keeps 9 digits after the decimal point, while the CPU keeps more than 10. 2. Added a config item `spark.rapids.sql.format.orc.floatTypesToString.enable` (default `true`) to control whether casting float/double -> string is allowed while reading ORC. |
| float/double -> timestamp | 1. ORC assumes the original float/double values are in seconds. 2. If `ROUND(val * 1000) > LONG_MAX` (e.g. `val = 1e20`), replace the value with null; otherwise convert the values into a milliseconds vector. 3. Multiply by 1000 to get a microseconds vector, paying attention to long (INT64) overflow, since timestamps are stored as INT64. |
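A minimal sketch of the float/double -> integral steps, assuming the cuDF Java API (`greaterOrEqualTo`, `lessOrEqualTo`, `and`, `copyWithBooleanColumnAsValidity`, `castTo`); `doubleToInt` is a hypothetical helper, shown here for double -> int32.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// Step 1: null out the rows that cannot fit in a long (NaN compares
// false on both bounds, so it is nulled as well).
// Step 2: cast double -> long. Step 3: down-cast long -> int32.
def doubleToInt(doubles: ColumnVector): ColumnVector = {
  val resources = new java.util.ArrayList[AutoCloseable]()
  def track[T <: AutoCloseable](r: T): T = { resources.add(r); r }
  try {
    val lo = track(Scalar.fromDouble(Long.MinValue.toDouble))
    val hi = track(Scalar.fromDouble(Long.MaxValue.toDouble))
    val inRange = track(track(doubles.greaterOrEqualTo(lo))
      .and(track(doubles.lessOrEqualTo(hi))))
    val nulled = track(doubles.copyWithBooleanColumnAsValidity(inRange))
    val asLongs = track(nulled.castTo(DType.INT64))
    asLongs.castTo(DType.INT32)
  } finally {
    resources.forEach(_.close())
  }
}
```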

Casting from String

| Casting | Implementation Description |
| --- | --- |
| string -> bool/int8/16/32/64 | 1. Check the pattern of the input strings with a regular expression and replace invalid strings with null. 2. Follow the CPU ORC conversion: first convert the strings to longs, then down-cast the longs to the target integral type. 3. For string -> bool, `"true"`/`"false"` are invalid; the valid values are `"0"` and `"1"`. |
| string -> float/double | 1. Check the pattern of the input strings with a regex, ignoring leading/trailing spaces, and replace invalid strings with null (see the sketch after this table). 2. Call `castTo(FLOAT)` or `castTo(DOUBLE)` in cuDF. |
| string -> date/timestamp | Working on it. |
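A minimal sketch of the string -> float/double path, assuming the cuDF Java API (`ColumnView.strip`, `matchesRe`, `copyWithBooleanColumnAsValidity`, `castTo`); the regex is illustrative, not the exact pattern that CPU ORC accepts.

```scala
import ai.rapids.cudf.{ColumnVector, DType}

// Strip surrounding spaces, null out the strings that do not look like
// a floating-point literal, then let cuDF parse the survivors.
def stringToDouble(strs: ColumnVector): ColumnVector = {
  // Illustrative pattern: optional sign, digits and/or a fraction,
  // optional exponent. Not the exact CPU ORC pattern.
  val floatRegex = "^[+-]?([0-9]+(\\.[0-9]*)?|\\.[0-9]+)([eE][+-]?[0-9]+)?$"
  var stripped: ColumnVector = null
  var valid: ColumnVector = null
  var nulled: ColumnVector = null
  try {
    stripped = strs.strip()
    valid = stripped.matchesRe(floatRegex)
    nulled = stripped.copyWithBooleanColumnAsValidity(valid)
    nulled.castTo(DType.FLOAT64)
  } finally {
    Seq(stripped, valid, nulled).foreach(c => if (c != null) c.close())
  }
}
```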

Casting from Date Types (TODO)

| Casting | Implementation Description |
| --- | --- |
| date -> string | Call `ColumnView.asStrings()`, which calls `asStrings("%Y-%m-%d")` internally. |
| date -> timestamp | 1. Convert the date column vector to the INT64 type. 2. Multiply it by `24 * 60 * 60 * 1e6` (microseconds per day), then convert it to `TIMESTAMP_MICROSECONDS` (see the sketch after this table). |
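A minimal sketch of date -> timestamp, assuming the cuDF Java API; a DATE column stores days since the epoch, so scaling by the microseconds per day yields the ticks a microsecond timestamp expects.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar}

// days-since-epoch -> microseconds-since-epoch -> TIMESTAMP_MICROSECONDS
def dateToTimestamp(dates: ColumnVector): ColumnVector = {
  val microsPerDay = 24L * 60L * 60L * 1000000L
  var days: ColumnVector = null
  var scale: Scalar = null
  var micros: ColumnVector = null
  try {
    days = dates.castTo(DType.INT64)     // days since epoch as plain longs
    scale = Scalar.fromLong(microsPerDay)
    micros = days.mul(scale)
    micros.castTo(DType.TIMESTAMP_MICROSECONDS)
  } finally {
    Seq(days, micros).foreach(c => if (c != null) c.close())
    if (scale != null) scale.close()
  }
}
```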

However, there are still some issues. For more details, see the comments in #6357.

Here is the code branch.

sinkinben avatar Aug 17 '22 09:08 sinkinben

As mentioned in the discussion at https://github.com/apache/orc/issues/1237, both the Apache Spark and ORC communities recommend using an explicit SQL CAST instead of depending on the data source's schema evolution. That is, we can replace schema evolution with CAST in SQL.

For example, suppose we have an ORC file containing one column `date_str`, whose values are strings in the pattern `YYYY-mm-dd`.

```scala
// Read `date_str` as a string; do not use schema evolution.
scala> var df = spark.read.schema("date_str string").orc("/tmp/orc/data.orc");
scala> df.show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|2022-01-32|
|9808-02-30|
|2022-06-31|
+----------+

// Cast `date_str` to the `date` type using a SQL CAST.
scala> df.registerTempTable("table")
scala> df.sqlContext.sql("select CAST(date_str as date) from table").show()
+----------+
|  date_str|
+----------+
|2002-01-01|
|2022-08-29|
|2022-08-31|
|      null|
|      null|
|      null|
+----------+
```

sinkinben avatar Aug 30 '22 06:08 sinkinben