[FEA] Support reading JSON numeric values as timestamps and dates
Is your feature request related to a problem? Please describe. PR https://github.com/NVIDIA/spark-rapids/pull/4938 adds support for reading CSV and JSON strings as timestamps, handling valid timestamps in a number of formats consisting of year, month, day, hours, minutes, seconds, and so on. However, it does not support parsing numeric values (UNIX timestamps) representing elapsed time since the epoch.
Describe the solution you'd like Add support in GpuJsonScan for reading UNIX timestamps.
Describe alternatives you've considered None
Additional context See existing tests that reference this issue
@andygrove @jlowe @revans2 Please help check the following:
- Spark does not support reading CSV numeric UNIX timestamps as timestamps.
- Spark does support reading JSON numeric UNIX timestamps as timestamps.
But we have difficulty supporting this on GPU.
Spark behavior:
string "1653014790" => null
int 1653014790 => 2022-05-20 10:46:30
If one column contains both of the above values and cuDF reads the column as String, we cannot recover the original types, because we get two string rows with the same value.
Details:
Spark does not support reading CSV numeric UNIX timestamps as timestamps
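The CSV input used below is not shown in the original report; its contents, inferred from the un-schemaed output, would be:
cat /tmp/tmp.csv
1653012031
197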
$SPARK_HOME/bin/pyspark
from pyspark.sql.types import *
schema = StructType([StructField("c1", TimestampType())])
spark.read.csv("/tmp/tmp.csv").show()
+----------+
| _c0|
+----------+
|1653012031|
| 197|
+----------+
spark.read.schema(schema).csv("/tmp/tmp.csv").show()
+----+
| c1|
+----+
|null|
|null|
+----+
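As an aside, a long column can still be converted to a timestamp in Spark with an explicit cast, since CAST from a numeric type to TIMESTAMP treats the value as seconds since the epoch. A minimal sketch of that workaround (not part of the requested feature):
# read the column as LONG, then cast; CAST interprets the long as epoch seconds
spark.read.schema("c1 LONG").csv("/tmp/tmp.csv") \
    .selectExpr("CAST(c1 AS TIMESTAMP) AS c1").show()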
How Spark reads a JSON string int vs. a bare int
Spark code:
Spark uses the VALUE_STRING and VALUE_NUMBER_INT tokens to distinguish strings from ints:
https://github.com/apache/spark/blob/v3.3.0-rc2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L251-L266
case TimestampType =>
  (parser: JsonParser) => parseJsonToken[java.lang.Long](parser, dataType) {
    case VALUE_STRING if parser.getTextLength >= 1 =>
      try {
        timestampFormatter.parse(parser.getText)
      } catch {
        case NonFatal(e) =>
          // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
          // compatibility.
          val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(parser.getText))
          DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
      }

    case VALUE_NUMBER_INT =>
      parser.getLongValue * 1000000L
  }
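The "* 1000000L" converts epoch seconds into the microseconds-since-epoch representation Spark uses internally for TimestampType. A quick Python check of the arithmetic (the 10:46:30 rendering in these examples is consistent with a UTC+8 session timezone, which is an assumption on our part):
from datetime import datetime, timezone

micros = 1653014790 * 1000000  # seconds -> microseconds, as JacksonParser does
print(datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc))
# 2022-05-20 02:46:30+00:00, which a UTC+8 session displays as 2022-05-20 10:46:30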
Different behavior for a string int and a bare int
cat tmp.json
{ "c1": "2020-01-01 00:00:00" }
{ "c1": 1653014790 } // Valid int
$SPARK_HOME/bin/pyspark
from pyspark.sql.types import *
schema = StructType([StructField("c1", TimestampType())])
spark.read.schema(schema).json("/tmp/tmp.json").show()
+-------------------+
| c1|
+-------------------+
|2020-01-01 00:00:00|
|2022-05-20 10:46:30| // Valid timestamp
+-------------------+
cat tmp2.json
{ "c1": "2020-01-01 00:00:00" }
{ "c1": "1653014790" } // Note: it's a string, not int
spark.read.schema(schema).json("/tmp/tmp2.json").show()
+-------------------+
| c1|
+-------------------+
|2020-01-01 00:00:00|
| null| // Note
+-------------------+
cuDF reads the column as long
JSON input:
{ "c1": "2020-01-13" }
{ "c1": 1653014790 }
{ "c1": 0 }
result:
0          isNull = false // IMO, isNull should be true here: the input was the string "2020-01-13"
1653014790 isNull = false
0          isNull = false // cannot be distinguished from the first row
A value of 0 indicates the epoch, 1970-01-01 00:00:00.
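A minimal sketch of reproducing this through cudf's Python API (assuming cudf.read_json with lines=True and a per-column dtype; exact null handling may vary by cudf version):
import cudf

# tmp.json holds the three rows above, one JSON object per line
df = cudf.read_json("tmp.json", lines=True, dtype={"c1": "int64"})
print(df)                 # rows 0 and 2 both come back as 0
print(df["c1"].isnull())  # no nulls reported, per the observation above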
@res-life I will take a look at this later today
@res-life Yes, this is a known issue with JSON parsing. Spark distinguishes between quoted fields and non-quoted fields. For now we are just documenting this:
https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/docs/compatibility.md#json-floating-point
Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where Spark will treat them as invalid inputs and will just return null.
Although it should be moved to its own section to make it more visible.
We hope that once cuDF finishes work on the new JSON parser, we can get some help from them to indicate whether a field was quoted or not.
Removing from 22.08. Will put it into a release once the cudf dependencies are resolved.
It's not planned for 22.12; let me unassign myself.
I have failed to find a case where Spark CSV can read numeric values as timestamps, or any related config that enables this behavior.
Maybe we can remove 'CSV' from the title, since this issue appears to be specific to the JSON reader.
@andygrove Do you have any concerns about removing 'CSV'?
I am going to remove 'CSV' from the title.
If there are any concerns, feel free to add it back.