
[FEA] Support reading JSON numeric values as timestamps and dates

Open • andygrove opened this issue 3 years ago • 7 comments

Is your feature request related to a problem? Please describe. PR https://github.com/NVIDIA/spark-rapids/pull/4938 adds support for reading CSV and JSON strings as timestamps, handling valid timestamps in a number of formats made up of year, month, day, hours, minutes, seconds, and so on. However, it does not support parsing numeric values (UNIX timestamps) that represent time elapsed since the epoch.
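
For reference, a UNIX timestamp here means whole seconds elapsed since 1970-01-01 00:00:00 UTC. A quick sanity check in plain Python, using a value that appears in the examples later in this thread (the outputs in the thread render it in the session's local timezone, which appears to be UTC+8):

from datetime import datetime, timezone

# 1653014790 seconds since the epoch is 2022-05-20 02:46:30 UTC,
# which a UTC+8 Spark session displays as 2022-05-20 10:46:30.
print(datetime.fromtimestamp(1653014790, tz=timezone.utc))
# 2022-05-20 02:46:30+00:00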

Describe the solution you'd like Add support to GpuJsonScan for reading UNIX timestamps.

Describe alternatives you've considered None

Additional context See existing tests that reference this issue

andygrove commented Mar 11 '22

@andygrove @jlowe @revans2 Please help check the following:

  • Spark does not support reading CSV numeric UNIX timestamps as timestamps.
  • Spark does support reading JSON numeric UNIX timestamps as timestamps, but we have difficulty supporting this on the GPU. Spark behavior: the string "1653014790" => null, while the int 1653014790 => 2022-05-20 10:46:30. If one column contains both of the above values, we cannot recover the original types once cuDF reads the column as String, because we get two string rows with the same value.

Details:

Spark does not support reading CSV numeric UNIX timestamps as timestamps
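
The input file is not shown in the thread; inferring its contents from the outputs below, /tmp/tmp.csv would look like this:

cat /tmp/tmp.csv
1653012031
197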

$SPARK_HOME/bin/pyspark 
from pyspark.sql.types import *
schema = StructType([StructField("c1", TimestampType())])

spark.read.csv("/tmp/tmp.csv").show()
+----------+
|       _c0|
+----------+
|1653012031|
|       197|
+----------+

spark.read.schema(schema).csv("/tmp/tmp.csv").show()
+----+
|  c1|
+----+
|null|
|null|
+----+

Spark reads a JSON string int and a plain int differently

Spark code: Spark uses the Jackson tokens VALUE_STRING and VALUE_NUMBER_INT to distinguish a string from an int:

https://github.com/apache/spark/blob/v3.3.0-rc2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L251-L266

    case TimestampType =>
      (parser: JsonParser) => parseJsonToken[java.lang.Long](parser, dataType) {
        case VALUE_STRING if parser.getTextLength >= 1 =>
          try {
            timestampFormatter.parse(parser.getText)
          } catch {
            case NonFatal(e) =>
              // If fails to parse, then tries the way used in 2.0 and 1.x for backwards
              // compatibility.
              val str = DateTimeUtils.cleanLegacyTimestampStr(UTF8String.fromString(parser.getText))
              DateTimeUtils.stringToTimestamp(str, options.zoneId).getOrElse(throw e)
          }

        case VALUE_NUMBER_INT =>
          parser.getLongValue * 1000000L
      }
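
The VALUE_NUMBER_INT branch multiplies by 1000000L because unquoted ints are whole seconds since the epoch, while Spark stores timestamps internally as microseconds since the epoch. A minimal Python sketch of the branching above, with a single fixed format standing in for Spark's formatter-plus-legacy-fallback chain (the function name is illustrative, not a Spark API):

from datetime import datetime, timezone

def json_value_to_micros(token_type, value):
    if token_type == "VALUE_STRING":
        # Spark tries its configured TimestampFormatter, then a legacy fallback;
        # one fixed format stands in for both here.
        dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        return int(dt.timestamp()) * 1_000_000
    elif token_type == "VALUE_NUMBER_INT":
        # Unquoted ints are seconds since the epoch; convert to microseconds.
        return int(value) * 1_000_000
    return None  # any other token becomes null in this simplified model

print(json_value_to_micros("VALUE_NUMBER_INT", 1653014790))         # 1653014790000000
print(json_value_to_micros("VALUE_STRING", "2020-01-01 00:00:00"))  # 1577836800000000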

Different behavior for a string int and a plain int

cat tmp.json 
{ "c1": "2020-01-01 00:00:00" }
{ "c1": 1653014790 }  // Valid int

$SPARK_HOME/bin/pyspark 
from pyspark.sql.types import *
schema = StructType([StructField("c1", TimestampType())])

spark.read.schema(schema).json("/tmp/tmp.json").show()
+-------------------+
|                 c1|
+-------------------+
|2020-01-01 00:00:00|
|2022-05-20 10:46:30| // Valid timestamp
+-------------------+


cat tmp2.json 
{ "c1": "2020-01-01 00:00:00" }
{ "c1": "1653014790" }   // Note: it's a string, not int

spark.read.schema(schema).json("/tmp/tmp2.json").show()
+-------------------+
|                 c1|
+-------------------+
|2020-01-01 00:00:00|
|               null|    // Note
+-------------------+

cuDF reads the column as long

Input JSON:
{ "c1": "2020-01-13" }
{ "c1": 1653014790 }
{ "c1": 0 }

result:

value = 0            isNull = false   // input was the string "2020-01-13"; IMO isNull should be true
value = 1653014790   isNull = false
value = 0            isNull = false   // cannot be distinguished from the first row

A value of 0 denotes the epoch, 1970-01-01 00:00:00.
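
To make the ambiguity concrete: once the raw field is extracted as a string, a quoted int and an unquoted int collapse to the same value, so the plugin can no longer tell which rows Spark would have returned as null. A plain-Python illustration (not the actual plugin code):

import json

# Two JSON rows that Spark treats differently for TimestampType:
rows = ['{ "c1": "1653014790" }',  # quoted   -> Spark returns null
        '{ "c1": 1653014790 }']    # unquoted -> Spark returns a valid timestamp

extracted = [str(json.loads(r)["c1"]) for r in rows]
print(extracted)                     # ['1653014790', '1653014790']
print(extracted[0] == extracted[1])  # True: the quoting information is lost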

res-life commented May 20 '22

@res-life I will take a look at this later today

andygrove commented May 20 '22

@res-life Yes, this is a known issue with JSON parsing. Spark distinguishes between quoted fields and non-quoted fields. For now we are just documenting this:

https://github.com/NVIDIA/spark-rapids/blob/branch-22.06/docs/compatibility.md#json-floating-point

Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where Spark will treat them as invalid inputs and will just return null.

Although it should be moved to its own section to make it more visible.

We hope that once cuDF finishes work on the new JSON parser, we can get some help from them to indicate whether a field was quoted or not.

revans2 commented May 20 '22

Removing from 22.08. Will put it into a release once the cudf dependencies are resolved.

sameerz commented Jul 07 '22

It's not planned for 22.12, so let me unassign myself.

res-life commented Oct 18 '22

I have failed to find a case where Spark CSV can read numeric values as timestamps, or any related config to enable this behavior. Maybe we can remove the word 'CSV' from the title, since this issue appears to be specific to the JSON reader.

@andygrove Do you have any concerns about removing the word 'CSV'?

firestarman commented Oct 31 '22

I am going to remove the word 'CSV' from the title. If there are any concerns, feel free to add it back.

firestarman commented Nov 01 '22