
[FEA] Add option to read JSON field as unparsed string

Open andygrove opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

When reading JSON in Spark, if a field has mixed types, Spark will infer the type as String to avoid data loss due to the uncertainty of the actual data type.

For example, given this input file, Spark will read column bar as a numeric type and column foo as a string type.

$ cat test.json
{ "foo": [1,2,3], "bar": 123 }
{ "foo": { "a": 1 }, "bar": 456 }

Here is the Spark code that demonstrates this:

scala> val df = spark.read.json("test.json")
df: org.apache.spark.sql.DataFrame = [bar: bigint, foo: string]                 

scala> df.show
+---+-------+
|bar|    foo|
+---+-------+
|123|[1,2,3]|
|456|{"a":1}|
+---+-------+
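To illustrate the fallback rule described above, here is a minimal sketch in plain Python. It is not Spark's actual inference code: the function name `infer_column_types` and the type labels are hypothetical, and real Spark has additional widening rules (for example, integer and floating-point values merge to a numeric type rather than string). It only shows the idea that a field whose values disagree in kind is read as a string.

```python
import json

def infer_column_types(lines):
    """Infer one type label per top-level field; fall back to 'string' on conflict.

    Hypothetical helper for illustration only, not Spark's implementation.
    """
    def kind(value):
        # bool must be checked before int because bool is a subclass of int
        if isinstance(value, bool):
            return "boolean"
        if isinstance(value, int):
            return "bigint"
        if isinstance(value, float):
            return "double"
        if isinstance(value, str):
            return "string"
        if isinstance(value, list):
            return "array"
        return "struct"

    types = {}
    for line in lines:
        for name, value in json.loads(line).items():
            if value is None:
                continue  # nulls do not contribute to the inferred type
            seen = kind(value)
            previous = types.setdefault(name, seen)
            if previous != seen:
                types[name] = "string"  # mixed kinds: keep the data as text
    return types

records = [
    '{ "foo": [1,2,3], "bar": 123 }',
    '{ "foo": { "a": 1 }, "bar": 456 }',
]
print(infer_column_types(records))
```

With the two records from `test.json`, `foo` is seen first as an array and then as a struct, so it falls back to string, while `bar` stays numeric.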

Currently, Spark RAPIDS fails for this example because cuDF does not support mixed types in a column:

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-181-cuda11/thirdparty/cudf/cpp/src/io/json/json_column.cu:577: A mix of lists and structs within the same column is not supported
  at ai.rapids.cudf.Table.readJSON(Native Method)

Describe the solution you'd like

I would like the ability to specify to read certain columns as unparsed strings.

Describe alternatives you've considered

I am also exploring some workarounds in the Spark RAPIDS plugin.

Additional context

andygrove avatar Sep 29 '23 21:09 andygrove

We have some code that @ttnghia wrote. It converts a range of tokens to a normalized string that matches what Spark wants. We did this for some Spark-specific JSON parsing functionality related to returning a Map instead of a Struct.

https://github.com/NVIDIA/spark-rapids-jni/blob/54ef9991f46fa873d580315212aeae345da7152a/src/main/cpp/src/map_utils.cu#L63-L112

I am not sure if this is really something that CUDF wants, but it is at least a starting point.

revans2 avatar Oct 02 '23 21:10 revans2

Here are some examples, showing input and expected output.

# Example 1: Mixed primitive types in struct

INPUT:

{ "a": "123" }
{ "a": 123 }

EXPECTED:

+-----------+
|    my_json|
+-----------+
|{"a":"123"}|
|{"a":"123"}|
+-----------+

# Example 2: Mixed structs and lists in struct

INPUT:

{ "a": [1,2,3] }
{ "a": { "b": 1 } }

EXPECTED:

+-----------------+
|          my_json|
+-----------------+
|  {"a":"[1,2,3]"}|
|{"a":"{\"b\":1}"}|
+-----------------+

# Example 3: Mixed structs and primitives in struct

INPUT:

{ "a": "fox" }
{ "a": { "b": 1 } }

EXPECTED:

+-----------------+
|my_json          |
+-----------------+
|{"a":"fox"}      |
|{"a":"{\"b\":1}"}|
+-----------------+

# Example 4: Mixed lists and primitives in struct

INPUT:

{ "a": [1,2,3] }
{ "a": "fox" }

EXPECTED:

+---------------+
|my_json        |
+---------------+
|{"a":"[1,2,3]"}|
|{"a":"fox"}    |
+---------------+
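For reference, the expected output in Examples 1 to 4 can be reproduced with a small plain-Python sketch. This is only a model of the desired behavior, not the cuDF or Spark implementation: the helper name `stringify_mixed_field` is hypothetical, and it re-serializes the parsed value in compact form rather than preserving the original text of the field.

```python
import json

def stringify_mixed_field(lines, field):
    """Return each record as compact JSON with `field` coerced to a string.

    Hypothetical helper for illustration; strings pass through unchanged,
    every other value is replaced by its compact JSON text.
    """
    compact = {"separators": (",", ":")}
    out = []
    for line in lines:
        record = json.loads(line)
        value = record[field]
        if not isinstance(value, str):
            record[field] = json.dumps(value, **compact)
        out.append(json.dumps(record, **compact))
    return out

# Example 2 from above: mixed structs and lists
for row in stringify_mixed_field(['{ "a": [1,2,3] }', '{ "a": { "b": 1 } }'], "a"):
    print(row)
```

Running it on the inputs of Example 2 prints `{"a":"[1,2,3]"}` and `{"a":"{\"b\":1}"}`, matching the table above.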

andygrove avatar Dec 07 '23 16:12 andygrove

There is a separate use case for arrays where the array element type differs between records. Spark infers the type as Array<String> in this case.

This is not necessarily a high priority and could be split out into a separate issue, but I'd like to point it out here for visibility.

# Example: Mixed primitive arrays in struct

INPUT:

{ "a": [1,2,3] }
{ "a": [true,false,true] }
{ "a": ["a", "b", "c"] }

EXPECTED:

+-----------------------------+
|my_json                      |
+-----------------------------+
|{"a":["1","2","3"]}          |
|{"a":["true","false","true"]}|
|{"a":["a","b","c"]}          |
+-----------------------------+
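The array case differs from the earlier examples in that the coercion applies per element rather than to the whole field. A minimal plain-Python sketch of that expected behavior follows; the helper name `stringify_array_elements` is hypothetical and this is not how cuDF or Spark implement it.

```python
import json

def stringify_array_elements(lines, field):
    """Return each record as compact JSON with every element of `field` as a string.

    Hypothetical helper for illustration; assumes `field` always holds a list.
    """
    compact = {"separators": (",", ":")}
    out = []
    for line in lines:
        record = json.loads(line)
        record[field] = [
            element if isinstance(element, str) else json.dumps(element, **compact)
            for element in record[field]
        ]
        out.append(json.dumps(record, **compact))
    return out

rows = stringify_array_elements(
    ['{ "a": [1,2,3] }', '{ "a": [true,false,true] }', '{ "a": ["a", "b", "c"] }'],
    "a",
)
for row in rows:
    print(row)
```

The field stays an array in every record, which is why Spark reports the type as Array<String> rather than String.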

andygrove avatar Dec 07 '23 16:12 andygrove

We made significant progress on this issue with #14572, and I believe we will be able to close it after #14936. @andygrove would you please let us know if there are other cases to consider?

GregoryKimball avatar Feb 16 '24 21:02 GregoryKimball

For all the examples in https://github.com/rapidsai/cudf/issues/14239#issuecomment-1845680168, I see the correct results with https://github.com/rapidsai/cudf/pull/14936.

For the mixed array example in https://github.com/rapidsai/cudf/issues/14239#issuecomment-1845685377, I still do not see the correct results, so I filed a separate issue for it (https://github.com/rapidsai/cudf/issues/15120).

andygrove avatar Feb 22 '24 22:02 andygrove