spark-rapids
[BUG] ScanJson and JsonToStructs cannot handle nested empty arrays/structs
Describe the bug
This is with https://github.com/NVIDIA/spark-rapids/pull/10575
Seq("""{"a":[]}""").toDF("json").repartition(1).selectExpr("from_json(json, 'a array<string>')").show()
results in an error like the following:
Caused by: java.lang.AssertionError: Type conversion is not allowed from STRUCT(LIST(INT8)) to StructType(StructField(a,ArrayType(StringType,true),true)) expected STRUCT(LIST(STRING))
at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:711)
at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$doItColumnar$1(GpuExpressions.scala:254)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.GpuUnaryExpression.doItColumnar(GpuExpressions.scala:250)
at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$columnarEval$1(GpuExpressions.scala:261)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.GpuUnaryExpression.columnarEval(GpuExpressions.scala:260)
at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
This error shows up only if assertions are enabled.
Similarly,
Seq("""{"a":1,"b":"","c":[]}""").toDF("json").repartition(1).selectExpr("from_json(json, 'a int, b string, c array<string>')").show()
throws:
Caused by: java.lang.AssertionError: Type conversion is not allowed from STRUCT(INT32,STRING,LIST(INT8)) to StructType(StructField(a,IntegerType,true),StructField(b,StringType,true),StructField(c,ArrayType(StringType,true),true)) expected STRUCT(INT32,STRING,LIST(STRING))
at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:711)
at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$doItColumnar$1(GpuExpressions.scala:254)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.GpuUnaryExpression.doItColumnar(GpuExpressions.scala:250)
at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$columnarEval$1(GpuExpressions.scala:261)
It looks like CUDF ignores our request that the returned value be a LIST(STRING) and returns a LIST(INT8) instead. This feels like a bug in CUDF. We can probably work around it if we need to, but the workaround is not going to be simple.
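To make the mismatch in the two assertion messages easier to see, here is a minimal, self-contained sketch of a recursive comparison between a requested nested type and a returned one. It is only a toy model: `ColType`, `ListOf`, `StructOf`, and `TypeCheck` are hypothetical names invented for illustration, and the code does not use or reflect the actual cuDF or spark-rapids APIs.

```scala
// Toy model of a nested column type. All names here are hypothetical;
// this is not the cuDF or spark-rapids type system.
sealed trait ColType
case object Int8 extends ColType
case object Int32 extends ColType
case object Str extends ColType
final case class ListOf(elem: ColType) extends ColType
final case class StructOf(fields: Vector[ColType]) extends ColType

object TypeCheck {
  // Render a type in the same style as the assertion message,
  // e.g. STRUCT(INT32,STRING,LIST(INT8)).
  def render(t: ColType): String = t match {
    case Int8         => "INT8"
    case Int32        => "INT32"
    case Str          => "STRING"
    case ListOf(e)    => s"LIST(${render(e)})"
    case StructOf(fs) => fs.map(render).mkString("STRUCT(", ",", ")")
  }

  // Walk both type trees together and report every position where the
  // returned type differs from the requested one.
  def mismatches(expected: ColType, actual: ColType, path: String = "$"): List[String] =
    (expected, actual) match {
      case (ListOf(exp), ListOf(act)) =>
        mismatches(exp, act, path + "[]")
      case (StructOf(exps), StructOf(acts)) if exps.length == acts.length =>
        exps.zip(acts).zipWithIndex.toList.flatMap { case ((exp, act), i) =>
          mismatches(exp, act, s"$path.$i")
        }
      case (exp, act) if exp == act => Nil
      case (exp, act) =>
        List(s"$path: expected ${render(exp)} but got ${render(act)}")
    }
}
```

Calling `TypeCheck.mismatches` with the requested `STRUCT(INT32,STRING,LIST(STRING))` and the returned `STRUCT(INT32,STRING,LIST(INT8))` from the second example isolates the difference to the element type of the third field, which is the empty array. Any workaround would have to repair the type at that position.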
I should add that an empty struct results in a different error.
Seq("""{"a":1,"b":"","c":{}}""").toDF("json").repartition(1).selectExpr("from_json(json, 'a int, b string, c struct<a string>')").show()
Caused by: java.lang.NullPointerException
at ai.rapids.cudf.Table.gatherJSONColumns(Table.java:1105)
at ai.rapids.cudf.Table.gatherJSONColumns(Table.java:1225)
at ai.rapids.cudf.Table.readJSON(Table.java:1391)
at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$2(GpuJsonToStructs.scala:180)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$1(GpuJsonToStructs.scala:178)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at org.apache.spark.sql.rapids.GpuJsonToStructs.doColumnar(GpuJsonToStructs.scala:176)
This looks almost identical to reading a list with only empty top-level structs.