spark-rapids [FEA] Support json_tuple

I wish we could support json_tuple.

from pyspark.sql.functions import *
data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
df = spark.createDataFrame(data, ("key", "jstring"))
df.select(df.key, json_tuple(df.jstring, 'f1', 'f2')).collect()

Not-supported-messages:

  !Exec <GenerateExec> cannot run on GPU because not all expressions can be replaced
    ! <JsonTuple> json_tuple(jstring#1, f1, f2) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.JsonTuple

Oct 05 '22 17:10 viadea

In the short term we could probably implement this in terms of calling get_json_object multiple times. Not sure how well that would scale for very large numbers of objects, as we have to parse the JSON multiple times.
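As a rough user-level sketch of that idea (not the plugin's implementation), the json_tuple call from the repro can be rewritten as one get_json_object call per field. The helper names `naive_paths` and `json_tuple_as_columns` are hypothetical, and the sketch assumes simple field names with no special JSON path characters. The `c0`, `c1`, ... aliases mirror the default column names json_tuple produces. Edge-case behavior may not match json_tuple exactly.

```python
def naive_paths(fields):
    """Pair each field name with a json_tuple-style alias and a naive JSON path.

    Assumption: names are plain identifiers such as "f1", so "$." + name is a valid path.
    """
    return [(f"c{i}", "$." + name) for i, name in enumerate(fields)]


def json_tuple_as_columns(json_col, *fields):
    """Build one get_json_object column per requested field (hypothetical helper)."""
    # Imported lazily so naive_paths can be used without a Spark installation.
    from pyspark.sql import functions as F

    # Each column re-parses the JSON string, which is the scaling concern noted above.
    return [
        F.get_json_object(json_col, path).alias(alias)
        for alias, path in naive_paths(fields)
    ]
```

With the DataFrame from the repro this would be used as `df.select(df.key, *json_tuple_as_columns(df.jstring, "f1", "f2"))`. It is a plain projection rather than a generator.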

Oct 05 '22 18:10 revans2

I messed up on this. For whatever reason json_tuple is implemented in terms of a generator instead of a normal expression. Because of this, just allowing JsonTuple to be put onto the GPU is not going to be enough. I am not sure how this is going to work, because GpuGenerator is not really set up to deal with an expression like JsonTuple, where it can take more than a single parameter. This is going to take a bit more thinking and probably some changes to GpuGenerateExec.

Also the input field names are not the same as what get_json_object uses. We are probably going to have to make sure that they are literal values, even though json_tuple itself does not require them to be literals. Then we are going to have to prepend "$." to each name so that it matches the paths correctly, and escape any special JSON path characters within the name of the field.
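A minimal sketch of that name-to-path conversion, written in Python for illustration. The function name `field_to_json_path` is hypothetical. Which characters count as special is an assumption here: `.` and `[` are treated as needing bracket notation (`$['name']`), and names containing `'` or `?`, or the bare name `*`, are rejected as not expressible. The real rules would have to be checked against the JSON path parser that get_json_object uses.

```python
# Assumption: these characters end or change a key in the dotted "$.name" form.
_DOT_SPECIAL = set(".[")
# Assumption: these characters cannot appear inside the bracketed "$['name']" form.
_BRACKET_UNSAFE = set("'?")


def field_to_json_path(name):
    """Turn a json_tuple field name into a get_json_object style path (sketch)."""
    if not isinstance(name, str) or name == "":
        raise ValueError("field name must be a non-empty literal string")
    if name == "*" or any(ch in _BRACKET_UNSAFE for ch in name):
        # A bare "*" would read as a wildcard, and quotes cannot be escaped here.
        raise ValueError(f"field name {name!r} cannot be expressed as a JSON path")
    if any(ch in _DOT_SPECIAL for ch in name):
        # Use bracket notation so the special characters stay part of the key.
        return "$['" + name + "']"
    return "$." + name
```

For example, `field_to_json_path("f1")` gives `$.f1`, while a name such as `a.b` becomes `$['a.b']` so that it is not read as a nested lookup.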

Still doable, but it is going to be a lot more work to make it happen.

Dec 20 '22 19:12 revans2
