
[FEA] Support json_tuple

Open viadea opened this issue 3 years ago • 1 comments

I wish we could support json_tuple.

from pyspark.sql.functions import *
data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
df = spark.createDataFrame(data, ("key", "jstring"))
df.select(df.key, json_tuple(df.jstring, 'f1', 'f2')).collect()

Not-supported-messages:

  !Exec <GenerateExec> cannot run on GPU because not all expressions can be replaced
    ! <JsonTuple> json_tuple(jstring#1, f1, f2) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.JsonTuple

viadea, Oct 05 '22 17:10

In the short term we could probably implement this in terms of calling get_json_object multiple times. Not sure how well that would scale for very large numbers of objects, as we have to parse the JSON multiple times.
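To make the trade-off concrete, here is a minimal pure-Python sketch of the idea, not the plugin's implementation. The helpers get_json_object_model and json_tuple_model are hypothetical stand-ins that only handle top-level "$.name" paths; they exist to show that one json_tuple call with N fields turns into N independent extractions, each of which parses the JSON string again.

```python
import json


def get_json_object_model(json_str, path):
    """Toy stand-in for get_json_object (hypothetical helper).

    Assumption: only top-level paths of the form '$.name' are handled.
    """
    if not path.startswith("$."):
        raise ValueError("this sketch only supports '$.name' paths")
    try:
        doc = json.loads(json_str)  # the document is parsed on every call
    except (TypeError, ValueError):
        return None  # invalid or missing JSON yields null
    if not isinstance(doc, dict):
        return None
    value = doc.get(path[2:])
    if value is None:
        return None
    # Strings come back as-is; other values are rendered as JSON text.
    return value if isinstance(value, str) else json.dumps(value)


def json_tuple_model(json_str, *fields):
    """Emulate json_tuple with one single-field extraction per field."""
    return tuple(get_json_object_model(json_str, "$." + f) for f in fields)


rows = [("1", '{"f1": "value1", "f2": "value2"}'), ("2", '{"f1": "value12"}')]
result = [(key,) + json_tuple_model(js, "f1", "f2") for key, js in rows]
print(result)  # [('1', 'value1', 'value2'), ('2', 'value12', None)]
```

With two fields the repeated parse is cheap, but the cost grows linearly with the number of requested fields, which is the scaling concern raised above.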

revans2, Oct 05 '22 18:10

I messed up on this. For whatever reason json_tuple is implemented in terms of a generator instead of a normal expression. Because of this we are going to need to do more than just allow JsonTuple to be put onto the GPU. I am not sure how this is going to work, because GpuGenerator is not really set up to deal with an expression like JsonTuple, which can take more than a single parameter. This is going to take a bit more thinking and probably some changes to GpuGenerateExec.

Also, the input field names are not the same as what get_json_object uses. We are probably going to have to make sure that they are literal values, even though json_tuple does not require them to be literals. Then we are going to have to prepend "$." to each name so that it matches the paths correctly, and escape any special JSON path characters within the name of the field.
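A rough sketch of that name-to-path conversion, in plain Python. The helper field_to_json_path is hypothetical, and the escaping rule is an assumption: it presumes the path grammar accepts a quoted bracket form such as $['name'] for names containing characters like "." or "[", and that a single quote inside the name cannot be represented. The real rules would need to be checked against the get_json_object path parser.

```python
import re

# Names made only of these characters are assumed safe after "$.".
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def field_to_json_path(field):
    """Turn a json_tuple field name into a path for get_json_object.

    Hypothetical helper; the quoting rules below are assumptions.
    """
    if not isinstance(field, str):
        raise TypeError("field name must be a literal string")
    if field == "":
        raise ValueError("empty field name")
    if _SIMPLE_NAME.fullmatch(field):
        return "$." + field  # plain name: just prepend "$."
    if "'" in field:
        # Assumption: the quoted form has no escape for a single quote.
        raise ValueError("field name cannot be expressed as a path: %r" % field)
    # Assumption: bracket-quoted form treats '.', '[' and similar literally.
    return "$['" + field + "']"


print(field_to_json_path("f1"))   # $.f1
print(field_to_json_path("a.b"))  # $['a.b']
```

Rejecting names that cannot be expressed, rather than guessing at an escape, would let such cases fall back to the CPU instead of silently returning different results.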

Still doable, but it is going to be a lot more work to make it happen.

revans2, Dec 20 '22 19:12