spark-rapids
[FEA] Support json_tuple
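I wish we could support json_tuple on the GPU. To reproduce:
# Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
from pyspark.sql.functions import json_tuple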
data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
df = spark.createDataFrame(data, ("key", "jstring"))
df.select(df.key, json_tuple(df.jstring, 'f1', 'f2')).collect()
Not-supported-messages:
!Exec <GenerateExec> cannot run on GPU because not all expressions can be replaced
! <JsonTuple> json_tuple(jstring#1, f1, f2) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.JsonTuple
In the short term we could probably implement this in terms of calling get_json_object multiple times. Not sure how well that would scale for very large numbers of objects, as we have to parse the JSON multiple times.
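A rough sketch of what that short-term rewrite could look like from the user side, using the key and jstring columns from the example above. The helper names json_tuple_paths and json_tuple_via_get_json_object are made up for illustration and are not part of Spark or spark-rapids. The sketch assumes the field names are plain identifiers with no JSON path special characters, and it names the outputs c0, c1, ... to mirror json_tuple's default column names. It is not guaranteed to match json_tuple on every edge case (malformed JSON, nested values).

```python
def json_tuple_paths(fields):
    """Map plain field names to the "$.name" paths that get_json_object expects.

    Naive on purpose: assumes the names contain no JSON path special characters.
    """
    return ["$." + name for name in fields]


def json_tuple_via_get_json_object(df, key_col, json_col, *fields):
    """Hypothetical stand-in for json_tuple: one get_json_object call per field.

    Each call may parse the JSON string again, which is the scaling
    concern raised above.
    """
    # Imported lazily so the path helper above can be used without Spark installed.
    from pyspark.sql.functions import col, get_json_object

    extracted = [
        get_json_object(col(json_col), path).alias("c%d" % i)
        for i, path in enumerate(json_tuple_paths(fields))
    ]
    return df.select(col(key_col), *extracted)
```

With the DataFrame from the repro above, this would be called as json_tuple_via_get_json_object(df, "key", "jstring", "f1", "f2").collect().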
I messed up on this. For whatever reason json_tuple is implemented as a generator instead of a normal expression. Because of this we are going to need to do more than just allow JsonTuple to be put onto the GPU. I am not sure how this is going to work, because GpuGenerator is not really set up to deal with an expression like JsonTuple, which can take more than a single parameter. This is going to take a bit more thinking and probably some changes to GpuGenerateExec.
Also, the input field names are not the same as what get_json_object uses. We are probably going to have to make sure that they are literal values, even though json_tuple does not require them to be literals. Then we are going to have to prepend each name with "$." so that it matches the paths correctly, and escape any special JSON path characters within the name of the field.
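To make the name-to-path step concrete, here is a minimal sketch of the conversion. The function name field_to_json_path is hypothetical. The set of characters treated as special here ('.' and '[' in the dotted form, a single quote and '?' in the bracketed form, and a bare '*' as a wildcard) is my assumption about how Spark's JSON path parser behaves and would need to be checked against the actual implementation before relying on it.

```python
def field_to_json_path(name):
    """Turn a json_tuple field name into a get_json_object style path.

    Assumption: "$.name" is safe unless the name contains '.' or '[';
    otherwise fall back to "$['name']", which is assumed unable to hold
    a single quote or '?'. Names that fit neither form are rejected.
    """
    if not name:
        raise ValueError("empty field name")
    if name == "*":
        # Assumed to be read as a wildcard in both forms, not as a field name.
        raise ValueError("field name '*' cannot be expressed as a path")
    if "." not in name and "[" not in name:
        return "$." + name
    if "'" not in name and "?" not in name:
        return "$['" + name + "']"
    raise ValueError("field name %r cannot be expressed as a path" % name)
```

Rejecting names that cannot be expressed, rather than guessing at an escape, keeps the sketch honest: on the plugin side those cases would presumably fall back to the CPU.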
Still doable, but it is going to be a lot more work to make it happen.