spark-rapids icon indicating copy to clipboard operation
spark-rapids copied to clipboard

[BUG] GetJsonObject should return null for invalid query instead of throwing an exception

Open revans2 opened this issue 1 year ago • 8 comments

Describe the bug There are some invalid paths, like in python

f.get_json_object('jsonStr', '''$.'a''').alias('qsqa2'),

which results in the cudf code throwing an exception saying the path is invalid. But the Spark code just returns a null for an invalid path. We should do the same.

ai.rapids.cudf.CudfException: CUDF failure at:/.../json_path.cu:657: Encountered invalid JSONPath input string
	at ai.rapids.cudf.ColumnView.getJSONObject(Native Method)
	at ai.rapids.cudf.ColumnView.getJSONObject(ColumnView.java:2995)

or

f.get_json_object('jsonStr', 'a').alias('just_a'),
ai.rapids.cudf.CudfException: CUDF failure at:/.../json_path.cu:593: Unrecognized JSONPath operator
	at ai.rapids.cudf.ColumnView.getJSONObject(Native Method)
	at ai.rapids.cudf.ColumnView.getJSONObject(ColumnView.java:2995)

revans2 avatar Jan 18 '24 14:01 revans2

This should be addressed by PR 15082 for the cases where JSONPath (query) contains invalid syntax.

SurajAralihalli avatar Feb 27 '24 01:02 SurajAralihalli

Cudf PR 15082 does fix the issue (returns null instead of throwing an error) in the cases where cudf thinks the query has an invalid argument. This means the example stated in the description will not result in cudf exception.

However, an attempt to handle it within the spark-rapids is being made by PR 10466 by handling it before passing to cudf. I'll let authors of PR 10466 close this issue.

SurajAralihalli avatar Mar 01 '24 22:03 SurajAralihalli

@thirtiseven has this issue been verified to be closed by https://github.com/NVIDIA/spark-rapids/pull/10466 ?

cc: @revans2

sameerz avatar Mar 01 '24 23:03 sameerz

@thirtiseven has this issue been verified to be closed by #10466 ?

Thanks for remind. Technically, this issue was fixed by #15082 and #10466, but the test case test_get_json_object_invalid_path still fails, so I didn't notice.

Now the case fails for a different reason: $.'a is a valid path from Spark's check, so it is normalized to $[''a'] and passed to kernel in plugin, which results in

ai.rapids.cudf.CudfException: CUDF failure on: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-690-cuda11/thirdparty/cudf/cpp/src/json/json_path.cu:631: Invalid empty name in JSONPath query string

I'm going to move this case out of test_get_json_object_invalid_path and find/create an issue and test to place it. Before that I think we can keep this issue open for a while.

thirtiseven avatar Mar 04 '24 02:03 thirtiseven

Filed issue for the above case: https://github.com/NVIDIA/spark-rapids/issues/10537

thirtiseven avatar Mar 04 '24 06:03 thirtiseven

The PR #15082 doesn't address the issue with $[''a'], but it does fix the problem with $.'a. The exception "Invalid empty name in JSONPath query string" occurs for $[''a'] both before and after #15082.

This issue remains a limitation of cudf's get_json_object function.

SurajAralihalli avatar Mar 04 '24 06:03 SurajAralihalli

Now new implementation parses path in Spark-Rapids by using Spark code, so we should check the invalid path in Spark-Rapids before calling Kernel code.

res-life avatar Mar 18 '24 08:03 res-life

Will be fixed by: https://github.com/NVIDIA/spark-rapids/pull/10581

res-life avatar Mar 19 '24 01:03 res-life