spark-rapids

[BUG] JsonToStructs fails to parse all empty dicts and invalid lines

Open revans2 opened this issue 1 year ago • 1 comments

Describe the bug

We know that there are some issues in CUDF with parsing empty lines. We first tried to fix this by passing in an empty dictionary, '{}', as a placeholder, but this caused other problems because CUDF is not happy to produce a table with no columns in it. Or, perhaps more accurately, some of our code is not happy with that. We worked around this by adding a column that was requested and setting it to null. That works for empty lines, but it fails when every line is {} or [], or when every line has something in it but is invalid.

We really should just fix the underlying problem instead of trying to work around it. This also exists in ScanJson, but I have not formally added a test for it yet.
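The issue does not include a reproduction, so the following is only a sketch of the three input shapes described above: every line is `{}`, every line is `[]`, or every line is non-empty but invalid. The names `EDGE_CASES`, `classify`, `is_uniform_edge_case`, and `run_from_json`, the sample lines, and the `a INT` schema are all assumptions made for illustration. The classification uses Python's own `json` module, which is not guaranteed to agree with Spark or CUDF on what counts as invalid.

```python
import json

# Hypothetical inputs, one list per situation described in the issue.
EDGE_CASES = {
    "all_empty_dicts": ["{}", "{}", "{}"],
    "all_empty_arrays": ["[]", "[]", "[]"],
    "all_invalid": ['{"a": 1', '{"a" 2}', "not json"],
}


def classify(line):
    """Label one line as 'empty_dict', 'empty_array', 'invalid', or 'other'."""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return "invalid"
    if isinstance(value, dict) and not value:
        return "empty_dict"
    if isinstance(value, list) and not value:
        return "empty_array"
    return "other"


def is_uniform_edge_case(lines):
    """True when every line falls into the same problematic category."""
    kinds = {classify(line) for line in lines}
    return len(kinds) == 1 and kinds != {"other"}


def run_from_json(spark, lines, schema="a INT"):
    """Apply from_json to the lines. Needs a live SparkSession; the schema is assumed."""
    # Imported lazily so the rest of the sketch loads without Spark installed.
    from pyspark.sql import functions as F

    df = spark.createDataFrame([(line,) for line in lines], "json STRING")
    return df.select(F.from_json("json", schema).alias("parsed")).collect()
```

Running `run_from_json` on each entry of `EDGE_CASES`, once with the plugin enabled and once on plain CPU Spark, and comparing the collected rows would be one way to check whether the two still disagree.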

revans2 avatar Feb 23 '24 15:02 revans2

Currently this is throwing a NullPointerException in CUDF on the Java side. I think we can probably fix it without too much trouble.

revans2 avatar Mar 14 '24 16:03 revans2

This appears to have been fixed. All of the tests pass.

revans2 avatar Oct 18 '24 14:10 revans2