[BUG] Reading JSON file saved from Series fails
Describe the bug
Reading a JSON file created by a Series .to_json(path, orient='records', lines=True) call fails with an "Input data is not a valid JSON file" error when read with .read_json(path, orient='records', lines=True) (with or without the engine='cudf' parameter).
NOTE: this is only a problem when saving Series in this format -- DataFrame objects are saved properly.
Steps/Code to reproduce bug
import cudf
cudf.Series([1,2,3,4,5]).to_json('sample.json', lines=True, orient='records')
cudf.read_json('sample.json', lines=True, orient='records')
The output file looks as follows:
1
2
3
4
5
Expected behavior
The file is read back properly and produces a valid cudf.Series object.
Environment overview (please complete the following information)
- Environment location: Docker
- Method of cuDF install: Docker
- RAPIDS v0.16, pulled from nightly.
We currently use the pandas JSON writer, and this output isn't a valid JSON file. pandas can still read it back because read_json has a dedicated parameter (typ) that tells it what kind of object the file contents represent. For example:
>>> ser.to_json('s')
>>> ser
0 1
1 2
2 3
3 4
Name: a, dtype: int64
>>> buf = ser.to_json()
>>> buf
'{"0":1,"1":2,"2":3,"3":4}'
>>> pd.read_json(buf, typ='series')
0 1
1 2
2 3
3 4
dtype: int64
>>> buf = ser.to_json(orient='records', lines=True)
>>> buf
'1\n2\n3\n4'
>>> pd.read_json(buf, typ='series', orient='records', lines=True)
0 1
1 2
2 3
3 4
dtype: int64
@kkraus14 I think we should support a similar param (typ) in cudf as well.
But it also raises the question of whether '1\n2\n3\n4' should be considered valid JSON and, if not, what a valid form would be in this case.
> @kkraus14 I think we should support similar param (typ) in cudf as well.
If there were other reasonable JSON formats that could be supported then sure, but JSON Lines is generally the most scalable and common one, so I'm not sure how much sense it makes to support other formats.
> But it also raises the question if '1\n2\n3\n4' should be considered a valid json, if not what could be a valid form in this case.
This is definitely not valid JSON. Could you raise an issue on Pandas if one doesn't already exist?
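To make the distinction concrete, here is a minimal sketch using only Python's standard json module (not cudf or pandas, whose parsers may behave differently). It shows that the text is rejected when parsed as one JSON document, while each line parses as a JSON value on its own.

```python
import json

# The string returned by ser.to_json(orient='records', lines=True) earlier in the thread.
text = "1\n2\n3\n4"

# Parsed as a single document, the text is rejected: more data follows the first value.
try:
    json.loads(text)
    whole_document_ok = True
except json.JSONDecodeError:
    whole_document_ok = False

# Parsed line by line, every line is a valid JSON value (a bare number).
per_line = [json.loads(line) for line in text.splitlines()]

print(whole_document_ok, per_line)
```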
Cross-linking pandas issue: https://github.com/pandas-dev/pandas/issues/37100
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
@galipremsagar @mroeschke WDYT of the status of this now? The last comment on the associated pandas issue was from Matt, so deferring to you two to decide if there's any action item here still.
Yeah, I think the OP exhibits the intended behavior, as the output isn't necessarily valid JSON (albeit correctly formatted output), so I think this is a won't-fix.
Hmm is that contradicting your last statement on the associated pandas issue? Is there a behavior change needed on the pandas end, then? IIUC pandas is allowing this input through. Unless you're suggesting that it's OK for cudf not to support it since it's not valid JSON, but pandas will continue to do so because it already does, in which case we should consider the implications of the divergence for cudf.pandas.
Sorry, I should have clarified that the to_json behavior that both cudf and pandas currently share is the intended behavior. But yes, I suppose cudf.read_json should be able to round-trip that output like pandas.read_json does.
In [1]: import cudf, io
In [3]: import pandas as pd
In [8]: cudf.read_json(io.StringIO(cudf.Series([1,2,3,4,5]).to_json(lines=True, orient='records')), lines=True, orient="records")
RuntimeError: CUDF failure at: cudf/cpp/src/io/json/json_tree.cu:272: JSON Parser encountered an invalid format at location 2
In [9]: pd.read_json(io.StringIO(pd.Series([1,2,3,4,5]).to_json(lines=True, orient='records')), lines=True, orient="records")
Out[9]:
0
0 1
1 2
2 3
3 4
4 5
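Until the round trip works, one possible workaround is to rewrite the bare-scalar lines as one single-key object per line, which is the shape of DataFrame output that the original report says is handled properly. The sketch below uses only the standard library. The helper name scalars_to_records and the column name "0" are hypothetical, chosen to mirror the pandas output above, and whether cudf.read_json accepts the result has not been verified here.

```python
import json


def scalars_to_records(text, column="0"):
    """Rewrite bare-scalar JSON Lines as one single-key object per line.

    `column` is a hypothetical column name chosen for illustration.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(json.dumps({column: json.loads(line)}) for line in lines)


# Same content as the Series output in the reproducer.
records = scalars_to_records("1\n2\n3\n4\n5")
print(records)
```

The resulting string could then be wrapped in io.StringIO and passed to cudf.read_json(..., lines=True, orient="records"), on the assumption that it is parsed the same way as DataFrame-produced JSON Lines.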