clumper icon indicating copy to clipboard operation
clumper copied to clipboard

read_jsonl method : Importing file just renamed from .json to .jsonl is allowed

Open samarpan-rai opened this issue 4 years ago • 13 comments

The user can read in .json file by just renaming it to .jsonl. With the current code, Clumper would parse it one line as a big dictionary. Unexpected behaviour will happen during analysis.

To reproduce:

  1. Rename pokemon.json to pokemon.jsonl (any json file really)
  2. Read it and load to Clumper
from clumper import Clumper
wrongly_parsed = Clumper.read_jsonl("pokemon.jsonl")
  1. You can see that the len returns 1
print(len(wrongly_parsed))

I couldn't find an elegant solution on how to verify if the file being read is actually JSONL apart from looking at its extension. Any suggestion is welcome.

samarpan-rai avatar Aug 17 '20 07:08 samarpan-rai

If you save a file that is .jsonl compatible with the name data.foobar then our read_jsonl("data.foobar") should still be able to work even if the extension is not aligned. The extension of a file does not guarantee the contents of it. This is the responsibility of the end-user, not our library.

We could try to perhaps add some warnings or check if each line that is read in read_jsonl is indeed valid json. Adding a checking mechanism or running a linter before loading seems like a fine feature.

koaning avatar Aug 17 '20 12:08 koaning

I just looked up the spec and it does seem like we're allowed to throw an error here. Each line on it's own needs to be valid json and technically that's not the case in this example.

OK. So a linter/checking mechanism per line sounds useful.

koaning avatar Aug 17 '20 12:08 koaning

I see that you removed by already removed by assert statemtns for checking the file extension. Then we don't have to exactly worry about it.

This linting/checking seems to be already handled by json.loads function as written in the documentation : "If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised." So in theory, if it returns JSONDecodeError, it will be captured by the try-except block and return it along with the RuntimeError. Do you think that is enough ?

samarpan-rai avatar Aug 17 '20 14:08 samarpan-rai

That sounds like it is worth a try. I don't know if the json module also offers something for jsonlines tho.

koaning avatar Aug 17 '20 17:08 koaning

Maybe we misunderstood each other. My current implementation imports renamed jsonlines in 1 line. It does it because the file reader just reads one line as there is no new line in the file. The json.dump function thinks it's a valid json because it is indeed a valid json. As far as I understand, jsonlines is just json separated by line separator.

samarpan-rai avatar Aug 17 '20 18:08 samarpan-rai

What do you mean with "imports renamed jsonlines"?

koaning avatar Aug 17 '20 18:08 koaning

I meant any .json file renamed to .jsonl. However, this particular aspect doesn't matter because you've removed the assert statement which checked for it.

The part that also worries me is that the user can provide json file to read_jsonl and Clumper parses it without complaining it. This could result in unexpected behaviour during analysis.

Furthermore, read_json and read_jsonl look very similar so I already see even me mistaking one for the other late at night. Instead we could take the approach panda's read_json takes and add another parameter lines where True means we want to read a jsonl file.

samarpan-rai avatar Aug 17 '20 19:08 samarpan-rai

There is some confusion that might happen there because json and jsonl are similar. That's true. For now I'm ok with the API as is since the documentation clearly lists the three readers. If we get a lot of issues on GitHub suggesting that it's really intimidating people then I might switch opinion.

I think your point on checking each line to double check that we're indeed dealing with a jsonl file instead of json one is valid though. That allows us to write a meaningful error message too. Something like "this looks more like a json file, are you sure you want to treat it as jsonl"?

koaning avatar Aug 18 '20 14:08 koaning

Should this be a Runtime error or warning?

samarpan-rai avatar Sep 15 '20 05:09 samarpan-rai

My check is based on the assumption that JSONL files are usually longer than 1 line. It will raise suspicion if the number of lines in the file is just 1. It's simple but hopefully not too simple. The only caveat is that it would have to read in the entire file to know this as there is no easy way (that I know of) to know the number of lines programmatically for both local and cloud-based files during each read which could slow users down

samarpan-rai avatar Sep 15 '20 05:09 samarpan-rai

I fear that that check is incomplete. Can't we do a check like;

  • is it valid json on each line? -> jsonl is legal
  • is there a single line of data and is it valid json? -> jsonl/json both legal
  • is it not valid json on each line but the file is legal json -> jsonl is not legal

koaning avatar Sep 15 '20 06:09 koaning

I fear that that check is incomplete. Can't we do a check like;

* is it valid json on each line? -> jsonl is legal

This is accomplishable.

* is there a single line of data and is it valid json? -> jsonl/json both legal

Saying this is legal would mean that json files can be read in as jsonl

* is it not valid json on each line but the file is legal json -> jsonl is not legal

what does it mean to say that a file is legal json?

samarpan-rai avatar Sep 16 '20 12:09 samarpan-rai

Saying this is legal would mean that json files can be read in as jsonl

Yes! But then again, if it is allowed in the spec either way then it's the user's responsibility to use the correct call, no?

what does it mean to say that a file is legal json?

I mean to say "we should throw an error".

koaning avatar Sep 16 '20 13:09 koaning