read_jsonl method: Importing a file just renamed from .json to .jsonl is allowed
A user can read in a .json file simply by renaming it to .jsonl. With the current code, Clumper parses that single line as one big dictionary, which leads to unexpected behaviour during analysis.
To reproduce:

- Rename `pokemon.json` to `pokemon.jsonl` (any JSON file really)
- Read it and load it into `Clumper`:

```python
from clumper import Clumper

wrongly_parsed = Clumper.read_jsonl("pokemon.jsonl")
```

- You can see that `len` returns 1:

```python
print(len(wrongly_parsed))
```
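For illustration, here is a minimal standalone sketch of what happens when a one-line JSON document is read line by line. This uses the plain `json` module rather than clumper's internals, and assumes the renamed file sits on disk as `pokemon.jsonl`:

```python
import json

# pokemon.json is a single-line JSON document renamed to pokemon.jsonl.
# A line-based reader sees exactly one line, so json.loads turns the
# whole document into a single "record".
with open("pokemon.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))  # 1 -> the entire file became one big dictionary/list
```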
I couldn't find an elegant solution for verifying whether the file being read is actually JSONL, apart from looking at its extension. Any suggestion is welcome.
If you save a .jsonl-compatible file under the name `data.foobar`, then our `read_jsonl("data.foobar")` should still work even though the extension is not aligned. The extension of a file does not guarantee its contents; that is the responsibility of the end user, not our library.
We could perhaps add some warnings, or check that each line read in `read_jsonl` is indeed valid JSON. Adding a checking mechanism or running a linter before loading seems like a fine feature.
I just looked up the spec and it does seem like we're allowed to throw an error here. Each line on its own needs to be valid JSON, and technically that's not the case in this example.
OK. So a linter/checking mechanism per line sounds useful.
I see that you already removed my assert statements for checking the file extension. Then we don't exactly have to worry about that part.
This linting/checking seems to be handled already by the `json.loads` function, as stated in its documentation: "If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised." So in theory, if a `JSONDecodeError` is raised, it will be captured by the try-except block and re-raised along with the `RuntimeError`. Do you think that is enough?
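For concreteness, a rough sketch of that per-line check; the helper name and structure here are illustrative assumptions, not clumper's actual internals:

```python
import json

def _parse_jsonl_lines(lines):
    """Parse an iterable of text lines as JSONL, one record per line."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # allow blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Re-raise with context so the user knows which line is broken.
            raise RuntimeError(f"line {line_no} is not valid JSON: {exc}") from exc
    return records
```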
That sounds like it is worth a try. I don't know if the `json` module also offers something for jsonlines, though.
Maybe we misunderstood each other. My current implementation imports renamed jsonlines as a single line. It does so because the file reader reads just one line, since there is no newline in the file. The `json.loads` function considers it valid JSON because it is indeed valid JSON. As far as I understand, jsonlines is just JSON separated by line separators.
What do you mean by "imports renamed jsonlines"?
I meant any .json file renamed to .jsonl. However, this particular aspect doesn't matter because you've removed the assert statement which checked for it.
The part that also worries me is that the user can provide a JSON file to `read_jsonl` and Clumper parses it without complaining. This could result in unexpected behaviour during analysis.
Furthermore, `read_json` and `read_jsonl` look very similar, so I can already see even myself mistaking one for the other late at night. Instead, we could take the approach pandas' `read_json` takes and add another parameter, `lines`, where `True` means we want to read a jsonl file.
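For reference, this is the pandas behaviour being referred to: `pandas.read_json` accepts a `lines=True` flag that switches it into jsonlines mode. The clumper call shown alongside it is hypothetical, just to visualise the proposal:

```python
import pandas as pd

df_jsonl = pd.read_json("pokemon.jsonl", lines=True)  # read as jsonlines
df_json = pd.read_json("pokemon.json")                # read as regular JSON

# Hypothetical clumper equivalent of the proposal (not an existing API):
# clump = Clumper.read_json("pokemon.jsonl", lines=True)
```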
There is some confusion that might happen there because `json` and `jsonl` are similar, that's true. For now I'm OK with the API as is, since the documentation clearly lists the three readers. If we get a lot of issues on GitHub suggesting that it's really intimidating people, then I might switch my opinion.
I do think your point about checking each line, to make sure that we're indeed dealing with a `jsonl` file instead of a `json` one, is valid though. That also allows us to write a meaningful error message, something like: "this looks more like a json file, are you sure you want to treat it as jsonl?"
Should this be a RuntimeError or a warning?
My check is based on the assumption that JSONL files are usually longer than one line. It raises suspicion if the number of lines in the file is just one. It's simple, but hopefully not too simple. The only caveat is that it would have to read in the entire file to know this: there is no easy way (that I know of) to know the number of lines programmatically for both local and cloud-based files during each read, which could slow users down.
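Written out, the heuristic would look something like this sketch (illustrative only; as noted above, it has to read the whole file to count lines):

```python
import warnings

def _suspiciously_single_line(path):
    """Count non-empty lines; a single line is suspicious for a .jsonl file."""
    with open(path) as f:
        n_lines = sum(1 for line in f if line.strip())
    return n_lines <= 1

if _suspiciously_single_line("pokemon.jsonl"):
    warnings.warn("This file has only one line; it may be json rather than jsonl.")
```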
I fear that that check is incomplete. Can't we do a check like:

- Is it valid JSON on each line? -> jsonl is legal.
- Is there a single line of data and is it valid JSON? -> jsonl/json are both legal.
- Is it not valid JSON on each line, but the file is legal JSON? -> jsonl is not legal.
> I fear that that check is incomplete. Can't we do a check like:
>
> - Is it valid JSON on each line? -> jsonl is legal.

This is accomplishable.

> - Is there a single line of data and is it valid JSON? -> jsonl/json are both legal.

Saying this is legal would mean that JSON files can be read in as jsonl.

> - Is it not valid JSON on each line, but the file is legal JSON? -> jsonl is not legal.

What does it mean to say that a file is legal JSON?
> Saying this is legal would mean that JSON files can be read in as jsonl.

Yes! But then again, if it is allowed in the spec either way, then it's the user's responsibility to use the correct call, no?

> What does it mean to say that a file is legal JSON?

I mean to say "we should throw an error".
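Putting the outcome of this discussion together, a hedged sketch of what the combined check could look like; the helper name is made up for illustration and this is not clumper's actual code:

```python
import json

def _check_jsonl(text):
    """Classify raw file contents according to the cases discussed above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    def _is_valid_json(s):
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False

    # Case 1 (and 2): every line is valid JSON on its own -> legal jsonl.
    # A single-line valid JSON document also passes, which the spec allows.
    if lines and all(_is_valid_json(ln) for ln in lines):
        return

    # Case 3: lines are not individually valid JSON, but the whole file is
    # valid JSON -> this is a plain json file, so we throw an error.
    if _is_valid_json(text):
        raise RuntimeError(
            "This looks more like a json file, are you sure you want to treat it as jsonl?"
        )

    # Anything else is simply broken input.
    raise RuntimeError("File is neither valid jsonl nor valid json.")
```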