read_jsonl method: Importing a file just renamed from .json to .jsonl is allowed
A user can read in a .json file simply by renaming it to .jsonl. With the current code, Clumper parses that single line as one big dictionary, which leads to unexpected behaviour during analysis.
To reproduce:

- Rename `pokemon.json` to `pokemon.jsonl` (any JSON file really)
- Read it and load it into `Clumper`:

```python
from clumper import Clumper

wrongly_parsed = Clumper.read_jsonl("pokemon.jsonl")
```

- You can see that `len` returns 1:

```python
print(len(wrongly_parsed))
```
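For illustration, here is a minimal standalone sketch of what happens when a one-line JSON document is read line by line. This uses the plain `json` module rather than clumper's internals, and assumes the renamed file sits on disk as `pokemon.jsonl`:

```python
import json

# pokemon.json is a single-line JSON document renamed to pokemon.jsonl.
# A line-based reader sees exactly one line, so json.loads turns the
# whole document into a single "record".
with open("pokemon.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))  # 1 -> the entire file became one big dictionary/list
```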
I couldn't find an elegant solution for verifying whether the file being read is actually JSONL, apart from looking at its extension. Any suggestion is welcome.
If you save a .jsonl-compatible file under the name `data.foobar`, then our `read_jsonl("data.foobar")` should still work even though the extension is not aligned. The extension of a file does not guarantee its contents; that is the responsibility of the end user, not our library.
We could perhaps add some warnings, or check that each line read in `read_jsonl` is indeed valid JSON. Adding a checking mechanism or running a linter before loading seems like a fine feature.
I just looked up the spec and it does seem like we're allowed to throw an error here. Each line on its own needs to be valid JSON, and technically that's not the case in this example.
OK. So a linter/checking mechanism per line sounds useful.
I see that you already removed my assert statements for checking the file extension. Then we don't exactly have to worry about that part.
This linting/checking seems to be handled already by the `json.loads` function, as stated in its documentation: "If the data being deserialized is not a valid JSON document, a JSONDecodeError will be raised." So in theory, if a `JSONDecodeError` is raised, it will be captured by the try-except block and re-raised along with the `RuntimeError`. Do you think that is enough?
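For concreteness, a rough sketch of that per-line check; the helper name and structure here are illustrative assumptions, not clumper's actual internals:

```python
import json

def _parse_jsonl_lines(lines):
    """Parse an iterable of text lines as JSONL, one record per line."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # allow blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Re-raise with context so the user knows which line is broken.
            raise RuntimeError(f"line {line_no} is not valid JSON: {exc}") from exc
    return records
```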
That sounds like it is worth a try. I don't know if the `json` module also offers something for jsonlines, though.
Maybe we misunderstood each other. My current implementation imports renamed jsonlines as a single line. It does so because the file reader reads just one line, since there is no newline in the file. The `json.loads` function considers it valid JSON because it is indeed valid JSON. As far as I understand, jsonlines is just JSON separated by line separators.
What do you mean by "imports renamed jsonlines"?
I meant any .json file renamed to .jsonl. However, this particular aspect doesn't matter because you've removed the assert statement which checked for it.
The part that also worries me is that the user can provide a JSON file to `read_jsonl` and Clumper parses it without complaining. This could result in unexpected behaviour during analysis.
Furthermore, `read_json` and `read_jsonl` look very similar, so I can already see even myself mistaking one for the other late at night. Instead, we could take the approach pandas' `read_json` takes and add another parameter, `lines`, where `True` means we want to read a jsonl file.
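For reference, this is the pandas behaviour being referred to: `pandas.read_json` accepts a `lines=True` flag that switches it into jsonlines mode. The clumper call shown alongside it is hypothetical, just to visualise the proposal:

```python
import pandas as pd

df_jsonl = pd.read_json("pokemon.jsonl", lines=True)  # read as jsonlines
df_json = pd.read_json("pokemon.json")                # read as regular JSON

# Hypothetical clumper equivalent of the proposal (not an existing API):
# clump = Clumper.read_json("pokemon.jsonl", lines=True)
```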
There is some confusion that might happen there because `json` and `jsonl` are similar, that's true. For now I'm OK with the API as is, since the documentation clearly lists the three readers. If we get a lot of issues on GitHub suggesting that it's really intimidating people, then I might switch my opinion.
I do think your point about checking each line, to make sure that we're indeed dealing with a `jsonl` file instead of a `json` one, is valid though. That also allows us to write a meaningful error message, something like: "this looks more like a json file, are you sure you want to treat it as jsonl?"
Should this be a RuntimeError or a warning?
My check is based on the assumption that JSONL files are usually longer than one line. It raises suspicion if the number of lines in the file is just one. It's simple, but hopefully not too simple. The only caveat is that it would have to read in the entire file to know this: there is no easy way (that I know of) to know the number of lines programmatically for both local and cloud-based files during each read, which could slow users down.
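Written out, the heuristic would look something like this sketch (illustrative only; as noted above, it has to read the whole file to count lines):

```python
import warnings

def _suspiciously_single_line(path):
    """Count non-empty lines; a single line is suspicious for a .jsonl file."""
    with open(path) as f:
        n_lines = sum(1 for line in f if line.strip())
    return n_lines <= 1

if _suspiciously_single_line("pokemon.jsonl"):
    warnings.warn("This file has only one line; it may be json rather than jsonl.")
```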
I fear that that check is incomplete. Can't we do a check like:

- Is it valid JSON on each line? -> jsonl is legal.
- Is there a single line of data and is it valid JSON? -> jsonl/json are both legal.
- Is it not valid JSON on each line, but the file is legal JSON? -> jsonl is not legal.
> I fear that that check is incomplete. Can't we do a check like:
>
> - Is it valid JSON on each line? -> jsonl is legal.

This is accomplishable.

> - Is there a single line of data and is it valid JSON? -> jsonl/json are both legal.

Saying this is legal would mean that JSON files can be read in as jsonl.

> - Is it not valid JSON on each line, but the file is legal JSON? -> jsonl is not legal.

What does it mean to say that a file is legal JSON?
> Saying this is legal would mean that JSON files can be read in as jsonl.

Yes! But then again, if it is allowed in the spec either way, then it's the user's responsibility to use the correct call, no?

> What does it mean to say that a file is legal JSON?

I mean to say "we should throw an error".
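Putting the outcome of this discussion together, a hedged sketch of what the combined check could look like; the helper name is made up for illustration and this is not clumper's actual code:

```python
import json

def _check_jsonl(text):
    """Classify raw file contents according to the cases discussed above."""
    lines = [ln for ln in text.splitlines() if ln.strip()]

    def _is_valid_json(s):
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False

    # Case 1 (and 2): every line is valid JSON on its own -> legal jsonl.
    # A single-line valid JSON document also passes, which the spec allows.
    if lines and all(_is_valid_json(ln) for ln in lines):
        return

    # Case 3: lines are not individually valid JSON, but the whole file is
    # valid JSON -> this is a plain json file, so we throw an error.
    if _is_valid_json(text):
        raise RuntimeError(
            "This looks more like a json file, are you sure you want to treat it as jsonl?"
        )

    # Anything else is simply broken input.
    raise RuntimeError("File is neither valid jsonl nor valid json.")
```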