Support JSON lines files

Open ivbeg opened this issue 2 years ago • 8 comments

Please add support for JSON Lines files (https://jsonlines.org/). Many such files are published and in use. Some of them are huge and hard to convert to JSON.

ivbeg avatar Apr 18 '22 13:04 ivbeg

Fantastic idea! No timeline yet on implementation, but definitely a very useful feature. I've run into this myself :)

evinism avatar Apr 19 '22 00:04 evinism

Actually @ivbeg, would you be able to describe your ideal interface for such a feature? Would the program run the query over each JSON line individually, or treat the whole file as one large array?

evinism avatar Apr 19 '22 04:04 evinism

@evinism It would be great to support both ways of processing JSON Lines files, but streaming is more important, since some JSON Lines files are huge: 100GB+ compressed. I can provide several examples from public datasets if needed. It's nearly impossible to process such files as one large array.
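To make the two modes concrete, here is a minimal sketch in plain Python. The names query_per_line, query_as_array and run_query are hypothetical and not part of MistQL or undatum; run_query stands in for whatever evaluates a query against parsed JSON.

```python
import io
import json
from typing import Any, Callable, Iterable, Iterator


def query_per_line(lines: Iterable[str], run_query: Callable[[Any], Any]) -> Iterator[Any]:
    """Streaming mode: parse and query one record at a time."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between records
        yield run_query(json.loads(line))


def query_as_array(lines: Iterable[str], run_query: Callable[[Any], Any]) -> Any:
    """Array mode: load every record into memory, then query the list once."""
    records = [json.loads(line) for line in lines if line.strip()]
    return run_query(records)


# Hypothetical stand-ins for real queries, run against an in-memory sample.
sample = io.StringIO('{"foo": 1}\n{"foo": 2}\n')
per_line = list(query_per_line(sample, lambda record: record["foo"]))
sample.seek(0)
as_array = query_as_array(sample, lambda records: len(records))
```

The streaming version holds only one record in memory at a time, which is what matters for very large files. The array version has to build the full list first, but it allows queries that span records, such as counts or sorts.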

I've developed a command-line tool, undatum (https://github.com/datacoon/undatum), that supports data processing and conversion of JSON Lines and BSON files. BSON is a binary format used by the MongoDB NoSQL database, and it is very similar to JSON Lines. So I would like to integrate a query language into undatum and use it with the data processing and conversion operations. I've already used dictquery (https://github.com/cyberlis/dictquery), but it's only good for filtering.

ivbeg avatar Apr 19 '22 08:04 ivbeg

Streaming mode for processing JSONL sounds right to me too. Not sure when I'll get to this, but it's definitely something I want to tackle.

evinism avatar Apr 19 '22 18:04 evinism

@evinism I've added experimental support for MistQL to undatum. It's available in main (https://github.com/datacoon/undatum), version 1.0.13, through the command "undatum query -q <yourquery> <filename>", where the file can be CSV, JSONL or BSON.

I hope it helps.

ivbeg avatar Apr 20 '22 10:04 ivbeg

Adding @ilan-pinto to this thread. For now, let's work on getting this up and running in Python.

evinism avatar Jun 02 '22 06:06 evinism

Hi, please assign it to me.

ilan-pinto avatar Jun 02 '22 06:06 ilan-pinto

For reference, a possible interface for this feature could look like this:

tail file.log | python -m mistql.cli foo.bar --lines > processed.jsonl

Note that the query runs in a streaming manner: for each JSON line in file.log, the CLI writes the query result for that line to processed.jsonl.
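The per-line behaviour described above could be sketched roughly as follows. This is not the actual CLI implementation: stream_query and get_foo_bar are hypothetical names, and get_foo_bar is a plain-Python stand-in for evaluating the query foo.bar against one record.

```python
import io
import json
from typing import Any, Callable, TextIO


def stream_query(run_query: Callable[[Any], Any], source: TextIO, sink: TextIO) -> int:
    """Read JSON Lines from source and write one JSON result per line to sink.

    Returns the number of records processed.
    """
    count = 0
    for line in source:
        if not line.strip():
            continue  # skip blank lines
        result = run_query(json.loads(line))
        sink.write(json.dumps(result) + "\n")
        count += 1
    return count


def get_foo_bar(record: Any) -> Any:
    # Hypothetical stand-in for evaluating the query `foo.bar` on one record.
    return record["foo"]["bar"]


# In-memory demonstration; a real CLI would pass sys.stdin and sys.stdout.
source = io.StringIO('{"foo": {"bar": 1}}\n\n{"foo": {"bar": "x"}}\n')
sink = io.StringIO()
processed = stream_query(get_foo_bar, source, sink)
```

Because each result is written as soon as its line is read, this shape works with an unbounded input such as the output of tail -f, and memory use stays constant regardless of file size.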

evinism avatar Jun 03 '22 10:06 evinism