
in2csv: Support for large JSON files

Open dhulke opened this issue 6 years ago • 2 comments

Recently I tried converting a 1 GB JSON file to CSV so I could import it into SQLite and run a few queries for data analysis. It turns out in2csv crashes with a MemoryError. So I wrote a quick JSON stream parser, somewhat like a SAX parser, and generated my own CSV (I hardcoded the total number of columns). My question is: are you planning on adding support for a streaming JSON parser so we could transform JSON to CSV without a limit on file size?
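
For what it's worth, that hand-rolled approach can be sketched with an off-the-shelf iterative parser such as ijson (not part of csvkit; the file name and column list below are made up for illustration):

```python
import csv
import ijson  # SAX-style iterative JSON parser; stand-in for the hand-rolled one

COLUMNS = ["id", "name", "value"]   # hypothetical, hardcoded column list

with open("big.json", "rb") as src, open("big.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(COLUMNS)
    # ijson.items() yields one element of the top-level array at a time,
    # so the 1 GB file never has to fit in memory.
    for record in ijson.items(src, "item"):
        writer.writerow([record.get(c, "") for c in COLUMNS])
```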

If so, I could maybe try to implement that. I saw that today agate.Table.from_json loads the entire thing into memory, and agate.Table.to_csv actually needs the entire thing in memory to loop over it calling writer.writerow(). I initially thought about changing agate.Table.from_object (what from_json returns) to return a generator that would stream-parse the JSON, but I'm not sure that would work, as the total number of columns could vary halfway through the JSON. I even thought about skipping the header, parsing the entire file, and then rewriting the header at the end, but then I would have to fseek to every row with the wrong number of columns and add commas and nulls.

Do you guys have any ideas/plans on how to go about this?

Edit: What I actually thought about doing was adding SQLite support to agate and stream-parsing the JSON straight into SQLite; as I discover new columns, I'd run a quick ALTER TABLE adding the column with a specified default value and move on. At the end, change agate.Table.to_csv to call a SQLite export instead. That could work and would be somewhat faster than fseeking my way around, but it sounds like a bit of a hack.
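
A rough sketch of that SQLite idea, again using ijson for the streaming parse. The table and file names, and the blanket str() coercion, are simplifications; this is not how agate or csvkit currently work:

```python
import csv
import sqlite3
import ijson

conn = sqlite3.connect("scratch.db")
columns = []  # column names in the order they were discovered

def ensure_columns(keys):
    """Create the table on first use; ALTER TABLE for each new key after that.
    Column names are quoted but not escaped robustly -- sketch only."""
    for key in keys:
        if key in columns:
            continue
        if not columns:
            conn.execute(f'CREATE TABLE data ("{key}" TEXT)')
        else:
            conn.execute(f'ALTER TABLE data ADD COLUMN "{key}" TEXT DEFAULT NULL')
        columns.append(key)

with open("big.json", "rb") as src:
    for record in ijson.items(src, "item"):
        ensure_columns(record.keys())
        names = ", ".join(f'"{k}"' for k in record)
        placeholders = ", ".join("?" for _ in record)
        conn.execute(f"INSERT INTO data ({names}) VALUES ({placeholders})",
                     [None if v is None else str(v) for v in record.values()])
conn.commit()

# Rows inserted before a column existed simply come back as NULL,
# so there is no fseek-and-patch dance at the end.
with open("big.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(columns)
    writer.writerows(conn.execute("SELECT * FROM data"))
```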

dhulke avatar Apr 28 '18 19:04 dhulke

Streaming JSON has the issues you mention (like unpredictable headers), but I'd be happy to merge some solution. The non-agate reader/writers for conversion are in https://github.com/wireservice/csvkit/tree/master/csvkit/convert

jpmckinney avatar May 20 '18 16:05 jpmckinney

I have the same issue writing large `df.to_json` output.

mtparagon5 avatar May 07 '19 18:05 mtparagon5

Noting that, even at the time of this discussion, in2csv would stream newline-delimited JSON if `--format ndjson` and `--no-inference` were set. This issue affects other JSON texts.
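
For reference, that ndjson path looks something like this (file names are placeholders):

```sh
# Streams one record at a time when both flags are set.
in2csv --format ndjson --no-inference data.ndjson > data.csv
```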


> I'm not sure that would work, as the total number of columns could vary halfway through the JSON

The typical solution is to read the input twice. Of course, you can't read standard input twice. In that case, you would have to store data on disk, in whatever format you prefer (plain text, ZODB, DuckDB, LevelDB, etc. – I've seen many solutions).
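
For a seekable file, the two-pass idea might look roughly like this (again using ijson purely for illustration; none of this is in csvkit):

```python
import csv
import ijson

header, seen = [], set()

# Pass 1: discover the full set of columns without keeping any rows around.
with open("big.json", "rb") as src:
    for record in ijson.items(src, "item"):
        for key in record:
            if key not in seen:
                seen.add(key)
                header.append(key)

# Pass 2: write every row against the now-complete header.
with open("big.json", "rb") as src, open("big.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=header, restval="")
    writer.writeheader()
    for record in ijson.items(src, "item"):
        writer.writerow({k: record.get(k, "") for k in header})
```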


Outside newline-delimited JSON, you would need an iterative JSON parser like YAJL to yield each item of an array, or each JSON text in concatenated JSON (newline-delimited JSON is a special case of concatenated JSON).
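
In Python, ijson (which can be backed by YAJL) is the usual choice; if I remember correctly, its multiple_values option covers the concatenated-JSON case:

```python
import ijson

# Prefix "" yields each top-level JSON text in a concatenated stream;
# ndjson is just the case where those texts are separated by newlines.
with open("concatenated.json", "rb") as src:
    for doc in ijson.items(src, "", multiple_values=True):
        print(doc)
```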

Installing YAJL is not straightforward for many users, and bundling YAJL into csvkit would be a big lift (csvkit is pure Python and runs on all major operating systems).

Anyway, in short, the constraints/scope of csvkit mean that large JSON files (other than newline-delimited JSON files) are not supported (unless you happen to have enough memory).

jpmckinney avatar Oct 18 '23 02:10 jpmckinney