json2parquet Do less copying/passes through the data


Open andrewgross opened this issue 6 years ago • 3 comments

This will never be perfect in Python, but right now it's pretty disgusting what we do when we load JSON and convert it to a columnar format.

Sep 13 '17 03:09 andrewgross

You are reading the JSON file twice (once in _convert_data_without_schema and again in _convert_data_with_column_names), but only when the user didn't provide a schema.

I would like to work on this. I think both functions could be merged into a single function that extracts the column names and, at the same time, assigns the values to the corresponding columns.
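A rough sketch of the idea, not the actual json2parquet code: the function name `convert_single_pass` is hypothetical, and it assumes newline-delimited JSON where each line is one flat object. Column names are discovered as records are read; a column that first appears late is backfilled with `None`, and a column missing from a record gets a `None` for that row.

```python
import json


def convert_single_pass(lines):
    """Build column name -> list of values in one pass over JSON lines.

    Hypothetical helper for illustration only; assumes each non-empty
    line holds a single flat JSON object.
    """
    columns = {}
    row_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name, value in record.items():
            if name not in columns:
                # New column: backfill None for the rows already seen.
                columns[name] = [None] * row_count
            columns[name].append(value)
        row_count += 1
        # Pad columns that were absent from this record.
        for values in columns.values():
            if len(values) < row_count:
                values.append(None)
    return columns


if __name__ == "__main__":
    sample = ['{"a": 1, "b": "x"}', '{"a": 2}', '{"c": 3.5, "a": 3}']
    for name, values in convert_single_pass(sample).items():
        print(name, values)
```

Because every column list always has the same length as the number of rows read so far, the result can be handed to the columnar conversion step without a second scan of the input.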

Jan 23 '19 16:01 sojovi

That would be great! There should be good test coverage for the results from those functions. I definitely wrote some inefficient code here to get it working, and did not return to improve it. Feel free to open a PR!

Jan 23 '19 16:01 andrewgross

I merged both functions into a single one, so it now reads and converts the whole JSON in a single loop (I don't excel as a Python dev, but I think it is a decent improvement). Regarding tests, I'll try a different version of the JSON file (more complex, more columns, etc.) and see how it works. I would like to benchmark the new version against the previous one, but I don't think that is a priority for now.

Jan 24 '19 21:01 sojovi
