json2parquet
Do less copying/passes through the data
This will never be perfect in Python, but right now it's pretty disgusting what we do when we load JSON and convert it to a columnar format.
You are reading the JSON file twice (once in _convert_data_without_schema and again in _convert_data_with_column_names), but only when the user didn't provide a schema.
I would like to work on this. I think both functions could be merged into a single function that extracts the column names and, at the same time, assigns the values to the corresponding columns.
That would be great! There should be good test coverage for the results from those functions. I definitely wrote some inefficient code here to get it working, and did not return to improve it. Feel free to open a PR!
I merged both functions into a single one, so it now reads and converts the whole JSON in a single loop (I don't excel as a Python dev, but I think it's a decent improvement). Regarding tests, I'll try a different version of the JSON file (more complex, more columns, etc.) and see how it works. I would like to benchmark the new version against the previous one, but I don't think that's a priority for now.
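For readers following along, here is a minimal sketch of the single-pass idea described above. It is not the code from the PR: the function name records_to_columns is hypothetical, and it assumes the input is newline-delimited JSON with one flat object per line. Column names are discovered as records are read, and missing values are filled with None so every column ends up the same length.

```python
import json
from typing import Any, Dict, Iterable, List


def records_to_columns(lines: Iterable[str]) -> Dict[str, List[Any]]:
    """Hypothetical helper: build column lists from NDJSON lines in one pass."""
    columns: Dict[str, List[Any]] = {}
    row_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name, value in record.items():
            if name not in columns:
                # First time this column appears: backfill earlier rows with None.
                columns[name] = [None] * row_count
            columns[name].append(value)
        row_count += 1
        # Pad any column that this record did not mention.
        for values in columns.values():
            if len(values) < row_count:
                values.append(None)
    return columns
```

The resulting dict of equal-length lists is a shape that pyarrow.Table.from_pydict accepts, so the column names and the values come out of the same loop without a second read of the file.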