json2parquet
Partition by timestamp
What is the best way to extract the year, month, and day from a timestamp column and use them to partition the Parquet output when writing to disk or s3fs?
Good question. I have done this in Spark by creating a virtual column that represents the date of a datetime column, and partitioning on that. I'm not sure there is a good way to do that here, unfortunately, unless you know the partition ahead of time and just write the data to that folder manually.
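For context, a minimal PySpark sketch of that virtual-column approach (the `timestamp` column name and the output path are illustrative):

```python
# Hypothetical sketch of the Spark approach described above: derive a date
# column from a timestamp and partition the Parquet output on it.
from pyspark.sql import functions as F

df = df.withColumn("dt", F.to_date(F.col("timestamp")))    # virtual date column
df.write.partitionBy("dt").parquet("s3://bucket/dataset/")  # one folder per date
```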
In pandas, we do the following to add year, month, and day columns to the table:
```python
import pandas as pd

# Parse the ISO-8601 strings, then derive the partition columns
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%dT%H:%M:%S.%fZ")
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
```
Is it possible to do something similar at https://github.com/andrewgross/json2parquet/blob/master/json2parquet/client.py#L67?
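For reference, once those columns exist, pandas can write a partitioned dataset directly through its pyarrow engine, sidestepping json2parquet entirely; a minimal sketch (the output path is illustrative):

```python
# Hypothetical sketch: with year/month/day columns in place, pandas (with the
# pyarrow engine) writes one folder per partition value directly.
df.to_parquet("output/", engine="pyarrow", partition_cols=["year", "month", "day"])
```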
It's possible, but it requires feeding all of the data through Python again. If you aren't concerned about performance, it can be done. I have some similar custom code that converts the data to Python datetimes, filters it by day, and writes the output files into partitioned folders.
If we can think of a good generalized implementation, it would be nice to include it.
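One possible shape for such a generalized implementation, sketched with a recent pyarrow (the function and column names are illustrative, and this is not part of json2parquet's API; it assumes the timestamp column is already an Arrow timestamp type):

```python
# Hypothetical sketch: derive partition columns from a timestamp column on a
# pyarrow Table and write a Hive-style partitioned dataset.
import pyarrow.compute as pc
import pyarrow.parquet as pq

def write_partitioned(table, root_path, timestamp_col="date"):
    ts = table.column(timestamp_col)
    table = table.append_column("year", pc.year(ts))
    table = table.append_column("month", pc.month(ts))
    table = table.append_column("day", pc.day(ts))
    # Produces folders like root_path/year=2018/month=5/day=1/
    pq.write_to_dataset(table, root_path, partition_cols=["year", "month", "day"])
```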
Data could be partitioned by a specific column while reading the JSON file (in the case of ingesting JSON data from a file), and each partition could then be stored in a different folder. That doesn't seem very difficult, but it is not exactly what @Madhu1512 was asking.
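A minimal sketch of that read-time bucketing, assuming newline-delimited JSON with an ISO-8601 timestamp field (all names here are illustrative):

```python
# Hypothetical sketch: bucket newline-delimited JSON records by a partition
# key while reading, so each bucket can be converted and written separately.
import json
from collections import defaultdict

def partition_records(path, key_func):
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            buckets[key_func(record)].append(record)
    return buckets

# Group ISO-8601 timestamps like "2018-05-01T12:00:00.000Z" by calendar day
buckets = partition_records("data.json", lambda r: r["date"][:10])
```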
Currently I pre-process my JSON into different batches for partitioning. I did not include it here originally because I wasn't sure of a good balance between functionality (partitioning) and delivery (writing output to disk, S3, etc.). There are certainly ways around this, but I did not implement them at the time. Open to ideas and PRs though!
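For illustration, that pre-processing might look roughly like the following, assuming json2parquet's ingest_data and write_parquet helpers and the buckets produced by the sketch above (the paths, the day= layout, and the file name are illustrative):

```python
# Hypothetical sketch of pre-processing into batches: write each day's bucket
# of records to its own folder via json2parquet's helpers.
import os
from json2parquet import ingest_data, write_parquet

def write_batches(buckets, schema, root_path):
    for day, records in buckets.items():
        folder = os.path.join(root_path, "day={}".format(day))
        os.makedirs(folder, exist_ok=True)
        data = ingest_data(records, schema)  # list of dicts -> Arrow data
        write_parquet(data, os.path.join(folder, "part-0.parquet"))
```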