json2parquet
Partition by timestamp
What is the best way to extract the year, month, and day from a timestamp column and use them to partition the Parquet output when writing to disk or s3fs?
Good question. I have done this in Spark by creating a virtual column that represents the date of a datetime column, and partitioning on that. I'm not sure there is a good way to do that here, unfortunately, unless you know the partition ahead of time and just write the data to that folder manually.
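For context, a minimal PySpark sketch of that virtual-column approach (the `timestamp` column name and the output path are illustrative):

```python
# Hypothetical sketch of the Spark approach described above: derive a date
# column from a timestamp and partition the Parquet output on it.
from pyspark.sql import functions as F

df = df.withColumn("dt", F.to_date(F.col("timestamp")))    # virtual date column
df.write.partitionBy("dt").parquet("s3://bucket/dataset/")  # one folder per date
```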
In pandas, we do the following to add year, month, and day columns to the table:
```python
import pandas as pd

# Parse the ISO-8601 strings, then derive the partition columns
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%dT%H:%M:%S.%fZ")
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
```
Is it possible to do something similar at https://github.com/andrewgross/json2parquet/blob/master/json2parquet/client.py#L67?
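For reference, once those columns exist, pandas can write a partitioned dataset directly through its pyarrow engine, sidestepping json2parquet entirely; a minimal sketch (the output path is illustrative):

```python
# Hypothetical sketch: with year/month/day columns in place, pandas (with the
# pyarrow engine) writes one folder per partition value directly.
df.to_parquet("output/", engine="pyarrow", partition_cols=["year", "month", "day"])
```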
It's possible, but it requires feeding all of the data through Python again. If you aren't concerned about performance, it can be done. I have some similar custom code that converts the data to Python datetimes, filters it by day, and writes the output files into partitioned folders.
If we can think of a good generalized implementation, it would be nice to include it.
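One possible shape for such a generalized implementation, sketched with a recent pyarrow (the function and column names are illustrative, and this is not part of json2parquet's API; it assumes the timestamp column is already an Arrow timestamp type):

```python
# Hypothetical sketch: derive partition columns from a timestamp column on a
# pyarrow Table and write a Hive-style partitioned dataset.
import pyarrow.compute as pc
import pyarrow.parquet as pq

def write_partitioned(table, root_path, timestamp_col="date"):
    ts = table.column(timestamp_col)
    table = table.append_column("year", pc.year(ts))
    table = table.append_column("month", pc.month(ts))
    table = table.append_column("day", pc.day(ts))
    # Produces folders like root_path/year=2018/month=5/day=1/
    pq.write_to_dataset(table, root_path, partition_cols=["year", "month", "day"])
```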
Data could be partitioned by a specific column while reading the JSON file (in the case of ingesting JSON data from a file), and each partition could then be stored in a different folder. That doesn't seem very difficult, but it is not exactly what @Madhu1512 was asking.
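A minimal sketch of that read-time bucketing, assuming newline-delimited JSON with an ISO-8601 timestamp field (all names here are illustrative):

```python
# Hypothetical sketch: bucket newline-delimited JSON records by a partition
# key while reading, so each bucket can be converted and written separately.
import json
from collections import defaultdict

def partition_records(path, key_func):
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            buckets[key_func(record)].append(record)
    return buckets

# Group ISO-8601 timestamps like "2018-05-01T12:00:00.000Z" by calendar day
buckets = partition_records("data.json", lambda r: r["date"][:10])
```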
Currently I pre-process my JSON into different batches for partitioning. I did not include it here originally because I wasn't sure of a good balance between functionality (partitioning) and delivery (writing output to disk, S3, etc.). There are certainly ways around this, but I did not implement them at the time. Open to ideas and PRs though!
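For illustration, that pre-processing might look roughly like the following, assuming json2parquet's ingest_data and write_parquet helpers and the buckets produced by the sketch above (the paths, the day= layout, and the file name are illustrative):

```python
# Hypothetical sketch of pre-processing into batches: write each day's bucket
# of records to its own folder via json2parquet's helpers.
import os
from json2parquet import ingest_data, write_parquet

def write_batches(buckets, schema, root_path):
    for day, records in buckets.items():
        folder = os.path.join(root_path, "day={}".format(day))
        os.makedirs(folder, exist_ok=True)
        data = ingest_data(records, schema)  # list of dicts -> Arrow data
        write_parquet(data, os.path.join(folder, "part-0.parquet"))
```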