
Partition by timestamp

Open Madhu1512 opened this issue 6 years ago • 5 comments

What is the best way to extract year, month, day from timestamp and use for partition of parquet when writing to disk or s3fs?

Madhu1512 avatar Mar 08 '18 01:03 Madhu1512

Good question. I have done this in Spark by creating a virtual column that represents the date of a datetime column, and partitioning on that. Not sure if there is a good way to do that here unfortunately, unless you know ahead of time and just write the data to that folder manually.

andrewgross avatar Mar 08 '18 01:03 andrewgross

In pandas, we add year, month, and day columns to the table the following way:

```python
df['date'] = df['date'].map(lambda t: pd.to_datetime(t, format="%Y-%m-%dT%H:%M:%S.%fZ"))
df['year'] = df['date'].apply(lambda x: x.year)
df['month'] = df['date'].apply(lambda x: x.month)
df['day'] = df['date'].apply(lambda x: x.day)
```

Is it possible to do something similar at https://github.com/andrewgross/json2parquet/blob/master/json2parquet/client.py#L67?

Madhu1512 avatar Mar 08 '18 02:03 Madhu1512

It's possible, but it requires feeding all of the data through Python again. If you aren't concerned about performance, it can be done. I have some similar custom code that converts data to Python datetimes, filters by day, and writes the output files into partitioned folders.

If we can think of a good generalized implementation it would be nice to include it.

andrewgross avatar Apr 09 '18 21:04 andrewgross

Data could be partitioned by a specific column while reading the JSON file (in the case of ingesting JSON data from a file), and then each partition stored in a different folder. That doesn't seem very difficult, but it is not exactly what @Madhu1512 was asking.

sojovi avatar Feb 01 '19 01:02 sojovi

Currently I pre-process my JSON into different batches for partitioning. I did not include this here originally because I wasn't sure of a good balance between functionality (partitioning) and delivery (writing output to disk, S3, etc.). There are certainly ways around this, but I did not implement them at the time. Open to ideas and PRs, though!
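The pre-processing step described above can be sketched in plain Python: group the incoming records by day before handing each batch to the writer. This is not json2parquet code, just a minimal illustration assuming each record carries an ISO-8601 timestamp in a field named `date` (as in the earlier pandas example):

```python
from collections import defaultdict
from datetime import datetime

def batch_by_day(records, ts_field="date"):
    """Group records into {(year, month, day): [records]} batches."""
    batches = defaultdict(list)
    for rec in records:
        ts = datetime.strptime(rec[ts_field], "%Y-%m-%dT%H:%M:%S.%fZ")
        batches[(ts.year, ts.month, ts.day)].append(rec)
    return batches

records = [
    {"date": "2018-03-08T01:00:00.000Z", "v": 1},
    {"date": "2018-03-08T02:00:00.000Z", "v": 2},
    {"date": "2018-03-09T00:30:00.000Z", "v": 3},
]
batches = batch_by_day(records)
# Two batches: (2018, 3, 8) with two records, (2018, 3, 9) with one.
```

Each batch can then be written to its own `year=/month=/day=` folder, which keeps the partitioning concern separate from json2parquet's conversion and delivery code.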

andrewgross avatar Feb 21 '19 02:02 andrewgross