Parquet.jl
DateTime reader support
Is it currently possible to read data in as a DateTime? If not, what would need to be done for this to be added?
(Sort of a partner to #108, although I don't know how related reader and writer support are.)
Current behaviour seems to be to read datetimes in as Int64 values.
For example, generating some data in Python:
>>> import pandas as pd
>>>
>>> t1 = pd.Timestamp('2018-01-01 06:00:00+0000', tz='UTC')
>>> t2 = pd.Timestamp('2018-01-01 07:00:00+0000', tz='UTC')
>>> df = pd.DataFrame([t1, t2], columns=["datetime_utc"])
>>> df["datetime_utc"].dtype
datetime64[ns, UTC]
>>>
>>> df.to_parquet("datetimes.parquet")
and then reading it in Julia:
julia> pq_file = Parquet.File("datetimes.parquet")
Parquet file: datetimes.parquet
version: 1
nrows: 2
created by: parquet-cpp version 1.5.1-SNAPSHOT
cached: 0 column chunks
julia> schema(pq_file)
Schema:
required schema {
optional INT64 datetime_utc # (from TIMESTAMP_MICROS)
}
I've tried to use the map_logical_types keyword, for example Dict(["datetime_utc"] => (DateTime, Parquet.logical_timestamp)), but this errors with ERROR: unsupported storage type 2 for DateTime.
I think this might just be a bug on this line, where the storage type listed is wrong or incomplete: https://github.com/JuliaIO/Parquet.jl/blob/a21df68a57add5b6c48902f4ec775146fe0ef3a1/src/codec.jl#L227
The INT96 is defined here
https://github.com/JuliaIO/Parquet.jl/blob/a21df68a57add5b6c48902f4ec775146fe0ef3a1/src/PAR2/PAR2_types.jl#L11
From the same file: type 2 is INT64 (INT32 is type 1), which matches the INT64 column shown in the schema above.
I suspect a branch for INT64 needs to be added.
We need to have an implementation that can decode Int64 logical timestamps and then plug it in there.
This is the format specification: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp.
The Parquet.logical_timestamp method currently handles only the Int96 format and can't be used to decode Int64-encoded timestamps.
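Until that lands, the raw Int64 values can be converted after reading. Per the format specification, an INT64 timestamp counts milliseconds, microseconds, or nanoseconds since the Unix epoch, and TIMESTAMP_MICROS (as in the schema above) means microseconds. Below is a minimal sketch of such a conversion using only the Dates standard library. The name int64_to_datetime and its unit keyword are hypothetical and not part of Parquet.jl, and how the raw column values are obtained from the file is not shown.

```julia
using Dates

# Hypothetical helper, not part of Parquet.jl: converts a raw Int64 timestamp
# (time since the Unix epoch, in the given unit) to a DateTime.
const UNIX_EPOCH = DateTime(1970, 1, 1)

function int64_to_datetime(raw::Int64; unit::Symbol = :micros)
    # DateTime only has millisecond resolution, so finer units are floored.
    millis = if unit === :millis
        raw
    elseif unit === :micros
        fld(raw, 1_000)
    elseif unit === :nanos
        fld(raw, 1_000_000)
    else
        throw(ArgumentError("unsupported timestamp unit: $unit"))
    end
    return UNIX_EPOCH + Millisecond(millis)
end

# The example column is `optional`, so pass missing values through unchanged.
int64_to_datetime(::Missing; unit::Symbol = :micros) = missing

# 2018-01-01T06:00:00 UTC expressed in microseconds since the epoch.
raw = 1_514_786_400_000_000
int64_to_datetime(raw)  # DateTime(2018, 1, 1, 6)
```

For a whole column this can be broadcast, for example int64_to_datetime.(values; unit = :micros). The result carries no time zone, so for columns written as UTC-adjusted (as pandas does for tz-aware data) the DateTime values should be read as UTC.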