Implement TimeStampXXXTZVector for parquet isAdjustedToUTC timestamp columns
Hi. I will PR this.
The Python script at the end of this description (1) generates a Parquet file with UTC-adjusted timestamp columns in µs, ns and ms precision.
Reading that file with https://github.com/Kotlin/dataframe/pull/577 fails: org.jetbrains.kotlinx.dataframe.io.ArrowReadingImplKt#readField
throws NotImplementedError("reading from TimeStampXXXTZVector is not implemented").
This PR implements reading from the TimeStampXXXTZVector types,
following https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
Please note that the Parquet file format has no seconds precision, only MILLIS, MICROS and NANOS,
so my implementation of TimeStampSecTZVector
derives seconds from milliseconds.
With this change applied on top of my local checkout of #577, the µs, ns and ms precisions are read correctly, as shown by this test and its output:
@Test
fun testReadTimestamp() {
    val frame = DataFrame.readParquet(
        URL("file:/home/lperez/Bureau/work/pocs/lavaret/python/timestamps_with_utc_and_local.parquet")
    )
    val columnTypes = frame.columnTypes()
    println("columnTypes: $columnTypes")
    println(frame)
}
columnTypes: [kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime, kotlinx.datetime.LocalDateTime]
  timestamp_utc              timestamp_local            timestamp_brussels         timestamp_nanos               timestamp_millis
0 2024-01-01T12:00:00.123456 2024-01-01T12:00:00.123456 2024-01-01T11:00:00.123456 2024-01-01T12:00:00.123456789 2024-01-01T12:00:00.123
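The timestamp_brussels column reads back as 11:00 rather than 12:00 because the generator script converts Brussels local noon to UTC before writing. A minimal standard-library sketch of that conversion follows. It assumes a fixed UTC+1 offset as a stand-in for Europe/Brussels in January (CET), which avoids depending on a time zone database. The names `BRUSSELS_WINTER`, `local_noon` and `as_utc` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+1 offset stands in for Europe/Brussels on 2024-01-01 (CET).
BRUSSELS_WINTER = timezone(timedelta(hours=1), "CET")

# Noon in Brussels, with the same microsecond fraction as the sample data.
local_noon = datetime(2024, 1, 1, 12, 0, 0, 123456, tzinfo=BRUSSELS_WINTER)

# Converting to UTC moves the wall-clock time back by one hour.
as_utc = local_noon.astimezone(timezone.utc)
print(as_utc.isoformat())  # prints 2024-01-01T11:00:00.123456+00:00
```

Dropping the offset from `as_utc` gives the wall-clock value shown in the timestamp_brussels column above.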
(1)
zsh 10474 (git)-[main]-% python3 create-timestamp-parquet.py
shape: (1, 5)
┌─────────────────────┬─────────────────┬────────────────────────────────┬──────────────────────┬─────────────────────────────┐
│ timestamp_utc       ┆ timestamp_local ┆ timestamp_brussels             ┆ timestamp_nanos      ┆ timestamp_millis            │
│ ---                 ┆ ---             ┆ ---                            ┆ ---                  ┆ ---                         │
│ datetime[μs, UTC]   ┆ datetime[μs]    ┆ datetime[μs, UTC]              ┆ datetime[ns, UTC]    ┆ datetime[ms, UTC]           │
╞═════════════════════╪═════════════════╪════════════════════════════════╪══════════════════════╪═════════════════════════════╡
│ 2024-01-01          ┆ 2024-01-01      ┆ 2024-01-01 11:00:00.123456 UTC ┆ 2024-01-01           ┆ 2024-01-01 12:00:00.123 UTC │
│ 12:00:00.123456 UTC ┆ 12:00:00.123456 ┆                                ┆ 12:00:00.123456789 … ┆                             │
└─────────────────────┴─────────────────┴────────────────────────────────┴──────────────────────┴─────────────────────────────┘
zsh 10345 (git)-[main]-% parquet-tools inspect timestamps_with_utc_and_local.parquet
############ Column(timestamp_brussels) ############
name: timestamp_brussels
path: timestamp_brussels
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MICROS
compression: ZSTD (space_saved: -26%)
############ Column(timestamp_nanos) ############
name: timestamp_nanos
path: timestamp_nanos
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=nanoseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): NONE
compression: ZSTD (space_saved: -26%)
############ Column(timestamp_millis) ############
name: timestamp_millis
path: timestamp_millis
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: Timestamp(isAdjustedToUTC=true, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)
converted_type (legacy): TIMESTAMP_MILLIS
compression: ZSTD (space_saved: -26%)
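Each of these columns stores a raw INT64 count of ticks since the Unix epoch, and the timeUnit in the logical type says how long one tick is. The sketch below shows how such a value could be decoded, including the seconds-from-milliseconds derivation mentioned in the description. It is an illustration in Python, not the Kotlin code from this PR. The helper names `to_utc` and `millis_to_seconds` are hypothetical, and using floor division for values before 1970 is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Ticks per second for each unit. Parquet itself defines only ms, us and ns;
# "s" is included because a seconds value can be derived from milliseconds.
TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}


def to_utc(value: int, unit: str):
    """Decode an epoch tick count (hypothetical helper).

    Returns (UTC datetime truncated to microseconds, nanosecond-of-second),
    because Python's datetime cannot hold nanosecond precision.
    """
    ticks = TICKS_PER_SECOND[unit]
    # divmod floors, so the remainder stays non-negative for pre-1970 values.
    seconds, remainder = divmod(value, ticks)
    nanos = remainder * (1_000_000_000 // ticks)
    dt = EPOCH + timedelta(seconds=seconds, microseconds=nanos // 1_000)
    return dt, nanos


def millis_to_seconds(millis: int) -> int:
    """Derive whole epoch seconds from epoch milliseconds (assumption: floor)."""
    return millis // 1_000


# 2024-01-01T12:00:00 UTC is 1_704_110_400 seconds after the epoch.
print(to_utc(1_704_110_400_123_456, "us"))
print(to_utc(1_704_110_400_123_456_789, "ns"))
print(millis_to_seconds(1_704_110_400_123))
```

Returning the nanosecond remainder separately keeps the ns column lossless, even though the datetime value is truncated to microseconds.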
(1)
import polars as pl
import pandas as pd
import pyarrow as pa

df = pl.DataFrame({
    "timestamp_utc": [
        pd.Timestamp('2024-01-01 12:00:00.123456', tz='UTC').to_pydatetime(),  # UTC timestamp
    ],
    "timestamp_local": [
        pd.Timestamp('2024-01-01 12:00:00.123456').to_pydatetime()  # Local timestamp without timezone
    ],
    "timestamp_brussels": [
        pd.Timestamp('2024-01-01 12:00:00.123456', tz='Europe/Brussels').tz_convert('UTC').to_pydatetime()  # Brussels time converted to UTC
    ],
    "timestamp_nanos": [
        '2024-01-01 12:00:00.123456789'
    ],
    "timestamp_millis": [
        '2024-01-01 12:00:00.123'
    ]
}).with_columns(
    pl.col("timestamp_nanos").str.to_datetime("%F %X.%9f", time_unit="ns")
    .dt.replace_time_zone("UTC")
).with_columns(
    pl.col("timestamp_millis").str.to_datetime("%F %X.%3f", time_unit="ms")
    .dt.replace_time_zone("UTC")
)

df.write_parquet("timestamps_with_utc_and_local.parquet")
print(df)