fastparquet

The parquet format specification is not followed for Interval type (i.e. timedeltas)

Open mgab opened this issue 1 year ago • 3 comments

Describe the issue:

The way timedelta values (a.k.a. durations, intervals...) are stored in parquet does not follow the file format specification. According to the parquet specification, the logical type Interval should be stored as:

INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date. (...)
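As a rough illustration of that layout, here is a sketch that uses only Python's standard `struct` module. It is not fastparquet code, and the helper names `encode_interval` and `decode_interval` are made up for this example:

```python
import struct

# Three little-endian unsigned 32-bit integers: months, days, milliseconds.
# (Hypothetical helper module, written only to illustrate the byte layout.)
_INTERVAL = struct.Struct("<III")


def encode_interval(months: int, days: int, millis: int) -> bytes:
    """Pack the three components into the 12-byte INTERVAL layout."""
    return _INTERVAL.pack(months, days, millis)


def decode_interval(raw: bytes) -> tuple:
    """Unpack a 12-byte INTERVAL value into (months, days, milliseconds)."""
    return _INTERVAL.unpack(raw)


# A 30-second duration: 0 months, 0 days, 30000 milliseconds.
raw = encode_interval(0, 0, 30_000)
print(len(raw))              # 12
print(decode_interval(raw))  # (0, 0, 30000)
```

Because the three components are unsigned and the finest unit is the millisecond, this layout cannot hold negative durations or sub-millisecond precision, both of which a pandas timedelta can carry.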

Currently, fastparquet does not follow the format specification for this type. This hurts interoperability whenever a file contains a field of this type: fastparquet may not correctly read files written by other tools, and other tools may not correctly read files written by fastparquet.

I guess it might be a known issue rather than a bug, but I couldn't find info about it.

Minimal Complete Verifiable Example:

import pandas as pd
from fastparquet import write

df = pd.DataFrame([{'seconds': 30, 'duration': pd.to_timedelta(30, unit='seconds')}])

write('/test/test.parquet', df)

Then inspect the schema with hangxie/parquet-tools, ktrueda/parquet-tools or any similar tool. It looks like:

{"Tag":"name=Schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, convertedtype=TIME_MICROS, repetitiontype=OPTIONAL"}
]}

instead of something along the lines of

{"Tag":"name=Duckdb_schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT32, convertedtype=INT_32, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, length=12, repetitiontype=OPTIONAL"}
]}

Anything else we need to know?:

There's a bit more context on this StackOverflow question

Environment:

  • Pandas version: 2.2.2
  • Python version: 2024.5.0
  • Operating System: macOS 14.6.1
  • Install method (conda, pip, source): pip

mgab avatar Oct 03 '24 15:10 mgab

It isn't "known" in the sense that anyone has raised this before, but the INTERVAL type is a particularly unwieldy encoding, as you can see. pyarrow does not use it, but stores the data as INT64 like fastparquet.

martindurant avatar Oct 03 '24 18:10 martindurant

Fair, and yet fastparquet and pyarrow do not seem to be compatible with each other when writing and reading this type in a parquet file:

  • writing a timedelta with fastparquet and loading it with pyarrow transforms it to a datetime.time
  • writing a timedelta with pyarrow and loading it with fastparquet transforms it to an int

The timedelta type is preserved only when the file is read with the same tool that wrote it (either of the two).
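A possible explanation for the first case, sketched with the standard library only (this does not call fastparquet or pyarrow, and the variable names are just for illustration): the schema above annotates the column as TIME_MICROS, which means microseconds since midnight. Assuming the duration is stored as a plain microsecond count, a reader that honours the annotation would turn it into a time of day rather than a duration.

```python
from datetime import datetime, timedelta

duration = timedelta(seconds=30)

# Assumption: the INT64 column holds the duration as a count of microseconds.
stored = duration // timedelta(microseconds=1)

# A TIME_MICROS value is read as microseconds elapsed since midnight,
# so the same integer becomes a time of day.
as_time_of_day = (datetime.min + timedelta(microseconds=stored)).time()

print(stored)          # 30000000
print(as_time_of_day)  # 00:00:30
```

Under that reading, any duration of 24 hours or more has no valid time-of-day equivalent at all, so the mismatch is not just a matter of the Python type that comes back.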

In any case, what would be the proper solution? Would a PR that implements the format specification for the INTERVAL type be desirable? Would there be any concern about the compatibility against pyarrow?

mgab avatar Oct 04 '24 10:10 mgab

Would a PR that implements the format specification for the INTERVAL type be desirable?

You are welcome to try, but I think it might be a little work. It is not a high priority for me (we have had this model for a long time!). Fixing reading arrow with the INT encoding is perhaps more important.

martindurant avatar Oct 09 '24 14:10 martindurant