Flexible Arrow/Parquet data schema

Open eugeneteoh opened this issue 2 years ago • 2 comments

Description

Currently, Parquet has a fixed schema, requiring attributes/columns to be named in a specific way (start, target, feat_dynamic_real, feat_static_cat, etc.). It would be useful if these attributes could be defined similarly to PandasDataset. Something I have in mind is:

gluonts.dataset.arrow.ParquetFile(
    path,
    item_id, # Item ID column name
    target, # Target column name
    timestamp, # Timestamp column name
    feat_dynamic_real, # Scalar or list of columns of dynamic real features
    ...
)
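
A minimal sketch of the renaming idea, assuming each row (one time series) is available as a plain dict. `rename_fields` and the column names `store`, `first_day`, and `sales` are hypothetical and not part of GluonTS; only the target names `item_id`, `start`, and `target` come from the fixed schema.

```python
# Hypothetical helper, not part of GluonTS: rename user-chosen columns
# to the fixed field names that the schema expects.
def rename_fields(record, column_map):
    """Return a copy of `record` with keys renamed according to `column_map`.

    `column_map` maps a user-chosen column name to a fixed schema name,
    e.g. {"sales": "target"}. Columns that are not listed keep their name.
    """
    renamed = {}
    for name, value in record.items():
        new_name = column_map.get(name, name)
        if new_name in renamed:
            # Two source columns would end up under the same field name.
            raise ValueError(f"duplicate field after renaming: {new_name!r}")
        renamed[new_name] = value
    return renamed


# One row with custom column names (illustrative data).
row = {"store": "A", "first_day": "2022-01-01", "sales": [3.0, 5.0, 4.0]}

entry = rename_fields(
    row,
    {"store": "item_id", "first_day": "start", "sales": "target"},
)
print(entry)
```

Because the mapping is applied per row, the same approach would work whether the rows come from Parquet or from another row-oriented source.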

eugeneteoh, Aug 24 '22 14:08

I think this may apply not just to Arrow/Parquet but also to JSON Lines. Also, there may be multiple columns containing, e.g., features that one would want to stack together.
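
A rough sketch of the stacking step, assuming a row is a plain dict whose feature columns each hold one value per time step. `stack_columns` and the column names `price` and `promo` are hypothetical; the sketch places the result under `feat_dynamic_real` with one inner list per feature.

```python
# Hypothetical helper, not part of GluonTS: combine several per-time-step
# columns into a single two-dimensional field.
def stack_columns(record, columns, field="feat_dynamic_real"):
    """Return a copy of `record` where `columns` are stacked into `field`.

    The stacked field is a list of lists with one inner list per column,
    i.e. shape (number of features, number of time steps).
    """
    rows = [list(record[name]) for name in columns]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all stacked columns must have the same length")
    # Keep every column that was not stacked, then add the combined field.
    stacked = {k: v for k, v in record.items() if k not in columns}
    stacked[field] = rows
    return stacked


# Illustrative row with two dynamic feature columns.
row = {
    "target": [1.0, 2.0, 3.0],
    "price": [9.5, 9.5, 8.0],
    "promo": [0, 0, 1],
}

entry = stack_columns(row, ["price", "promo"])
print(entry["feat_dynamic_real"])
```

The order of `columns` determines the order of the features in the stacked field, so it should be kept consistent between training and prediction data.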

lostella, Aug 24 '22 15:08

The ParquetFile abstraction should be pretty generic, not imposing any fields.

However, FileDataset does make assumptions about how columns are named.

Supporting this more generally is something we are looking into.

jaheba, Aug 24 '22 15:08