gluonts
Flexible Arrow/Parquet data schema
Description
Currently, Parquet datasets have a fixed schema that requires attributes/columns to be named in a specific way (start, target, feat_dynamic_real, feat_static_cat, etc.). It would be useful if these attributes could be defined in a way similar to PandasDataset. Something I have in mind is:
gluonts.dataset.arrow.ParquetFile(
    path,
    item_id,  # Item ID column name
    target,  # Target column name
    timestamp,  # Timestamp column name
    feat_dynamic_real,  # Scalar or list of columns of dynamic real features
    ...
)
I think this may apply not only to Arrow/Parquet but possibly also to JSON Lines. Also, there may be multiple columns that contain, e.g., features that one would want to stack together.
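To make the idea concrete, here is a minimal sketch of the column mapping in plain Python, independent of any file format. It is not GluonTS code: `remap_records` and `_as_list` are hypothetical helpers, and the example column names (`store`, `first_day`, `sales`, `price`, `promo`) are made up. It assumes each record already holds one full time series, and that stacked features should end up as one row per feature column.

```python
from typing import Any, Dict, Iterable, Iterator, List, Sequence, Union

# A column spec is either a single column name or a list of names.
ColumnSpec = Union[str, Sequence[str]]


def _as_list(spec: ColumnSpec) -> List[str]:
    """Normalize a single column name or a sequence of names to a list."""
    return [spec] if isinstance(spec, str) else list(spec)


def remap_records(
    records: Iterable[Dict[str, Any]],
    item_id: str,
    target: str,
    timestamp: str,
    feat_dynamic_real: ColumnSpec = (),
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: map user-named columns to the fixed field names."""
    dynamic_columns = _as_list(feat_dynamic_real)
    for record in records:
        entry = {
            "item_id": record[item_id],
            "start": record[timestamp],
            "target": record[target],
        }
        if dynamic_columns:
            # Stack the selected columns into one 2-D feature list,
            # one row per feature column.
            entry["feat_dynamic_real"] = [
                list(record[column]) for column in dynamic_columns
            ]
        yield entry


# Made-up input with column names that do not follow the fixed schema.
rows = [
    {
        "store": "A",
        "first_day": "2021-01-01",
        "sales": [1.0, 2.0, 3.0],
        "price": [9.9, 9.9, 8.5],
        "promo": [0.0, 1.0, 0.0],
    }
]

entries = list(
    remap_records(
        rows,
        item_id="store",
        target="sales",
        timestamp="first_day",
        feat_dynamic_real=["price", "promo"],
    )
)
```

Because the mapping works on plain dictionaries, the same approach could in principle sit on top of Parquet, Arrow, or JSON Lines readers alike.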
The ParquetFile abstraction should be pretty generic, not imposing any fields. However, FileDataset does make assumptions about what the columns are named.
Supporting this more generally is something we are looking into.