Should `partition_on` columns be included in the pandas_metadata?

Open TomAugspurger opened this issue 8 years ago • 2 comments

I think they should be. Just checking if there was a reason they weren't @martindurant

In [20]: import pandas as pd

In [21]: import fastparquet as fp

In [22]: import json

In [23]: df = pd.DataFrame({"A": [1, 2], 'B': [3, 4]}, index=pd.Index(['a', 'b'], name='C'))

In [24]: fp.write("foo.parq", df, partition_on=['B'], file_scheme='hive')

In [25]: json.loads(fp.ParquetFile("foo.parq").fmd.key_value_metadata[0].value)
Out[25]:
{'columns': [{'metadata': None,
   'name': 'C',
   'numpy_type': 'object',
   'pandas_type': 'unicode'},
  {'metadata': None,
   'name': 'A',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['C'],
 'pandas_version': '0.22.0.dev0+131.g63e8527d3'}

TomAugspurger, Dec 07 '17 12:12

I suppose they should be in the global metadata, but not in the individual data files. Is it acceptable for the metadata to differ between the two places? You could include them in the data files only if you explicitly ignore them on load.

martindurant, Dec 07 '17 14:12
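To make the "ignore them on load" option concrete, here is a rough sketch in plain Python. It is not fastparquet's actual loading code: `ignore_partition_columns` is a hypothetical helper, and the per-file metadata dict is hand-written to mirror the shape of the output above, with a `B` entry added by assumption.

```python
import copy


def ignore_partition_columns(pandas_metadata, partition_on):
    """Return a copy of the metadata without entries for partition columns.

    Hypothetical helper: a reader could apply this to a data file's
    pandas_metadata, since partition values come from the directory path.
    """
    cleaned = copy.deepcopy(pandas_metadata)
    cleaned["columns"] = [
        col for col in cleaned["columns"] if col["name"] not in partition_on
    ]
    return cleaned


# Assumed per-file metadata that (unlike the output above) lists 'B'.
file_metadata = {
    "columns": [
        {"metadata": None, "name": "C", "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A", "numpy_type": "int64", "pandas_type": "int64"},
        {"metadata": None, "name": "B", "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
}

loaded = ignore_partition_columns(file_metadata, partition_on=["B"])
print([col["name"] for col in loaded["columns"]])
```

The input dict is left unmodified, so the same file metadata could still be read standalone.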

I would also include this in the `_common_metadata` and `_metadata` files but not in the individual files. The individual files should contain exactly the information that is needed to load them standalone. The `*metadata` files, by contrast, describe the schema of the whole dataset/table.

xhochy, Dec 07 '17 16:12
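One way to picture the split proposed in this thread: the per-file metadata stays as it is, and the dataset-level metadata is derived from it by appending the partition columns. The sketch below is only an illustration in plain Python. `dataset_level_metadata` is a hypothetical name, and describing `B` as `int64` is an assumption; a real writer may record partition columns with a different type.

```python
import copy


def dataset_level_metadata(file_metadata, partition_columns):
    """Build dataset-wide pandas_metadata from per-file metadata.

    Hypothetical helper. `partition_columns` maps column name to a dtype
    string, e.g. {"B": "int64"}; the dtype choice is an assumption here.
    """
    combined = copy.deepcopy(file_metadata)
    known = {col["name"] for col in combined["columns"]}
    for name, dtype in partition_columns.items():
        if name not in known:
            combined["columns"].append(
                {"metadata": None, "name": name,
                 "numpy_type": dtype, "pandas_type": dtype}
            )
    return combined


# Per-file metadata, shaped like the output shown in the issue.
per_file = {
    "columns": [
        {"metadata": None, "name": "C", "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A", "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
}

dataset = dataset_level_metadata(per_file, {"B": "int64"})
print([col["name"] for col in dataset["columns"]])
```

With this layout the two places do disagree, but only in one direction: the dataset-level columns are a superset of what each data file declares.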