fastparquet: Should `partition_on` columns be included in the pandas_metadata?

I think they should be. Just checking whether there was a reason they weren't, @martindurant. In the example below, the partition column `B` is missing from the stored pandas metadata:

In [20]: import pandas as pd

In [21]: import fastparquet as fp

In [22]: import json

In [23]: df = pd.DataFrame({"A": [1, 2], 'B': [3, 4]}, index=pd.Index(['a', 'b'], name='C'))

In [24]: fp.write("foo.parq", df, partition_on=['B'], file_scheme='hive')

In [25]: json.loads(fp.ParquetFile("foo.parq").fmd.key_value_metadata[0].value)
Out[25]:
{'columns': [{'metadata': None,
   'name': 'C',
   'numpy_type': 'object',
   'pandas_type': 'unicode'},
  {'metadata': None,
   'name': 'A',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['C'],
 'pandas_version': '0.22.0.dev0+131.g63e8527d3'}

Dec 07 '17 12:12 TomAugspurger
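The output above describes `C` and `A` but has no entry for the partition column `B`. A minimal sketch of a check for that gap, working only on the decoded JSON (the helper name `missing_partition_columns` is hypothetical and not part of fastparquet):

```python
# Decoded pandas metadata, copied from the session output above.
pandas_metadata = {
    "columns": [
        {"metadata": None, "name": "C",
         "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A",
         "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
    "pandas_version": "0.22.0.dev0+131.g63e8527d3",
}


def missing_partition_columns(metadata, partition_on):
    """Return partition columns that have no entry in metadata['columns'].

    Hypothetical helper for illustration; not a fastparquet API.
    """
    described = {col["name"] for col in metadata["columns"]}
    return [name for name in partition_on if name not in described]


missing = missing_partition_columns(pandas_metadata, ["B"])
print(missing)
```

For the frame written with `partition_on=['B']`, this reports `B` as undescribed.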

I suppose they should be in the global metadata, but not in the individual data files. Is it acceptable for the metadata to differ between the two places? You could have them in the data files only if you explicitly ignore them on load.

Dec 07 '17 14:12 martindurant

I would also include this in the _common_metadata and _metadata files but not in the individual files. The individual files should contain exactly the information that is needed to load them standalone. The *metadata files, by contrast, describe the schema of the whole dataset/table.

Dec 07 '17 16:12 xhochy
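One way to picture the split suggested in the comments above: keep each data file's pandas metadata as it is, and derive a dataset-level copy that also describes the partition columns. This is only a sketch on plain dicts. The helper `with_partition_columns` and its `partition_types` argument are hypothetical, and the `int64` types for `B` are an assumption taken from the original frame, not something fastparquet is confirmed to write.

```python
import copy


def with_partition_columns(file_metadata, partition_types):
    """Return a dataset-level copy of file_metadata that also lists partition columns.

    partition_types maps column name -> (pandas_type, numpy_type).
    Hypothetical helper for illustration; not a fastparquet API.
    """
    # Deep copy so the per-file metadata stays loadable standalone, unchanged.
    dataset_metadata = copy.deepcopy(file_metadata)
    described = {col["name"] for col in dataset_metadata["columns"]}
    for name, (pandas_type, numpy_type) in partition_types.items():
        if name not in described:
            dataset_metadata["columns"].append({
                "metadata": None,
                "name": name,
                "numpy_type": numpy_type,
                "pandas_type": pandas_type,
            })
    return dataset_metadata


# Per-file metadata as in the example at the top of the thread.
file_metadata = {
    "columns": [
        {"metadata": None, "name": "C",
         "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A",
         "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
    "pandas_version": "0.22.0.dev0+131.g63e8527d3",
}

# Assumed types for the partition column B (int64 in the original frame).
dataset_metadata = with_partition_columns(file_metadata, {"B": ("int64", "int64")})

file_names = [col["name"] for col in file_metadata["columns"]]
dataset_names = [col["name"] for col in dataset_metadata["columns"]]
print(file_names, dataset_names)
```

The per-file dict is left untouched, so the two copies of the metadata differ only by the appended partition-column entries.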

Should `partition_on` columns be included in the pandas_metadata?