fastparquet
Should `partition_on` columns be included in the pandas_metadata?
I think they should be. @martindurant, was there a reason they weren't? In the example below, the partition column `B` is absent from the `columns` list.
In [20]: import pandas as pd
In [21]: import fastparquet as fp
In [22]: import json
In [23]: df = pd.DataFrame({"A": [1, 2], 'B': [3, 4]}, index=pd.Index(['a', 'b'], name='C'))
In [24]: fp.write("foo.parq", df, partition_on=['B'], file_scheme='hive')
In [25]: json.loads(fp.ParquetFile("foo.parq").fmd.key_value_metadata[0].value)
Out[25]:
{'columns': [{'metadata': None,
              'name': 'C',
              'numpy_type': 'object',
              'pandas_type': 'unicode'},
             {'metadata': None,
              'name': 'A',
              'numpy_type': 'int64',
              'pandas_type': 'int64'}],
 'index_columns': ['C'],
 'pandas_version': '0.22.0.dev0+131.g63e8527d3'}
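To make the gap concrete, here is a minimal sketch that checks the metadata printed above against the columns of the original DataFrame. It uses only plain dictionaries, so it runs without fastparquet installed; `missing_from_metadata` is a hypothetical helper written for this illustration, not part of any library.

```python
# Metadata dict copied from the Out[25] output above.
pandas_metadata = {
    "columns": [
        {"metadata": None, "name": "C",
         "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A",
         "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
    "pandas_version": "0.22.0.dev0+131.g63e8527d3",
}


def missing_from_metadata(metadata, expected_columns):
    """Hypothetical helper: list expected names with no entry in 'columns'."""
    described = {col["name"] for col in metadata["columns"]}
    return [name for name in expected_columns if name not in described]


# The DataFrame had data columns A and B plus the named index C.
missing = missing_from_metadata(pandas_metadata, ["A", "B", "C"])
print(missing)  # ['B']
```

Only `B`, the column passed to `partition_on`, has no entry.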
I suppose they should be in the global metadata, but not in the individual data files. Is it acceptable for the metadata to differ in different places? You could have them in the data files only if you explicitly ignore them on load.
I would also include this in the `_common_metadata` and `_metadata` files but not in the individual files. The individual files should contain exactly the information needed to load them standalone. The `*metadata` files instead describe the schema of the whole dataset/table.
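A rough sketch of the split proposed here: the per-file metadata stays as it is, and a dataset-level copy gains an entry for each partition column. This is only an illustration under stated assumptions, not how fastparquet implements it. `dataset_level_metadata` is a hypothetical helper, and the `int64` types given for `B` are assumed for the example; a reader may well expose partition columns with a different type.

```python
import copy


def dataset_level_metadata(file_metadata, partition_columns):
    """Hypothetical helper: per-file metadata plus partition column entries.

    Returns a new dict and leaves file_metadata untouched, so the individual
    data files can keep describing only the columns they actually contain.
    """
    combined = copy.deepcopy(file_metadata)
    described = {col["name"] for col in combined["columns"]}
    for col in partition_columns:
        if col["name"] not in described:
            combined["columns"].append(dict(col))
    return combined


# Per-file metadata, as in the output at the top of this issue.
file_metadata = {
    "columns": [
        {"metadata": None, "name": "C",
         "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A",
         "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
    "pandas_version": "0.22.0.dev0+131.g63e8527d3",
}

# Assumed description of the partition column; the types are illustrative.
partition_b = {"metadata": None, "name": "B",
               "numpy_type": "int64", "pandas_type": "int64"}

dataset_metadata = dataset_level_metadata(file_metadata, [partition_b])

print([c["name"] for c in file_metadata["columns"]])     # ['C', 'A']
print([c["name"] for c in dataset_metadata["columns"]])  # ['C', 'A', 'B']
```

With this layout the `_metadata` and `_common_metadata` files would carry the second dict, while each data file keeps the first.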