Provide callback to allow user-defined key-value metadata merging strategy
When merging footers, Parquet doesn't know how to merge conflicting user-defined key-value metadata entries and simply throws an exception. It would be better to provide callbacks that let users define their own metadata merging strategies, along the lines of the sketch below.
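A minimal sketch of what such a callback could look like; the interface name and signature here are hypothetical, not part of parquet-mr:

```java
import java.util.Set;

// Hypothetical callback for resolving conflicting values of the same
// key-value metadata key across footers (all names are illustrative).
public interface KeyValueMetadataMergeStrategy {
  /**
   * @param key    the metadata key that has conflicting values
   * @param values the distinct values observed across the merged footers
   * @return the single value to record in the merged metadata
   */
  String merge(String key, Set<String> values);
}
```

A schema-aware framework could then plug in its own resolution, e.g. computing the union of two serialized schemas instead of throwing.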
For example, in Spark SQL, we store our own schema information in Parquet files as key-value metadata (similar to parquet-avro). While trying to add schema merging support for reading Parquet files with different but compatible schemas, InitContext.getMergedKeyValueMetaData throws an exception because different Spark SQL schemas are stored in different Parquet data files. As a result, we have to override ParquetInputFormat and merge the schemas inside getSplits, which is hacky and inconvenient.
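For reference, InitContext already exposes the raw, unmerged metadata via getKeyValueMetadata(), which returns every distinct value per key without throwing. A ReadSupport could reconcile conflicting entries itself along these lines; this is only a sketch, and MergingReadSupport, mergeSchemas, and the metadata key shown are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.schema.MessageType;

// Sketch of a ReadSupport that sidesteps getMergedKeyValueMetaData() by
// reading the raw per-key value sets and reconciling them itself.
public abstract class MergingReadSupport<T> extends ReadSupport<T> {
  // Assumed key under which the framework stores its serialized schema.
  private static final String SCHEMA_KEY = "org.apache.spark.sql.parquet.row.metadata";

  @Override
  public ReadContext init(InitContext context) {
    // getKeyValueMetadata() returns every distinct value seen for each key,
    // so conflicting entries do not trigger an exception here.
    Map<String, Set<String>> metadata = context.getKeyValueMetadata();
    Set<String> schemas =
        metadata.getOrDefault(SCHEMA_KEY, Collections.<String>emptySet());
    MessageType requested = mergeSchemas(schemas, context.getFileSchema());
    return new ReadContext(requested);
  }

  // Hypothetical hook: framework-specific schema reconciliation logic.
  protected abstract MessageType mergeSchemas(
      Set<String> serializedSchemas, MessageType fileSchema);
}
```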
Reporter: Cheng Lian / @liancheng
Related issues:
- Release parquet-mr 1.6.0 (is blocked by)
Note: This issue was originally created as PARQUET-194. Please see the migration documentation for further details.
Ryan Blue / @rdblue: This should go away because we shouldn't need to merge metadata. PARQUET-139 updates ParquetInputFormat so that the ReadContext is initialized for each file independently on the task side.
Ryan Blue / @rdblue: Linking to 1.6.0 release issue, which makes this no longer needed.
Cheng Lian / @liancheng: Thanks @rdblue, I'm closing this.
Cheng Lian / @liancheng:
@rdblue After thinking about this some more, I believe providing such a callback is still useful, because we also need to merge user-defined key-value metadata in ParquetOutputCommitter when writing summary files. Currently, if any conflicting user-defined key-value pair is found, Parquet simply throws an exception and gives up writing the summary file.
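Concretely, the summary-file path could route conflicts through the same kind of strategy instead of aborting; a hedged sketch, reusing the hypothetical KeyValueMetadataMergeStrategy interface from the description above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of how footer merging could consult a
// user-supplied strategy instead of throwing on conflicting values.
public final class KeyValueMetadataMerger {
  private KeyValueMetadataMerger() {}

  public static Map<String, String> merge(
      Map<String, Set<String>> collected, KeyValueMetadataMergeStrategy strategy) {
    Map<String, String> merged = new HashMap<String, String>();
    for (Map.Entry<String, Set<String>> entry : collected.entrySet()) {
      Set<String> values = entry.getValue();
      if (values.size() == 1) {
        // No conflict: keep the single value, as today.
        merged.put(entry.getKey(), values.iterator().next());
      } else {
        // Conflict: delegate to the user-defined strategy rather than
        // giving up on the whole summary file.
        merged.put(entry.getKey(), strategy.merge(entry.getKey(), values));
      }
    }
    return merged;
  }
}
```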
I noticed that in PARQUET-179, the Parquet developers initially tried to retire both _metadata and _common_metadata, but later decided to keep _common_metadata for frameworks and tools like Pig, where no centralized schema management service is available.
Ryan Blue / @rdblue:
True, we decided we needed to keep the code that produces _common_metadata around for a while longer. That isn't to say I think it should be used, except in some specific cases. The problem is that the information in that file can easily become incorrect through normal operations on a folder of Parquet data, such as adding new files. A better option is to track the metadata elsewhere, for example in the Hive metastore or with a dataset management library like Kite. Parquet doesn't manage entire datasets of Parquet files, and I'm not sure that it should be expected to.