Allow Parquet column access by field_id

Open devinrsmith opened this issue 1 year ago • 1 comments

When used in combination with external schemas or catalogs (such as Iceberg, or others) where columns may be renamed, removed, and added in arbitrary combination, Parquet provides a utility to attach a field_id in the SchemaElement to support properly mapping the data:

  /** When the original schema supports field ids, this will save the
   * original field id in the parquet schema
   */
  9: optional i32 field_id;

Right now, Deephaven only supports indexing into the Parquet file via "columnName" and "path", with path being the primary key.

public interface RowGroupReader {
    /**
     * Returns the accessor to a given Column Chunk
     *
     * @param columnName the name of the column
     * @param path the full column path
     * @return the accessor to a given Column Chunk, or null if the column is not present in this Row Group
     */
    @Nullable
    ColumnChunkReader getColumnChunk(@NotNull String columnName, @NotNull List<String> path);
    // ... remaining methods elided
}

We should add support for indexing based on a field_id.

This is in support of #6118.

Relatedly, it has been noted that we walk the following hierarchy to reach path_in_schema, which we use as the primary key for accessing a column within a row group; absent other information, we also (in some situations) use the first element of that list to determine the column name.

FileMetaData ->
row_groups[rgIx] : RowGroup ->
columns[colIx] : ColumnChunk ->
meta_data: ColumnMetaData ->
path_in_schema: list<string>

This seems somewhat roundabout and fragile, as a Parquet file can actually be empty and not have any row groups, in which case there is no ColumnChunk to read path_in_schema from.

There is explicit documentation on RowGroup.columns:

  /** Metadata for each column chunk in this row group.
   * This list must have the same order as the SchemaElement list in FileMetaData.
   **/
  1: required list<ColumnChunk> columns

which means that in any context where we are dealing with a ColumnChunk, we should (or at least could) pass along the corresponding SchemaElement.

This also means we might prefer to resolve the column name from the actual Parquet schema.
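As a rough sketch of what resolving names from the schema could look like: the flattened schema list in FileMetaData is a depth-first encoding of the schema tree, so the leaf paths (and their field_ids) can be recovered without touching any row group. Everything below is illustrative: `SchemaLeaves`, `Element`, and `Leaf` are hypothetical stand-ins, not Deephaven or parquet-format classes, and the sketch assumes a leaf is an element with zero children and that the root element's name is not part of the path.

```java
import java.util.ArrayList;
import java.util.List;

final class SchemaLeaves {
    /** Hypothetical stand-in for the Thrift SchemaElement; only the fields used here. */
    record Element(String name, int numChildren, Integer fieldId) {}

    /** A leaf column: its path (root excluded) and its optional field_id (null if absent). */
    record Leaf(List<String> path, Integer fieldId) {}

    /** Walks the depth-first flattened schema and returns the leaves in schema order. */
    static List<Leaf> leaves(List<Element> schema) {
        List<Leaf> out = new ArrayList<>();
        if (schema.isEmpty()) {
            return out;
        }
        int[] cursor = {1}; // element 0 is the root; its name is not part of any path
        int rootChildren = schema.get(0).numChildren();
        for (int i = 0; i < rootChildren; i++) {
            visit(schema, cursor, new ArrayList<>(), out);
        }
        return out;
    }

    private static void visit(List<Element> schema, int[] cursor, List<String> prefix, List<Leaf> out) {
        Element e = schema.get(cursor[0]++);
        prefix.add(e.name());
        if (e.numChildren() == 0) {
            // Assumption: zero children means a primitive (leaf) column.
            out.add(new Leaf(List.copyOf(prefix), e.fieldId()));
        } else {
            for (int i = 0; i < e.numChildren(); i++) {
                visit(schema, cursor, prefix, out);
            }
        }
        prefix.remove(prefix.size() - 1);
    }
}
```

Because this only reads the schema, it would behave the same for a file with zero row groups. The position of each leaf in the returned list is the index that should line up with RowGroup.columns.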

Nested structs may need special consideration: we don't generally support them, but we want to make sure we don't break downstream users who may be reading files that contain nested structs and explicitly excluding them.

Sep 25 '24 16:09 devinrsmith

Our io.deephaven.parquet.base.RowGroupReader#getColumnChunk(java.lang.String, java.util.List<java.lang.String>) smells; we should not need to do the resolution mapping per column per row group; it should only need to happen once per column.

I.e., a given List<String> path or int fieldId should map to a specific columnIndex; that columnIndex can then be used across all row groups for that column to get the appropriate ColumnChunk (RowGroup.columns[columnIndex]).
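To make the idea concrete, here is a minimal sketch of a once-per-file resolver. `ColumnIndexResolver` and `LeafColumn` are hypothetical names for illustration only, not existing Deephaven types. It assumes the caller already has the leaf columns in schema order, and it arbitrarily keeps the first column when a field_id appears more than once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

final class ColumnIndexResolver {
    /** Hypothetical stand-in for a leaf schema entry: its path and optional field_id. */
    record LeafColumn(List<String> path, Integer fieldId) {}

    private final Map<List<String>, Integer> indexByPath = new HashMap<>();
    private final Map<Integer, Integer> indexByFieldId = new HashMap<>();

    /** Builds both lookups once; list position is the column index. */
    ColumnIndexResolver(List<LeafColumn> leavesInSchemaOrder) {
        for (int i = 0; i < leavesInSchemaOrder.size(); i++) {
            LeafColumn leaf = leavesInSchemaOrder.get(i);
            indexByPath.put(List.copyOf(leaf.path()), i);
            if (leaf.fieldId() != null) {
                // Assumption: on duplicate field_ids, the first column wins.
                indexByFieldId.putIfAbsent(leaf.fieldId(), i);
            }
        }
    }

    OptionalInt byPath(List<String> path) {
        Integer ix = indexByPath.get(path);
        return ix == null ? OptionalInt.empty() : OptionalInt.of(ix);
    }

    OptionalInt byFieldId(int fieldId) {
        Integer ix = indexByFieldId.get(fieldId);
        return ix == null ? OptionalInt.empty() : OptionalInt.of(ix);
    }

    /** Per row group, the lookup is then just positional (RowGroup.columns[columnIndex]). */
    static <C> C chunkAt(List<C> rowGroupColumns, int columnIndex) {
        return rowGroupColumns.get(columnIndex);
    }
}
```

With something like this, the path or field_id lookup happens once per column, and every row group reuses the resulting index rather than re-resolving by name.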

Sep 25 '24 18:09 devinrsmith