deephaven-core

feat: Allow parquet column access by field_id

Open devinrsmith opened this issue 1 year ago • 3 comments

This allows the resolution of a Parquet column by field_id instead of by its "path". This is a lower-level option that will not typically be used by end users; as such, it has not been plumbed through to Python. This feature will be used in follow-up PRs in combination with Iceberg's field-ids to improve column mappings.
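To make the difference between the two lookups concrete, here is a minimal, library-free sketch. It is not the API added in this PR: `ColumnDescriptor`, `resolve_by_path`, and `resolve_by_field_id` are hypothetical names used only for illustration, and the rename scenario is an assumed example of why an id-based mapping can be useful.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ColumnDescriptor:
    """Hypothetical stand-in for one leaf column in a Parquet file schema."""
    path: str                 # column path as stored in the file
    field_id: Optional[int]   # field_id from the file schema; None if unset


def resolve_by_path(
    columns: Sequence[ColumnDescriptor], path: str
) -> Optional[ColumnDescriptor]:
    """Return the first column whose path matches exactly, or None."""
    for column in columns:
        if column.path == path:
            return column
    return None


def resolve_by_field_id(
    columns: Sequence[ColumnDescriptor], field_id: int
) -> Optional[ColumnDescriptor]:
    """Return the first column carrying the given field_id, or None.

    Assumes field ids are unique within the file; this sketch does not
    check that assumption.
    """
    for column in columns:
        if column.field_id == field_id:
            return column
    return None


# Assumed scenario: the file was written when the column was named "cust_id";
# the table definition has since renamed it to "customer_id" but kept id 1.
file_columns = [
    ColumnDescriptor(path="cust_id", field_id=1),
    ColumnDescriptor(path="amount", field_id=2),
]

by_path = resolve_by_path(file_columns, "customer_id")  # no match after rename
by_id = resolve_by_field_id(file_columns, 1)            # still finds "cust_id"
```

In this sketch the path lookup misses the renamed column while the id lookup still finds it, which is the kind of mapping the Iceberg follow-up work is expected to rely on.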

Writing support has also been added.

Fixes #6128

devinrsmith avatar Sep 30 '24 22:09 devinrsmith

Do verify the nightlies pass before merging.

malhotrashivam avatar Sep 30 '24 22:09 malhotrashivam

> Do verify the nightlies pass before merging.

Verified.

devinrsmith avatar Oct 01 '24 14:10 devinrsmith

> I couldn't find any resources to confirm this, but having two columns with the same field ID feels incorrect to me. For example, if we get a field ID from Iceberg, it would expect a single column, right?

Iceberg probably mandates the uniqueness of field-ids.

Parquet doesn't have any mandates with respect to that. Even the column names aren't guaranteed to be unique. I need to track down the reference I found earlier saying that the Parquet format "strongly recommends" unique column names, but it's not a guarantee.
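Since uniqueness is not enforced by the format, a reader that resolves by field id has to decide what to do when an id appears more than once. The snippet below is only a sketch of how such ambiguity could be detected up front; `duplicate_field_ids` and the `(path, field_id)` tuple layout are hypothetical and are not what this PR implements.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def duplicate_field_ids(
    columns: Iterable[Tuple[str, Optional[int]]]
) -> Dict[int, List[str]]:
    """Map each field id used by more than one column to those column paths.

    `columns` is an iterable of (path, field_id) pairs; columns whose
    field_id is None are ignored because they cannot be resolved by id.
    """
    paths_by_id: Dict[int, List[str]] = defaultdict(list)
    for path, field_id in columns:
        if field_id is not None:
            paths_by_id[field_id].append(path)
    # Keep only the ids that are ambiguous.
    return {fid: paths for fid, paths in paths_by_id.items() if len(paths) > 1}


# Made-up schema in which two columns share field id 1.
schema = [("a", 1), ("b", 1), ("c", 2), ("d", None)]
ambiguous = duplicate_field_ids(schema)
if ambiguous:
    print(f"ambiguous field ids: {ambiguous}")
```

A caller could use a check like this to fail fast, or to fall back to path-based resolution for the affected columns.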

devinrsmith avatar Oct 01 '24 16:10 devinrsmith

There is going to be a more general follow-up to this that allows for custom logic.

devinrsmith avatar Jan 13 '25 18:01 devinrsmith