parquet-format
DRAFT: Extension types
Rationale for this change
What changes are included in this PR?
Do these changes have PoC implementations?
Generally seems reasonable to me.
What makes something appropriate as an extension type rather than a basic supported type? GEOMETRY is currently supported natively, which seems to be simply a binary blob with a metadata CRS string. Is the problem that adding such types is cumbersome?
Yes, mainly that adding new types is cumbersome, so the cost has to be weighed against how useful we expect the type to be to the broader community. Other questions that come into play are whether stats are important and how well an extension type would work in that context (I forget if this design addresses that issue). IMO, GEOMETRY might have been considered for an extension type if we had had this facility.
As a data point, I believe most new types in the Arrow project have been added as extension types.
Are you just providing a name for an introduced type? The examples don't show using any special handling -- IP address as FIXED_LEN_BYTE_ARRAY(16) and f64tensor as JSON -- and there is no example to help explain leaf vs. non-leaf handling. Is there some more complex vision? If so an example would be helpful.
I think this needs to be fleshed out some more for leaf/non-leaf handling and possible custom statistics for aggregate types (e.g. point cloud data). I don't think f64tensor itself is supposed to be JSON, just its metadata.
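To make the two examples under discussion concrete, here is a rough Python sketch of how I read them: an extension type is a name plus an opaque metadata string layered over an ordinary physical type. Everything in it is an assumption for illustration and not taken from the PR: the `ExtensionAnnotation` class, the `example.*` names, the choice to store IPv4 addresses as IPv4-mapped IPv6, the `BYTE_ARRAY` storage for the tensor, and the `shape` key in the JSON metadata.

```python
import ipaddress
import json
import math
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtensionAnnotation:
    """Hypothetical annotation: a name and opaque metadata over a physical type."""
    name: str            # illustrative extension name
    physical_type: str   # Parquet physical type that actually stores the values
    metadata: str = ""   # opaque to readers that do not know the extension


# IP address stored as FIXED_LEN_BYTE_ARRAY(16), no metadata needed.
IP = ExtensionAnnotation("example.ipaddress", "FIXED_LEN_BYTE_ARRAY(16)")

# Tensor of float64 values; only the metadata is JSON, the values are raw bytes.
TENSOR = ExtensionAnnotation(
    "example.f64tensor", "BYTE_ARRAY", json.dumps({"shape": [2, 2]})
)


def encode_ip(text: str) -> bytes:
    """Return 16 bytes; IPv4 is stored as IPv4-mapped IPv6 (an assumption)."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        addr = ipaddress.IPv6Address(f"::ffff:{addr}")
    return addr.packed


def decode_ip(raw: bytes):
    """Inverse of encode_ip: recover an IPv4Address or IPv6Address."""
    addr = ipaddress.IPv6Address(raw)
    return addr.ipv4_mapped or addr


def encode_tensor(values, annotation: ExtensionAnnotation) -> bytes:
    """Pack values as little-endian float64, checked against the metadata shape."""
    count = math.prod(json.loads(annotation.metadata)["shape"])
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    return struct.pack(f"<{count}d", *values)


def decode_tensor(raw: bytes, annotation: ExtensionAnnotation):
    """Unpack the flat list of float64 values described by the metadata shape."""
    count = math.prod(json.loads(annotation.metadata)["shape"])
    return list(struct.unpack(f"<{count}d", raw))
```

A reader that does not recognize the extension name would still see valid 16-byte or variable-length binary values, which is presumably the point of layering on a physical type. The sketch says nothing about non-leaf types or custom statistics, which are the open questions raised above.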