parquet-format

DRAFT: Extension types

Open pitrou opened this issue 1 year ago • 1 comments

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

pitrou avatar Sep 19 '24 10:09 pitrou

Generally seems reasonable to me.

emkornfield avatar Sep 24 '24 07:09 emkornfield

What makes something appropriate as an extension type rather than a basic supported type? GEOMETRY is currently supported natively, which seems to be simply a binary blob with a metadata CRS string. Is the problem that adding such types is cumbersome?

abellgithub avatar Oct 16 '25 12:10 abellgithub

> What makes something appropriate as an extension type rather than a basic supported type? GEOMETRY is currently supported natively, which seems to be simply a binary blob with a metadata CRS string. Is the problem that adding such types is cumbersome?

Yes, mainly the fact that adding new types is cumbersome, so there is a trade-off between that cost and how useful we expect the type to be to the broader community. Other questions that come into play include whether stats are important and, if so, how well an extension type would work in that context (I forget whether this design addresses that issue). IMO, GEOMETRY might have been considered for an extension type if we had this facility.

As a data point, I believe most new types in the Arrow project have been added as extension types.

emkornfield avatar Oct 16 '25 14:10 emkornfield

Are you just providing a name for an introduced type? The examples don't show using any special handling -- IP address as FIXED_LEN_BYTE_ARRAY(16) and f64tensor as JSON -- and there is no example to help explain leaf vs. non-leaf handling. Is there some more complex vision? If so an example would be helpful.

abellgithub avatar Oct 16 '25 18:10 abellgithub

> Are you just providing a name for an introduced type? The examples don't show using any special handling -- IP address as FIXED_LEN_BYTE_ARRAY(16) and f64tensor as JSON -- and there is no example to help explain leaf vs. non-leaf handling. Is there some more complex vision? If so an example would be helpful.

I think this needs to be fleshed out some more for leaf/non-leaf handling and for possible custom statistics for aggregate types (e.g. point cloud data). I don't think f64tensor itself is supposed to be JSON, only its metadata.

emkornfield avatar Oct 23 '25 02:10 emkornfield
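
To make the two examples mentioned in the thread concrete, here is a minimal sketch of the split between storage and metadata: an IP address kept as a 16-byte fixed-length value, and a float64 tensor whose values are binary while only its metadata is JSON. This is not the draft's actual design. The extension names, the `shape` metadata key, the little-endian layout, and all function names are assumptions made up for illustration, and the code uses only the Python standard library rather than any Parquet or Arrow API.

```python
import ipaddress
import json
import struct

# Hypothetical extension names; the thread does not show the draft's naming.
IP_EXTENSION_NAME = "example.ip_address"
TENSOR_EXTENSION_NAME = "example.f64tensor"


def ip_to_storage(text: str) -> bytes:
    """Encode an IPv4 or IPv6 address as exactly 16 bytes."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # Assumption: IPv4 is stored as an IPv4-mapped IPv6 address.
        addr = ipaddress.IPv6Address(b"\x00" * 10 + b"\xff\xff" + addr.packed)
    return addr.packed


def ip_from_storage(raw: bytes) -> str:
    """Decode a 16-byte storage value back to its textual form."""
    if len(raw) != 16:
        raise ValueError("expected a 16-byte fixed-length value")
    addr = ipaddress.IPv6Address(raw)
    mapped = addr.ipv4_mapped  # None unless the value is IPv4-mapped
    return str(mapped if mapped is not None else addr)


def tensor_to_storage(values, shape):
    """Return (JSON metadata, binary data) for a float64 tensor."""
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match the number of values")
    # Only the metadata is JSON; the values are packed little-endian doubles.
    metadata = json.dumps({"shape": list(shape)})
    data = struct.pack(f"<{len(values)}d", *values)
    return metadata, data


def tensor_from_storage(metadata, data):
    """Rebuild (values, shape) from JSON metadata and binary data."""
    shape = json.loads(metadata)["shape"]
    values = list(struct.unpack(f"<{len(data) // 8}d", data))
    return values, shape
```

In this sketch a reader that does not recognize the extension name still sees plain 16-byte or binary values, which is the "just providing a name" reading raised above. Note that with this mapping an IPv6 address that is itself IPv4-mapped decodes as IPv4, and nothing here addresses leaf vs. non-leaf handling or statistics, the open points in the discussion.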