arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[Format] Add Opaque canonical extension type

Open lidavidm opened this issue 1 year ago • 30 comments

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

lidavidm avatar May 25 '24 02:05 lidavidm

Updated NA -> Null, add "vendor_name"

lidavidm avatar May 28 '24 04:05 lidavidm

Is there any desire to include any other information that the producer has available? If some future Parquet standard allows for extension types it might make sense for the Parquet reader to pass on the extension metadata somehow (or maybe that should only be considered when the feature is first requested, which might be never?)

In that case, wouldn't it just get mapped as a different extension type? I suppose I can see this being used as a generic "Parquet gave an extension type that I don't know how to deal with". I think we can add a new field at that point.

lidavidm avatar May 28 '24 23:05 lidavidm

Thoughts on using "Unknown" instead of "Other" ?

It's not entirely clear to me what the motivation is to bucket those other types together in a single arrow.other extension type, instead of just letting ADBC drivers define specific extension types for types that don't match to one of the arrow types (eg the ADBC postgres driver could annotate a binary field with adbc_driver_postgresql.polygon or adbc.postgresql.polygon when encountering a polygon type)

jorisvandenbossche avatar May 30 '24 10:05 jorisvandenbossche

The set of unknown types isn't a closed set.

lidavidm avatar May 30 '24 12:05 lidavidm

Thoughts on using "Unknown" instead of "Other" ?

I recently was looking into this concept in Parquet and apparently they use the term "undefined" for this there. I think "unknown" best communicates this concept (as in, this is not a type that is known to the implementation); however, I don't have strong feelings about the name.

It's not entirely clear to me what the motivation is to bucket those other types together in a single arrow.other extension

For something like nanoarrow this might work; however, in Arrow C++ the extension type information is aggressively dropped, so it's very possible that that a pyarrow consumer would never be able to query the extension name to get the relevant information (what the type name is and where it came from). One could argue that also maybe Arrow C++ should stop aggressively dropping extension information, but even with that I think that this extension type is a better solution than an extension name that a database might not even know that it has until it receives a query.

paleolimbot avatar May 30 '24 12:05 paleolimbot

The set of unknown types isn't a closed set.

But also this extension type is not a closed set, as you can put whatever you want in the type_name and vendor_name metadata?

For something like nanoarrow this might work; however, in Arrow C++ the extension type information is aggressively dropped

Then we should maybe consider if that behaviour in C++ is practical, otherwise this "arrow.other" type feels like a workaround for a limitation in the C++ API. In the past I had considered whether we (C++/pyarrow) should have a generic UnknownExtensionType subclass that is used for any extension type that is not recognized (i.e. for which the name is not registered). See https://github.com/apache/arrow/issues/22572 (however, doing that now with dropping of the metadata in the schema's field metadata (as we do for registered types) would be a breaking change).

jorisvandenbossche avatar May 30 '24 13:05 jorisvandenbossche

To give a concrete example, to try to clarify:

  • GDAL reads data from a file that contains a column with WKB values and returns Arrow data with a column of binary type with the "ogc.wkb" annotation
    • Should GDAL instead also use the "arrow.other" type?
  • The ADBC postgresql driver reads data from a postgres table that has a column with the polygon type, and returns Arrow data with a column of binary type with the "arrow.other" annotation (and type_name "polygon" in the metadata)
    • Why would the driver not return a custom extension type?

In both cases, a producer of Arrow data is returning data with a data type that is not natively supported. So from an Arrow implementation (like Arrow C++) point of view, both cases seem equivalent. But so why would we treat the one case differently (assuming we add explicit support for "arrow.other" extension type) than the other?

Maybe what I am after is some better guidelines for when data producers should use the "arrow.other" type vs a custom extension type name.

jorisvandenbossche avatar May 30 '24 13:05 jorisvandenbossche

But also this extension type is not a closed set, as you can put whatever you want in the type_name and vendor_name metadata?

Then you risk colliding with an actual extension type, and you aren't communicating that the intermediate system doesn't know how to interpret the type. Part of the use case is just in catalogs like Flight SQL where there isn't necessarily concrete data in play.

GDAL wouldn't use it for types it knows about.

I would argue the typical extension type is to claim support for a type that is not part of Arrow's type system with some particular semantics (UUID, or geometry, etc.). "Other" is different, it is to explicitly disclaim support while also avoiding a hard error or silent data loss. See the examples already written in the proposal.

lidavidm avatar May 30 '24 13:05 lidavidm

otherwise this "arrow.other" type feels like a workaround for a limitation in the C++ API.

I think they are considering two different scenarios: an "extension type" is perhaps something that has a definition or specification somewhere, whereas when the Postgres driver encounters a type that it didn't know about at compile time, all ADBC has is const char* typname and const char* vendor_name and it needs a way to pass that on. There is no guarantee regarding the content of the vendor name or the type name, so you might need something like adbc:::PostgreSQL:::Some.Type in the off chance one of the components contained a period.

I would also like to see pyarrow flag unknown extension types, but using that system for this use-case feels very hack-y to me.

paleolimbot avatar May 30 '24 13:05 paleolimbot

While I certainly understand the distinction of it being the intermediate system not knowing how to interpret the type (where in the example of GDAL the type is of course definitly known to GDAL), the "not knowing" part is bit fluid.

One of the examples given is the PostgreSQL polygon type, which AFAIK is a built-in postgres type. The driver not knowing about this type seems to be a choice you make in the implementation (although disclaimer I am not familiar with the actual implementation). The postgres driver could perfectly know about the polygon type if you wanted? But I assume that is not done right now because there is no matching type on the Arrow side to map it to.


Something else: since it seems this mostly comes from ADBC use cases, it would also be an option to use an "adbc.other" extension type with the described semantics? That avoids requiring a specific implementation for this on the Arrow side.

jorisvandenbossche avatar May 30 '24 14:05 jorisvandenbossche

The driver not knowing about this type seems to be a choice you make in the implementation

Probably a better example would be a user-defined type (type alias for a built-in or extension type) or a type from a loaded extension (which the ADBC driver would never have an opportunity to know). I think the example was changed to "geometry", which is from an extension (although we could hard-code support for PostGIS).

That avoids requiring a specific implementation for this on the Arrow side.

If extension type information wasn't dropped in multiple implementations this would probably work. Ensuring extension type registration in the presence of shared/static libraries and multiple languages is not trivial (as we know from considering this for geoarrow 🙂 ).

since it seems this mostly comes from ADBC use cases

I think this issue comes up at any place data is converted to Arrow from something else. In R we (I) invented the undocumented arrow.r.vctrs extension type to handle something like this (unknown type name but storage that we do know how to convert). That's an example where the ability to attach some extra payload would be helpful (but also one where an extension type is fine because it only applies to one binding).

paleolimbot avatar May 30 '24 14:05 paleolimbot

~~Do we want to try to prevent haphazard use of type_name and vendor_name?~~

~~Without some attempt to canonicalize well-known pairs of type and vendor, we could lose opportunities for interoperability. For example one developer uses PostgreSQL, another uses pgsql, another uses postgres, and so on.~~

~~Does that matter?~~ EDIT: Disregard; I now understand that this is not the point of this type.

ianmcook avatar May 31 '24 16:05 ianmcook

In that case someone should start a process to standardize an extension type.

lidavidm avatar Jun 03 '24 00:06 lidavidm

Are there any more comments here?

lidavidm avatar Jun 10 '24 01:06 lidavidm

Just a note that I was reminded of a possible use-case for this recently (reading Parquet files with a logical type that was defined in a later version of Parquet: https://github.com/apache/arrow/pull/41765

Without some attempt to canonicalize well-known pairs of type and vendor, we could lose opportunities for interoperability.

Perhaps another wording is that the vendor name is informational purposes or as an escape hatch that is more likely to enable a workaround (rather than some value an application should be relying on). For example, duckdb currently returns its extension "geometry" type as Arrow binary that many people mistakenly try to interpret as WKB. The real solution is for DuckDB to just return geoarrow extension types in this case; however, this proposal would let DuckDB return an Other extension type, which would enable a workaround (an R package could spot it and issue an informative error or do the required conversion until DuckDB can export its extension types to Arrow in this way).

paleolimbot avatar Jun 10 '24 01:06 paleolimbot

Updated wording to explicitly recommend against (1) making up extension types on the fly (2) trying to create conventions around vendor/type name here

lidavidm avatar Jun 10 '24 07:06 lidavidm

I'm working on expanding the examples and re-organizing the prose (am not gonna get it done tonight), thanks for all the comments so far.

I'll also rename it Unknown since I've come around to that being clearer at first glance than Other (also probably avoids any confusion between JDBC Other and this type since they're not quite the same either)

lidavidm avatar Jun 12 '24 14:06 lidavidm

Hmm. I'd rather keep the name "Other" as "Unknown" has proven ambiguous for Parquet (where there is a logical type named "Unknown"). @jorisvandenbossche What do you think?

pitrou avatar Jun 13 '24 08:06 pitrou

It's ambiguous in Parquet because it uses, confusingly, "Unknown" to mean "Null" (which means you can only have unknown data if there is no data, only nulls), while there is also "Undefined" for unknown logical types. But personally I don't think we should let ourselves from using the more descriptive term because of Parquet (in Arrow, we never use "unknown" in the context of the null type, so I would say there is no ambiguity here)

Personally I find "Other" rather broad, because this is not an extension type to just bucket any other type, it is meant for a very specific type of extension types that are unknown to the producer.

jorisvandenbossche avatar Jun 13 '24 11:06 jorisvandenbossche

It's ambiguous in Parquet because it uses, confusingly, "Unknown" to mean "Null" (which means you can only have unknown data if there is no data, only nulls), while there is also "Undefined" for unknown logical types. But personally I don't think we should let ourselves from using the more descriptive term because of Parquet (in Arrow, we never use "unknown" in the context of the null type, so I would say there is no ambiguity here)

Personally I find "Other" rather broad, because this is not an extension type to just bucket any other type, it is meant for a very specific type of extension types that are unknown to the producer.

Ok, let's go with "Unknown" then. Or perhaps "Opaque"?

pitrou avatar Jun 13 '24 12:06 pitrou

I'll leave it as Unknown. Are there any final comments?

I'll try to put up implementations in Java and C++/Python soon.

lidavidm avatar Jun 14 '24 01:06 lidavidm

Well there might be a concrete reason to rename the type: "UnknownExtensionType" is already taken by a random (albeit deprecated) PyArrow extension type

lidavidm avatar Jun 24 '24 06:06 lidavidm

Perhaps Micah's suggestion of arrow.unsupported would be best?

lidavidm avatar Jun 24 '24 06:06 lidavidm

I would reiterate my suggestion of arrow.opaque :-)

pitrou avatar Jun 24 '24 14:06 pitrou

Semi-serious suggestion: arrow.unknowable. (see Ignoramus et ignorabimus).

rok avatar Jun 24 '24 16:06 rok

Slight preference for opaque. Both unsupported and unknowable slightly imply that "this can never be supported" or "this can never be known".

westonpace avatar Jun 24 '24 17:06 westonpace

How about unfamiliar?

rok avatar Jun 24 '24 17:06 rok

I've gone with Opaque, I'm still working on the C++/Python impl here but I've renamed/updated the Java impl

lidavidm avatar Jun 27 '24 07:06 lidavidm

Added C++/Python.

I had to adjust cast_null to allow casting extension types with null backing arrays to null.

lidavidm avatar Jun 28 '24 09:06 lidavidm

C++/Java/Python implementations are ready now

lidavidm avatar Jul 02 '24 05:07 lidavidm