[Python] Native support for UUID
Describe the enhancement requested
In Apache Iceberg we have support for the UUID type. I think it would be nice to also support this in (Py)Arrow natively instead of having an extension.
Component(s)
Python
This was brought up years ago: https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j but seems to have been dropped in favor of the extension.
Adding types requires an ML discussion and vote so it would probably be best to (re)start the discussion there.
Hey @Fokko how about making it a canonical extension? [1][2]
[1] https://arrow.apache.org/docs/format/CanonicalExtensions.html [2] https://lists.apache.org/thread/sxd5fhc42hb6svs79t3fd79gkqj83pfh
This would be good to make a canonical extension type, given there are many integration points for it:
- It is a logical type in Parquet
- It is a logical type in many databases, which may return it over Flight
- Arrow-native and Arrow-compatible engines (such as DuckDB) have UUID as a datatype and could exchange it over the C data interface.
This seems like a good candidate for prototyping how a canonical extension type spreads through the ecosystem, given how simple and ubiquitous it is.
I may look into this in about a month, if someone else doesn't beat me to it.
I agree UUID sounds like a good fit for a canonical extension type.
@wjones127 it would be awesome if you have time to implement this, because the only solution proposed on SO is to use DuckDB (for converting tables with UUID columns to Parquet).
I found a UUID type in src/arrow/testing/extension_type.h:
class ARROW_TESTING_EXPORT UuidArray : public ExtensionArray {
 public:
  using ExtensionArray::ExtensionArray;
};

class ARROW_TESTING_EXPORT UuidType : public ExtensionType {
 public:
  UuidType() : ExtensionType(fixed_size_binary(16)) {}
  std::string extension_name() const override { return "uuid"; }
  bool ExtensionEquals(const ExtensionType& other) const override;
  std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) const override;
  Result<std::shared_ptr<DataType>> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& serialized) const override;
  std::string Serialize() const override { return "uuid-serialized"; }
};
Maybe we just need to draft and make it available?
I've opened #37298 using logic from src/arrow/testing/extension_type.h and will also add a Python wrapper.
As per the canonical extension process, I should also start an ML discussion and vote.
@Fokko I'm curious, can you explain what "natively" supporting a UUID means? A UUID is just a bunch of 16 opaque bytes with no actionable contents...
(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)
@pitrou That's true, for PyIceberg it is about maintaining the type information. We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.
(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)
Well, you want to avoid utf8 if it's not needed. In Iceberg the UUID is also often used to apply bucket partitioning, which works well on fixed 16 bytes.
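To put a number on the utf8 point, a small sketch using only the standard library (the UUID value is just an example) compares the two encodings of one value:

```python
import uuid

u = uuid.UUID("871b64ef-9b7b-4fd0-9858-9c44631a1e0e")

text_form = str(u).encode("utf-8")  # hyphenated hex string stored as utf8
binary_form = u.bytes               # raw fixed-width representation

print(len(text_form), len(binary_form))  # prints "36 16"
```

So the string form takes more than twice the space per value, before counting the offsets a variable-length string column also needs.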
We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.
Unless you expect Iceberg and Arrow to support the exact same types, you should probably have a mechanism to store and restore Iceberg metadata to/from Arrow data.
can you explain what "natively" supporting a UUID means?
For me, that's support for Python's built-in uuid.UUID type when reading/writing dataframes from pandas. As already mentioned, you can convert to bytes, but during loading it won't be converted back.
Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for https://github.com/apache/arrow/pull/37298.
Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for #37298.
That sounds quite useful @shenker. You mean something like this?
import pyarrow as pa
pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(pa.list_(pa.binary(), 16))
or rather
pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(uuid())
I'm not sure how much work setting up extension types would be with cast kernels. Either way this idea would be better off in a new issue :).