
[Python] Native support for UUID

Open Fokko opened this issue 3 years ago • 13 comments

Describe the enhancement requested

In Apache Iceberg we have support for the UUID type. I think it would be nice to also support this in (Py)Arrow natively instead of having an extension.

Component(s)

Python

Fokko avatar Dec 21 '22 11:12 Fokko

This was brought up years ago: https://lists.apache.org/thread/k2zvgoq62dyqmw3mj2t6ozfzhzkjkc4j but seems to have been dropped in favor of the extension.

Adding types requires an ML discussion and vote so it would probably be best to (re)start the discussion there.

assignUser avatar Dec 21 '22 14:12 assignUser

Hey @Fokko how about making it a canonical extension? [1][2]

[1] https://arrow.apache.org/docs/format/CanonicalExtensions.html [2] https://lists.apache.org/thread/sxd5fhc42hb6svs79t3fd79gkqj83pfh

rok avatar Dec 21 '22 14:12 rok

This would be good to make a canonical extension type, given there are many integration points for it:

  • It is a logical type in Parquet
  • It is a logical type in many databases, which may return it in Flight
  • Arrow-native and Arrow-compatible engines (such as DuckDB) have UUID as a datatype and could exchange that over the C data interface.

This seems like a good candidate to prototype spreading a canonical extension type through the ecosystem, given how simple and ubiquitous it is.

I may look into this in about a month, if someone else doesn't beat me to it.

wjones127 avatar Jun 09 '23 02:06 wjones127

I agree UUID sounds like a good fit for a canonical extension type.

jorisvandenbossche avatar Jun 13 '23 12:06 jorisvandenbossche

@wjones127 it would be awesome if you have time to implement this, because the only solution proposed on SO is to use DuckDB (to convert tables with UUID columns to Parquet)

arogozhnikov avatar Aug 22 '23 06:08 arogozhnikov

I found a UUID type in src/arrow/testing/extension_type.h:

class ARROW_TESTING_EXPORT UuidArray : public ExtensionArray {
 public:
  using ExtensionArray::ExtensionArray;
};

class ARROW_TESTING_EXPORT UuidType : public ExtensionType {
 public:
  UuidType() : ExtensionType(fixed_size_binary(16)) {}

  std::string extension_name() const override { return "uuid"; }

  bool ExtensionEquals(const ExtensionType& other) const override;

  std::shared_ptr<Array> MakeArray(std::shared_ptr<ArrayData> data) const override;

  Result<std::shared_ptr<DataType>> Deserialize(
      std::shared_ptr<DataType> storage_type,
      const std::string& serialized) const override;

  std::string Serialize() const override { return "uuid-serialized"; }
};

Maybe we just need to draft and make it available?

mapleFU avatar Aug 22 '23 06:08 mapleFU

I've opened #37298 using logic from src/arrow/testing/extension_type.h and will also add a Python wrapper. As per the canonical extension process, I should also start an ML discussion and vote.

rok avatar Aug 22 '23 09:08 rok

@Fokko I'm curious, can you explain what "natively" supporting a UUID means? A UUID is just a bunch of 16 opaque bytes with no actionable contents...

(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)

pitrou avatar Aug 22 '23 09:08 pitrou

@pitrou That's true, for PyIceberg it is about maintaining the type information. We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.

(in general, people seem to think that UUID gives them something better than pure random strings, for whatever cargo-culted reason)

Well, you want to avoid utf8 if it's not needed. In Iceberg the UUID is also often used to apply bucket partitioning, which works well on fixed 16 bytes.

Fokko avatar Aug 22 '23 09:08 Fokko

We have converters to go from and to Iceberg ⇔ PyArrow schemas. Going from an Iceberg UUID to a fixed[16] in Arrow works, but we cannot go back.

Unless you expect Iceberg and Arrow to support the exact same types, you should probably have a mechanism to store and restore Iceberg metadata to/from Arrow data.

pitrou avatar Aug 22 '23 09:08 pitrou

can you explain what "natively" supporting a UUID means?

For me, that means support for Python's built-in uuid.UUID type when reading/writing pandas DataFrames. As already mentioned, you can convert to bytes, but on loading the values won't be converted back.

arogozhnikov avatar Aug 22 '23 09:08 arogozhnikov

Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for https://github.com/apache/arrow/pull/37298.

shenker avatar Dec 11 '23 01:12 shenker

Looking forward to this. Would be great if there were a way (through casting from string or a compute function pyarrow.compute.parse_uuid) to convert string arrays (containing ASCII strings like 871b64ef-9b7b-4fd0-9858-9c44631a1e0e) to UUID arrays and vice-versa (without going through python). Not sure if that's planned for #37298.

That sounds quite useful @shenker. You mean something like this?

import pyarrow as pa
pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(pa.list_(pa.binary(), 16))

or rather

pa.array(["0"*16], pa.list_(pa.string(), 16)).cast(uuid())

I'm not sure how much work setting up extension types would be with cast kernels. Either way this idea would be better off in a new issue :).
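
Until something like that exists, a stopgap is to convert with the standard library's uuid module. Note that this goes through Python, which is exactly what the request above wants to avoid; the helper names parse_uuid_strings and format_uuid_bytes are hypothetical.

```python
import uuid


def parse_uuid_strings(values):
    """Convert hyphenated UUID strings to 16-byte values; None passes through."""
    return [None if v is None else uuid.UUID(v).bytes for v in values]


def format_uuid_bytes(values):
    """Convert 16-byte values back to lowercase hyphenated UUID strings."""
    return [None if v is None else str(uuid.UUID(bytes=v)) for v in values]


strings = ["871b64ef-9b7b-4fd0-9858-9c44631a1e0e", None]
raw = parse_uuid_strings(strings)
round_tripped = format_uuid_bytes(raw)
```

The resulting list can then be handed to pa.array(raw, type=pa.binary(16)) to get a fixed-size binary array.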

rok avatar Dec 11 '23 16:12 rok