
Spark: Add read/write support for UUIDs from bytes

Open raphaelauv opened this issue 1 year ago • 9 comments

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

I can insert a string column to an iceberg UUID column thanks to https://github.com/apache/iceberg/pull/7399

df = df.withColumn("id", lit(str(uuid.uuid4())))

but I can't insert a byte column to an iceberg UUID column

df = df.withColumn("id", lit(uuid.uuid4().bytes))

thanks all

raphaelauv avatar Jul 05 '24 07:07 raphaelauv
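For context, the two representations in the report are interconvertible with the Python standard library alone. A minimal sketch (plain Python, no Spark, so it only illustrates the data shapes involved, not the Iceberg write path):

```python
import uuid

u = uuid.uuid4()

# Canonical string form -- this is what PR #7399 made writable.
s = str(u)

# Raw 16-byte big-endian form -- this is what currently fails to write.
b = u.bytes

# Both forms identify the same UUID and round-trip via the stdlib.
assert len(b) == 16
assert uuid.UUID(bytes=b) == u
assert uuid.UUID(s) == u
```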

@raphaelauv would you be interested in contributing a fix for this?

nastra avatar Jul 05 '24 08:07 nastra

hey @nastra, I do not have the time to contribute this feature right now, thanks for the offer :+1:

Until then, I'm sharing a hacky workaround :sweat_smile::

df = df.withColumn(
    "id", 
    F.regexp_replace(
        F.lower(F.hex("id")), 
        "(.{8})(.{4})(.{4})(.{4})(.{12})", 
        "$1-$2-$3-$4-$5"
    )
)

raphaelauv avatar Jul 06 '24 08:07 raphaelauv
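The Spark expression above can be checked against the stdlib in plain Python: hex-encode the 16 bytes, then insert dashes at the 8-4-4-4-12 boundaries. The helper name below is just for illustration:

```python
import re
import uuid

def uuid_bytes_to_str(raw: bytes) -> str:
    # Mirror of the Spark workaround: hex-encode (bytes.hex() is already
    # lowercase), then insert dashes at the 8-4-4-4-12 group boundaries.
    return re.sub(
        r"(.{8})(.{4})(.{4})(.{4})(.{12})",
        r"\1-\2-\3-\4-\5",
        raw.hex(),
    )

u = uuid.uuid4()
assert uuid_bytes_to_str(u.bytes) == str(u)
```

This works because uuid.uuid4().bytes is the big-endian encoding of the UUID, which is exactly the byte order the canonical string form uses.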

I can give this a shot @nastra. Although I need to read the UUID PR first.

anuragmantri avatar Jul 08 '24 03:07 anuragmantri

I walked through the code and I was also able to reproduce this issue for parquet writes with a test.

java.lang.IllegalArgumentException: Invalid UUID string: d��Iu���>�M�`
	at java.base/java.util.UUID.fromString(Unknown Source)
	at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:426)
	at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:411)
	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:581)
	at org.apache.iceberg.parquet.ParquetWriter.add(ParquetWriter.java:135)

It looks like the visitor incorrectly casts the byte array to a string because of our conversion to Spark types here. Should we do this casting correctly at a higher level than SparkParquetWriters?

@RussellSpitzer @nastra

anuragmantri avatar Aug 02 '24 16:08 anuragmantri

@anuragmantri I believe this is the correct place to do the casting. Spark itself doesn't support UUID as a type and so you can only represent it as a string when you write a UUID.

nastra avatar Aug 05 '24 07:08 nastra
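To make the failure mode concrete: the stack trace shows UUIDWriter handing the raw bytes to java.util.UUID.fromString, which only accepts the canonical string form. A fix at the writer level would amount to coercing both representations. The sketch below expresses that coercion logic in Python terms only (the helper is hypothetical, not the actual Java change):

```python
import uuid

def to_uuid(value):
    # Hypothetical coercion a writer could apply: accept either the
    # canonical string or the raw 16-byte binary form of a UUID.
    if isinstance(value, (bytes, bytearray)):
        if len(value) != 16:
            raise ValueError("UUID binary must be exactly 16 bytes")
        return uuid.UUID(bytes=bytes(value))
    # Like UUID.fromString, this raises on malformed strings.
    return uuid.UUID(value)

u = uuid.uuid4()
assert to_uuid(str(u)) == u
assert to_uuid(u.bytes) == u
```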

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Feb 02 '25 00:02 github-actions[bot]

Not stale

raphaelauv avatar Feb 02 '25 09:02 raphaelauv

@anuragmantri are you still interested in picking this one up?

RussellSpitzer avatar Feb 02 '25 11:02 RussellSpitzer

Thanks for the ping, I missed this one. I will continue working on this.

I tried to add support for UUID bytes earlier. Where I got stuck: uuid.uuid4().bytes is the UUID in binary format, so in the type conversion layer we would have to differentiate between

  1. plain binary - BinaryType in Spark
  2. a UUID represented as binary - a string in Spark

I was wondering if this can be done before TypeToSparkType.

Any thoughts on how this can be done @RussellSpitzer @nastra ?

anuragmantri avatar Feb 02 '25 23:02 anuragmantri
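One way to see why the two cases above cannot be told apart by inspecting the value: any 16 bytes parse as a valid UUID, so nothing in the binary itself marks it as one. The Iceberg column type, not the value, has to drive the conversion. A small illustration (assumptions: plain Python, random bytes standing in for a Spark BinaryType value):

```python
import os
import uuid

# Arbitrary 16 bytes -- could be plain binary data or a UUID; the
# value alone gives no way to tell.
raw = os.urandom(16)

# They always parse as a UUID and round-trip losslessly...
u = uuid.UUID(bytes=raw)
assert u.bytes == raw

# ...so only the declared Iceberg schema type (uuid vs binary) can
# decide whether to apply the UUID interpretation on write.
```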