
Spark: Add read/write support for UUIDs from bytes

Open raphaelauv opened this issue 1 year ago • 9 comments

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

I can insert a string column to an iceberg UUID column thanks to https://github.com/apache/iceberg/pull/7399

df = df.withColumn("id", lit(str(uuid.uuid4())))

but I can't insert a byte column to an iceberg UUID column

df = df.withColumn("id", lit(uuid.uuid4().bytes))

thanks all

raphaelauv avatar Jul 05 '24 07:07 raphaelauv
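For context, the two representations in the report are interconvertible with the Python standard library alone. A minimal sketch (plain Python, no Spark, so it only illustrates the data shapes involved, not the Iceberg write path):

```python
import uuid

u = uuid.uuid4()

# Canonical string form -- this is what PR #7399 made writable.
s = str(u)

# Raw 16-byte big-endian form -- this is what currently fails to write.
b = u.bytes

# Both forms identify the same UUID and round-trip via the stdlib.
assert len(b) == 16
assert uuid.UUID(bytes=b) == u
assert uuid.UUID(s) == u
```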

@raphaelauv would you be interested in contributing a fix for this?

nastra avatar Jul 05 '24 08:07 nastra

hey @nastra, I do not have the time to contribute this feature right now, thanks for the offer :+1:

Until then, I'm sharing a hacky workaround :sweat_smile::

df = df.withColumn(
    "id", 
    F.regexp_replace(
        F.lower(F.hex("id")), 
        "(.{8})(.{4})(.{4})(.{4})(.{12})", 
        "$1-$2-$3-$4-$5"
    )
)

raphaelauv avatar Jul 06 '24 08:07 raphaelauv
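The Spark expression above can be checked against the stdlib in plain Python: hex-encode the 16 bytes, then insert dashes at the 8-4-4-4-12 boundaries. The helper name below is just for illustration:

```python
import re
import uuid

def uuid_bytes_to_str(raw: bytes) -> str:
    # Mirror of the Spark workaround: hex-encode (bytes.hex() is already
    # lowercase), then insert dashes at the 8-4-4-4-12 group boundaries.
    return re.sub(
        r"(.{8})(.{4})(.{4})(.{4})(.{12})",
        r"\1-\2-\3-\4-\5",
        raw.hex(),
    )

u = uuid.uuid4()
assert uuid_bytes_to_str(u.bytes) == str(u)
```

This works because uuid.uuid4().bytes is the big-endian encoding of the UUID, which is exactly the byte order the canonical string form uses.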

I can give this a shot @nastra. Although I need to read the UUID PR first.

anuragmantri avatar Jul 08 '24 03:07 anuragmantri

I walked through the code and I was also able to reproduce this issue for parquet writes with a test.

java.lang.IllegalArgumentException: Invalid UUID string: d��Iu���>�M�`
	at java.base/java.util.UUID.fromString(Unknown Source)
	at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:426)
	at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:411)
	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:581)
	at org.apache.iceberg.parquet.ParquetWriter.add(ParquetWriter.java:135)

It looks like the visitor incorrectly casts the byte array to a string because of our conversion to Spark types here. Should we do this casting correctly at a higher level than SparkParquetWriters?

@RussellSpitzer @nastra

anuragmantri avatar Aug 02 '24 16:08 anuragmantri

@anuragmantri I believe this is the correct place to do the casting. Spark itself doesn't support UUID as a type and so you can only represent it as a string when you write a UUID.

nastra avatar Aug 05 '24 07:08 nastra
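To make the failure mode concrete: the stack trace shows UUIDWriter handing the raw bytes to java.util.UUID.fromString, which only accepts the canonical string form. A fix at the writer level would amount to coercing both representations. The sketch below expresses that coercion logic in Python terms only (the helper is hypothetical, not the actual Java change):

```python
import uuid

def to_uuid(value):
    # Hypothetical coercion a writer could apply: accept either the
    # canonical string or the raw 16-byte binary form of a UUID.
    if isinstance(value, (bytes, bytearray)):
        if len(value) != 16:
            raise ValueError("UUID binary must be exactly 16 bytes")
        return uuid.UUID(bytes=bytes(value))
    # Like UUID.fromString, this raises on malformed strings.
    return uuid.UUID(value)

u = uuid.uuid4()
assert to_uuid(str(u)) == u
assert to_uuid(u.bytes) == u
```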

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Feb 02 '25 00:02 github-actions[bot]

Not stale

raphaelauv avatar Feb 02 '25 09:02 raphaelauv

@anuragmantri are you still interested in picking this one up?

RussellSpitzer avatar Feb 02 '25 11:02 RussellSpitzer

Thanks for the ping, I missed this one. I will continue working on this.

I tried to add support for UUID bytes earlier. Where I got stuck: uuid.uuid4().bytes is the UUID in binary format, so in the type conversion layer we would have to differentiate between

  1. plain binary - BinaryType in Spark
  2. a UUID represented as binary - a string in Spark

I was wondering if this can be done before TypeToSparkType.

Any thoughts on how this can be done @RussellSpitzer @nastra ?

anuragmantri avatar Feb 02 '25 23:02 anuragmantri
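One way to see why the two cases above cannot be told apart by inspecting the value: any 16 bytes parse as a valid UUID, so nothing in the binary itself marks it as one. The Iceberg column type, not the value, has to drive the conversion. A small illustration (assumptions: plain Python, random bytes standing in for a Spark BinaryType value):

```python
import os
import uuid

# Arbitrary 16 bytes -- could be plain binary data or a UUID; the
# value alone gives no way to tell.
raw = os.urandom(16)

# They always parse as a UUID and round-trip losslessly...
u = uuid.UUID(bytes=raw)
assert u.bytes == raw

# ...so only the declared Iceberg schema type (uuid vs binary) can
# decide whether to apply the UUID interpretation on write.
```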