Spark: Add read/write support for UUIDs from bytes
Apache Iceberg version
1.5.2 (latest release)
Query engine
Spark
Please describe the bug 🐞
I can insert a string column into an Iceberg UUID column thanks to https://github.com/apache/iceberg/pull/7399
df = df.withColumn("id", lit(str(uuid.uuid4())))
but I can't insert a byte column into an Iceberg UUID column
df = df.withColumn("id", lit(uuid.uuid4().bytes))
thanks all
@raphaelauv would you be interested in contributing a fix for this?
hey @nastra, I don't have the time to contribute this feature right now, thanks for the offer :+1:
until then, I'm sharing a hacky workaround :sweat_smile: :
from pyspark.sql import functions as F

# Hex-encode the 16 raw bytes and reshape them into the canonical 8-4-4-4-12 form
df = df.withColumn(
    "id",
    F.regexp_replace(
        F.lower(F.hex("id")),
        "(.{8})(.{4})(.{4})(.{4})(.{12})",
        "$1-$2-$3-$4-$5",
    ),
)
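If the raw bytes are available on the driver, the same result can be had before the literal is even created; a sketch using only the standard library:

import uuid
from pyspark.sql.functions import lit

# uuid.UUID(bytes=...) parses the 16 raw bytes; str() yields the canonical form
df = df.withColumn("id", lit(str(uuid.UUID(bytes=uuid.uuid4().bytes))))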
I can give this a shot, @nastra, although I need to read the UUID PR first.
I walked through the code and was also able to reproduce this issue for Parquet writes with a test.
java.lang.IllegalArgumentException: Invalid UUID string: d��Iu���>�M�`
at java.base/java.util.UUID.fromString(Unknown Source)
at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:426)
at org.apache.iceberg.spark.data.SparkParquetWriters$UUIDWriter.write(SparkParquetWriters.java:411)
at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:581)
at org.apache.iceberg.parquet.ParquetWriter.add(ParquetWriter.java:135)
It looks like the visitor incorrectly casts the byte array to a string because of our conversion to Spark types here. Should we do this casting correctly at a higher level than SparkParquetWriters?
@RussellSpitzer @nastra
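A rough Python analogue of what the stack trace shows (the actual Java path goes through UUID.fromString): the 16 raw bytes get treated as text, which can never parse as a canonical UUID string.

import uuid

raw = uuid.uuid4().bytes
# Interpreting arbitrary bytes as text produces mojibake like the garbled value above
garbled = raw.decode("utf-8", errors="replace")
uuid.UUID(garbled)  # raises ValueError, mirroring the IllegalArgumentException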
@anuragmantri I believe this is the correct place to do the casting. Spark itself doesn't support UUID as a type and so you can only represent it as a string when you write a UUID.
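To make the two representations concrete (plain Python, nothing Iceberg-specific):

import uuid

u = uuid.uuid4()
print(str(u))   # canonical text form, e.g. '1b4e28ba-2fa1-11d2-883f-0016d3cca427'
print(u.bytes)  # the same value as 16 raw bytes, which Spark sees as BinaryType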
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
No stale
@anuragmantri are you still interested in picking this one up?
Thanks for the ping, I missed this one. I will continue working on this.
I tried to add support for UUID bytes earlier. Where I got stuck was that uuid.uuid4().bytes is a UUID in binary format. In the type conversion layer, we would have to differentiate between:
- Binary - BinaryType in Spark
- a UUID represented as binary - StringType in Spark
I was wondering if this can be done before TypeToSparkType.
Any thoughts on how this can be done, @RussellSpitzer @nastra?
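Until that differentiation exists in the conversion layer, one way to sketch it at the DataFrame boundary (all names here are hypothetical and not Iceberg API; the caller must know which columns are UUID-typed in the Iceberg schema) is to reuse the regex workaround from above:

from pyspark.sql import functions as F
from pyspark.sql.types import BinaryType

def coerce_uuid_binaries(df, uuid_columns):
    """Rewrite BinaryType UUID columns into the canonical string form that the
    existing UUID write path already accepts. `uuid_columns` lists the columns
    that are UUID-typed in the Iceberg schema."""
    for name in uuid_columns:
        if isinstance(df.schema[name].dataType, BinaryType):
            df = df.withColumn(
                name,
                F.regexp_replace(
                    F.lower(F.hex(name)),
                    "(.{8})(.{4})(.{4})(.{4})(.{12})",
                    "$1-$2-$3-$4-$5",
                ),
            )
    return df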