Spark: Support Parquet dictionary encoded UUIDs

Open Fokko opened this issue 6 months ago • 2 comments

While fixing some issues on the PyIceberg end to fully support UUIDs (https://github.com/apache/iceberg-python/pull/2007), I noticed this issue. I was surprised, since UUIDs used to work with Spark, but it turns out that support for dictionary-encoded UUIDs was not implemented yet.

In PyIceberg we only generate small amounts of data, so this wasn't caught previously.
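For readers unfamiliar with the encoding, here is a rough conceptual sketch of what "dictionary encoded UUIDs" means. It is not the Iceberg or Spark code path: the helper names `dictionary_encode` and `decode_to_strings` are made up for illustration. It assumes only that a UUID is stored as a 16-byte fixed-length value, and that a dictionary-encoded column keeps each distinct value once and refers to it by index.

```python
import uuid


def dictionary_encode(values):
    """Split a column into (distinct values, per-row indices).

    Hypothetical helper for illustration only; real Parquet writers
    do this per column chunk with their own page layout.
    """
    dictionary = []   # distinct 16-byte values, in first-seen order
    index_of = {}     # value -> position in the dictionary
    indices = []      # one dictionary index per row
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def decode_to_strings(dictionary, indices):
    """Resolve each index through the dictionary and render the
    16 bytes as the canonical UUID string (hypothetical helper)."""
    return [str(uuid.UUID(bytes=dictionary[i])) for i in indices]


a = uuid.UUID("00000000-0000-0000-0000-000000000001")
b = uuid.UUID("00000000-0000-0000-0000-000000000002")

# A column with repeated values, stored as raw 16-byte UUIDs.
column = [a.bytes, b.bytes, a.bytes, a.bytes]

dictionary, indices = dictionary_encode(column)
decoded = decode_to_strings(dictionary, indices)
```

A reader that only handles the plain layout (one 16-byte value per row) never performs the index lookup in `decode_to_strings`, which is the kind of gap described above.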

Closes https://github.com/apache/iceberg/issues/4581

Fokko avatar Jun 16 '25 11:06 Fokko

@DinGo4DEV Yes, we have tests for plain encoded UUIDs, let me add one for dictionary encoded UUIDs as well 👍

Fokko avatar Jun 18 '25 05:06 Fokko

Is there a way to test this? Can we add a dictionary encoded UUID like this?

Just found out that the test above is not testing this code path, since Spark projects a UUID into a String.

A test has been added, and I checked with breakpoints that it hits the newly added lines 👍

Fokko avatar Jun 18 '25 08:06 Fokko

Thanks @RussellSpitzer, @kevinjqliu and @dingo4dev for the review 🚀

Fokko avatar Jul 09 '25 22:07 Fokko

@Fokko are you planning to port this over to Spark 4? I think it would be good to get this out with 1.10

nastra avatar Jul 10 '25 05:07 nastra

And perhaps Spark 3.4 as well: https://grep.app/search?f.repo=apache%2Ficeberg&q=UTF8String+ofRow%28FixedSizeBinaryVector

kevinjqliu avatar Jul 11 '25 01:07 kevinjqliu