iceberg Spark: Support Parquet dictionary encoded UUIDs

While fixing some issues on the PyIceberg end to fully support UUIDs (https://github.com/apache/iceberg-python/pull/2007),

I noticed this issue and was surprised, since UUIDs used to work with Spark. It turns out that support for dictionary-encoded UUIDs was not implemented yet.

For PyIceberg we only generate a small amount of data, so this wasn't caught previously.
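For reviewers less familiar with the encoding, here is a minimal sketch of what resolving a dictionary-encoded UUID column involves. It is not the code in this PR: the class `DictionaryUuidSketch` and its methods are hypothetical, and the dictionary is modelled as a plain `byte[][]` rather than Parquet or Arrow structures. It assumes each dictionary entry is a 16-byte big-endian value (the Parquet UUID layout) and that the reader wants the string form, since Spark surfaces UUIDs as strings.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical illustration only; not part of Iceberg's reader code.
public final class DictionaryUuidSketch {

  private DictionaryUuidSketch() {}

  // Serialize a UUID as 16 big-endian bytes (ByteBuffer is big-endian by default).
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  // Rebuild a UUID from its 16-byte big-endian representation.
  static UUID fromBytes(byte[] bytes) {
    if (bytes.length != 16) {
      throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong());
  }

  // Resolve each dictionary id to its entry and convert it to a UUID string.
  // In a plain-encoded column the 16 bytes are stored per row; here each row
  // only stores an index into the shared dictionary.
  static String[] decode(byte[][] dictionary, int[] ids) {
    String[] out = new String[ids.length];
    for (int i = 0; i < ids.length; i++) {
      out[i] = fromBytes(dictionary[ids[i]]).toString();
    }
    return out;
  }

  public static void main(String[] args) {
    UUID a = UUID.randomUUID();
    UUID b = UUID.randomUUID();
    byte[][] dictionary = {toBytes(a), toBytes(b)};
    int[] ids = {0, 1, 1, 0}; // four rows, two distinct values

    for (String value : decode(dictionary, ids)) {
      System.out.println(value);
    }
  }
}
```

The point of the sketch is only that the dictionary path needs its own bytes-to-UUID conversion; handling the plain-encoded path alone is not enough.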

Closes https://github.com/apache/iceberg/issues/4581

Jun 16 '25 11:06 Fokko

@DinGo4DEV Yes, we have tests for plain encoded UUIDs, let me add one for dictionary encoded UUIDs as well 👍

Jun 18 '25 05:06 Fokko

Is there a way to test this? Can we add a dictionary encoded UUID like this?

Just found out that the test above is not testing this code path, since Spark projects a UUID into a String.

A test has been added, and I verified with breakpoints that it hits the newly added lines 👍

Jun 18 '25 08:06 Fokko

Thanks @RussellSpitzer, @kevinjqliu and @dingo4dev for the review 🚀

Jul 09 '25 22:07 Fokko

@Fokko are you planning to port this over to Spark 4? I think it would be good to get this out with 1.10

Jul 10 '25 05:07 nastra

And perhaps Spark 3.4 as well: https://grep.app/search?f.repo=apache%2Ficeberg&q=UTF8String+ofRow%28FixedSizeBinaryVector

Jul 11 '25 01:07 kevinjqliu