Spark: Support Parquet dictionary encoded UUIDs
While fixing some issues on the PyIceberg side to fully support UUIDs (https://github.com/apache/iceberg-python/pull/2007), I noticed this issue and was surprised, since UUIDs used to work with Spark. It turns out that reading dictionary-encoded UUIDs was not implemented yet.
PyIceberg only generates small amounts of data in its tests, so Parquet never fell back to dictionary encoding and this wasn't caught previously.
Closes https://github.com/apache/iceberg/issues/4581
@DinGo4DEV Yes, we have tests for plain encoded UUIDs, let me add one for dictionary encoded UUIDs as well 👍
Is there a way to test this? Can we add a dictionary encoded UUID like this?
Just found out that the test above does not exercise this code path, since Spark projects a UUID into a String.
Test has been added, and I verified with breakpoints that it hits the newly added lines 👍
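For anyone following along, the essence of the decode path under test can be sketched in a few lines (stdlib Python for illustration; the actual fix lives in the Java vectorized reader): each row carries a dictionary id, which is resolved to a 16-byte value and rendered as the canonical UUID string that Spark exposes.

```python
# Sketch (assumption): what decoding a dictionary-encoded UUID column amounts to.
# Each row stores a small integer id; the dictionary page holds the 16-byte values.
import uuid

dictionary = [uuid.uuid4().bytes for _ in range(3)]  # values from the dictionary page
ids = [0, 2, 1, 0, 2]                                # per-row dictionary ids

# Resolve each id, then format the 16 bytes as the canonical UUID string,
# since Spark has no native UUID type and surfaces the column as String.
decoded = [str(uuid.UUID(bytes=dictionary[i])) for i in ids]
```

The previously missing piece was exactly this id-to-bytes resolution for fixed-size binary vectors; plain-encoded columns, where the bytes sit inline per row, already worked.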
Thanks @RussellSpitzer, @kevinjqliu and @dingo4dev for the review 🚀
@Fokko are you planning to port this over to Spark 4? I think it would be good to get this out with 1.10
and perhaps Spark 3.4 as well https://grep.app/search?f.repo=apache%2Ficeberg&q=UTF8String+ofRow%28FixedSizeBinaryVector