Parquet statistics missing when reading `Utf8` as `Utf8View`
Part of https://github.com/apache/datafusion/issues/11752
Describe the bug
One of the last remaining issues causing test failures when we enable reading StringView by default in https://github.com/apache/datafusion/pull/12092 is as follows:
failures:
datasource::file_format::parquet::tests::fetch_metadata_with_size_hint
datasource::file_format::parquet::tests::read_alltypes_plain_parquet
datasource::file_format::parquet::tests::read_binary_alltypes_plain_parquet
datasource::file_format::parquet::tests::read_merged_batches
datasource::file_format::parquet::tests::test_statistics_from_parquet_metadata
To Reproduce
Check out https://github.com/apache/datafusion/pull/12092 and then run:
cargo test -p datafusion --lib -- file_format::parquet
Expected behavior
The tests should pass
Additional context
The problem is that the table schema is configured to be `Utf8View` but the file schema uses `Utf8` (so the stats are returned as `Utf8`), and the accumulators can't deal with updating a `Utf8View` value from a `Utf8` one.
@XiangpengHao solved this issue in https://github.com/apache/datafusion/pull/11862#discussion_r1727710645 by threading the parameter through and then casting the file schema appropriately.
The code isn't great to start with and adding a new parameter makes it worse.
I also think there are some bugs lurking there that we might be able to find and fix if the code were more testable.
take
I unassigned myself because I'm not very familiar with this topic (StringView). I'll keep digging into the issue, but if anyone has an idea for a solution, feel free to take over.
Hi @alamb, I just quickly drafted a PR for this issue: https://github.com/apache/datafusion/pull/12198. Instead of passing the config, I think we can get the same information from the table schema. I’m trying to coerce a `Utf8` field in the file schema to `Utf8View` if the corresponding field in the table schema is `Utf8View`. Does that make sense?
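To make the idea concrete, here is a rough, self-contained sketch of that kind of coercion. It is not the code from the PR: `DataType` and `Field` below are simplified stand-ins for the arrow types, schemas are modeled as plain slices of fields, and `coerce_utf8_fields_to_view` is a hypothetical helper name chosen for illustration.

```rust
// Simplified stand-ins for arrow's `DataType` and `Field`.
// These are hypothetical illustration types, not the arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    Utf8View,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

impl Field {
    fn new(name: &str, data_type: DataType) -> Self {
        Field {
            name: name.to_string(),
            data_type,
        }
    }
}

/// Hypothetical helper: returns a copy of `file_schema` in which every
/// `Utf8` field whose same-named field in `table_schema` is `Utf8View`
/// has been changed to `Utf8View`. All other fields are copied unchanged,
/// including fields that do not appear in the table schema at all.
fn coerce_utf8_fields_to_view(table_schema: &[Field], file_schema: &[Field]) -> Vec<Field> {
    file_schema
        .iter()
        .map(|file_field| {
            // Look up the table-side type by field name.
            let table_type = table_schema
                .iter()
                .find(|f| f.name == file_field.name)
                .map(|f| &f.data_type);

            match (table_type, &file_field.data_type) {
                // Only the Utf8 -> Utf8View case is coerced.
                (Some(DataType::Utf8View), DataType::Utf8) => {
                    Field::new(&file_field.name, DataType::Utf8View)
                }
                _ => file_field.clone(),
            }
        })
        .collect()
}
```

With a table schema of `[id: Int32, name: Utf8View]` and a file schema of `[id: Int32, name: Utf8]`, the helper would return a file schema whose `name` field is `Utf8View`, so statistics read against it would have the type the accumulators expect. The point of the design is that the decision is driven purely by the two schemas, so no extra config parameter has to be threaded through.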
Have an alternative solution, done in the process of fixing https://github.com/apache/datafusion/issues/12119. PR up shortly.
take