[SPARK-48698][SQL] Support analyze column stats for tables with collated columns
What changes were proposed in this pull request?
The following sequence fails:
> create table t(s string collate utf8_lcase) using parquet;
> insert into t values ('A');
> analyze table t compute statistics for all columns;
[UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not supported: The ANALYZE TABLE FOR COLUMNS command does not support the type "STRING COLLATE UTF8_LCASE" of the column `s` in the table `spark_catalog`.`default`.`t`. SQLSTATE: 0A000
Users should be able to run ANALYZE (column stats computation) commands on tables that have columns with collated types.
Add support for column stats computation by:
- Updating pattern matching to include all StringType subtypes in the stats computation execution code
- Updating HyperLogLogPlusPlus to support calculating the approximate distinct count for collated strings as well; this is one of the statistics computed by the ANALYZE command
- Adding tests to check the new collated HyperLogLogPlusPlus behavior
- Adding tests to check statistics computation on collated data
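To illustrate why the distinct-count statistic has to be collation-aware, here is a minimal conceptual sketch in Python. It is not Spark's implementation: it uses an exact set instead of HyperLogLogPlusPlus, and the helper names (utf8_binary_key, utf8_lcase_key, distinct_count) are hypothetical. It assumes that UTF8_LCASE equality can be modeled by comparing lowercased strings.

```python
from typing import Callable, Iterable


def utf8_binary_key(s: str) -> str:
    # Binary collation: strings are equal only if they are identical.
    return s


def utf8_lcase_key(s: str) -> str:
    # Assumption: model UTF8_LCASE by lowercasing before comparison.
    return s.lower()


def distinct_count(values: Iterable[str], key: Callable[[str], str]) -> int:
    # Exact stand-in for an approximate sketch: count distinct collation keys.
    return len({key(v) for v in values})


values = ["A", "a", "b", "B", "c"]
print(distinct_count(values, utf8_binary_key))  # 5
print(distinct_count(values, utf8_lcase_key))   # 3
```

The same data yields a different distinct count depending on the collation, so a sketch that hashes raw bytes would presumably overestimate the count for a case-insensitive column.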
Why are the changes needed?
To properly support statistics computation for collated columns.
Does this PR introduce any user-facing change?
Yes. Computing column statistics on collated columns now succeeds instead of failing with UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE.
How was this patch tested?
Added checks to CollationSuite and CollationSQLExpressionsSuite.
Was this patch authored or co-authored using generative AI tooling?
No.