spark [SPARK-48698][SQL] Support analyze column stats for tables with collated columns

Open nikolamand-db opened this issue 8 months ago • 0 comments

What changes were proposed in this pull request?

The following sequence currently fails:

> create table t(s string collate utf8_lcase) using parquet;
> insert into t values ('A');
> analyze table t compute statistics for all columns;
[UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not supported: The ANALYZE TABLE FOR COLUMNS command does not support the type "STRING COLLATE UTF8_LCASE" of the column `s` in the table `spark_catalog`.`default`.`t`. SQLSTATE: 0A000

Users should be able to run ANALYZE (column stats computation) commands on tables that have columns of a collated type.

This PR adds support for column stats computation by:

Updating the pattern matching in the stats computation execution code to include all StringType subtypes
Updating HyperLogLogPlusPlus to support calculating the approximate distinct count for collated strings as well; this is one of the statistics computed by the ANALYZE command
Adding tests to check the new collated HyperLogLogPlusPlus behavior
Adding tests to check statistics computation on collated data
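
The two code changes listed above can be illustrated with a small sketch. This is not the Spark implementation: `StringType` below is a hypothetical stand-in class, `str.lower()` is assumed as a rough approximation of the UTF8_LCASE collation key, and an exact set replaces the HyperLogLog++ sketch. It only shows why the type check has to accept every collation and why the distinct count has to be taken over collation keys rather than raw strings.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class StringType:
    """Hypothetical stand-in for a string column type that carries a collation."""
    collation: str = "UTF8_BINARY"


def supports_column_stats(data_type: object) -> bool:
    # Match on the type class instead of comparing against the default
    # (binary) string type, so strings of any collation are accepted.
    return isinstance(data_type, StringType)


def collation_key(value: str, collation: str) -> str:
    # Assumption: lowercasing approximates the UTF8_LCASE collation key.
    # Any other collation name is treated as binary in this sketch.
    if collation == "UTF8_LCASE":
        return value.lower()
    return value


def distinct_count(values: Iterable[str], collation: str) -> int:
    # An exact set stands in for the approximate sketch; the point is that
    # the collation key, not the raw string, is what gets counted.
    return len({collation_key(v, collation) for v in values})
```

With this sketch, `distinct_count(["A", "a", "b"], "UTF8_LCASE")` returns 2, because 'A' and 'a' compare equal under a case-insensitive collation, while the same input under `"UTF8_BINARY"` returns 3.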

Why are the changes needed?

To properly support statistics computation for collated columns.

Does this PR introduce any user-facing change?

Yes. ANALYZE TABLE ... COMPUTE STATISTICS FOR COLUMNS now computes statistics for collated string columns instead of failing with UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE.

How was this patch tested?

Added checks to CollationSuite and CollationSQLExpressionsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

Jun 24 '24 14:06 nikolamand-db
