
[SPARK-48698][SQL] Support analyze column stats for tables with collated columns

Open nikolamand-db opened this issue 8 months ago • 0 comments

What changes were proposed in this pull request?

The following sequence currently fails:

> create table t(s string collate utf8_lcase) using parquet;
> insert into t values ('A');
> analyze table t compute statistics for all columns;
[UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not supported: The ANALYZE TABLE FOR COLUMNS command does not support the type "STRING COLLATE UTF8_LCASE" of the column `s` in the table `spark_catalog`.`default`.`t`. SQLSTATE: 0A000

Users should be able to run ANALYZE commands (column statistics computation) on tables that have columns with a collated string type.

This PR adds support for column stats computation by:

  • Updating pattern matching in the stats computation execution code to include all StringType subtypes, not only the default one
  • Updating HyperLogLogPlusPlus to support calculating the approximate distinct count for collated strings as well; this is one of the statistics computed by the ANALYZE command
  • Adding tests to check the new collated HyperLogLogPlusPlus behavior
  • Adding tests to check statistics computation on collated data
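
The HyperLogLogPlusPlus change is easiest to see with a toy model: a distinct-count estimator should hash a collation-aware key instead of the raw string bytes, so that values that are equal under the collation land in the same bucket. The following is a minimal sketch in Python, not Spark's implementation. The helper names `collation_key`, `hash64`, and `distinct_hashes` are hypothetical, lowercasing is only an assumed approximation of UTF8_LCASE equality, and an exact set of hashes stands in for the actual HyperLogLog++ sketch.

```python
import hashlib


def collation_key(value: str, collation: str) -> bytes:
    """Hypothetical helper: return bytes that should be equal exactly when
    two strings compare equal under the given collation."""
    if collation == "UTF8_BINARY":
        # Binary collation: compare the raw UTF-8 bytes.
        return value.encode("utf-8")
    if collation == "UTF8_LCASE":
        # Assumption: lowercasing approximates UTF8_LCASE equality.
        return value.lower().encode("utf-8")
    raise ValueError(f"unsupported collation in this sketch: {collation}")


def hash64(data: bytes) -> int:
    """64-bit hash of the key; a stand-in for the hash a sketch would use."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def distinct_hashes(values, collation: str) -> int:
    """Count distinct hashed keys. A real estimator would feed these hashes
    into HyperLogLog++ registers instead of keeping an exact set."""
    return len({hash64(collation_key(v, collation)) for v in values})


values = ["A", "a", "B", "b", "b"]
binary_count = distinct_hashes(values, "UTF8_BINARY")
lcase_count = distinct_hashes(values, "UTF8_LCASE")
```

Under these assumptions, `'A'` and `'a'` count as two distinct values with the binary collation but as one with the case-insensitive collation, which is the behavior the distinct-count statistic needs to reflect for collated columns.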

Why are the changes needed?

To properly support statistics computation for collated columns, which currently fails with UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE.

Does this PR introduce any user-facing change?

Yes. ANALYZE TABLE column statistics commands now succeed on collated columns instead of raising an unsupported-type error.

How was this patch tested?

Added checks to CollationSuite and CollationSQLExpressionsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

nikolamand-db, Jun 24 '24 14:06