iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Stale column stats getting reported when reading puffin files generated by Presto with Spark engine

Open jeesou opened this issue 5 months ago • 1 comments

Apache Iceberg version

1.6.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

As we work with multiple engines like Spark, Presto, Flink, etc., interoperability is crucial.

Recently, PRs [#10288](https://github.com/apache/iceberg/pull/10288) and [#10659](https://github.com/apache/iceberg/pull/10659) added Columnar stats support, specifically NDV (Number of Distinct Values) for Spark. Based on this, we can generate the Puffin (`.stat`) file, which can be used by Spark for better query planning.

The Puffin file generation logic was already developed internally by Presto, which creates the Puffin file when running the `ANALYZE` query.

According to the interoperability and Puffin specification, any Puffin file generated by one engine should be usable by another engine. However, we encountered issues with Spark. The Puffin file generated by Presto contains another `blob-metadata` entry, i.e., `"data_size"` of type `presto-sum-data-size-bytes-v1`, while `ndv` is of type `apache-datasketches-theta-v1`.

When we used the Puffin file generated by Presto in Spark, the `ndv` stats were being overwritten to `null` for columns where the `"data_size"` blob was present, particularly for string type fields.

To resolve this, we need to ensure that Puffin files generated by Presto can be correctly utilized by Spark.

Willingness to contribute

  • [ ] I can contribute a fix for this bug independently
  • [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • [ ] I cannot contribute a fix for this bug at this time

jeesou avatar Aug 28 '24 07:08 jeesou