Distinct stat is not written to parquet metadata

Open halvorlinder opened this issue 3 years ago • 3 comments

What happens?

When writing a table to a parquet file, the distinct stat is not set.

To Reproduce

CREATE TABLE TEST(A int);
INSERT INTO TEST VALUES(0);
COPY (SELECT * FROM TEST) TO 'test.parquet' (FORMAT 'parquet');
SELECT * FROM parquet_metadata('test.parquet');

Inspect the "stats_distinct_count" column.

Environment (please complete the following information):

  • OS: Ubuntu 20.04.4 LTS
  • DuckDB Version: v0.4.0 da9ee490d
  • DuckDB Client: CLI

Identity Disclosure:

  • Full Name: Halvor Linder Henriksen
  • Affiliation: Huawei

halvorlinder avatar Jul 21 '22 12:07 halvorlinder

In order to write the exact distinct count to parquet files we would need to run a count(distinct(col)) for every column, which is very expensive.

DuckDB only stores an approximate distinct count for base tables, which it potentially could write to the parquet metadata. However, this could have some bad consequences if the parquet file is then read by a parquet reader that assumes this distinct count to be accurate.

lnkuiper avatar Jul 21 '22 13:07 lnkuiper
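
To make the trade-off in the comment above concrete, here is a minimal Python sketch. It is not DuckDB code. `exact_distinct` must remember every distinct value in a column, which is what makes a per-column exact count costly. `approx_distinct` is a hypothetical stand-in that uses a K-minimum-values estimator (chosen for brevity, and not necessarily what DuckDB uses internally): it keeps at most `k` hashes and returns an estimate that can differ from the true count. All function names and the choice of `k` are assumptions made for illustration.

```python
import hashlib
import heapq


def exact_distinct(values):
    """Exact distinct count: has to hold every distinct value in memory."""
    return len(set(values))


def _unit_hash(value):
    """Map a value to a deterministic pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_distinct(values, k=256):
    """K-minimum-values estimate that stores at most k hashes."""
    heap = []     # negated hashes, so -heap[0] is the largest hash kept
    kept = set()  # the same hashes, used to skip duplicates
    for value in values:
        h = _unit_hash(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    return int((k - 1) / -heap[0])


# Illustrative column: 50,000 rows holding 10,000 distinct integers.
column = [i % 10_000 for i in range(50_000)]
exact = exact_distinct(column)
approx = approx_distinct(column)
print(exact, approx)
```

The estimate is normally close to the exact value but not equal to it, which is the concern raised above: a reader that treats the stored count as exact could draw wrong conclusions from it.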

We could likely set it for dictionary-encoded string columns, since we know the distinct count from the dictionary.

Mytherin avatar Jul 21 '22 13:07 Mytherin
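
The dictionary idea above can be illustrated with a small Python sketch. It is a simplified model of dictionary encoding, not DuckDB's Parquet writer, and `dictionary_encode` is a hypothetical helper. The point is that the dictionary built during encoding already lists each distinct value once, so its length gives the distinct count without a separate pass over the data.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a column of strings."""
    dictionary = []  # distinct values, in first-seen order
    positions = {}   # value -> index into the dictionary
    indices = []     # one dictionary index per row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Illustrative string column with repeated values.
column = ["oslo", "bergen", "oslo", "oslo", "bergen", "tromso"]
dictionary, indices = dictionary_encode(column)

# The distinct count falls out of the encoding step for free.
distinct_count = len(dictionary)
print(distinct_count)
```

This only holds while the whole column chunk stays dictionary-encoded. A writer that falls back to plain encoding partway through would no longer have a complete dictionary to count.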

It would be nice to have. It is better to pay the price once at write time than on every read later :)

djouallah avatar Jul 21 '22 22:07 djouallah

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] avatar Jul 30 '23 00:07 github-actions[bot]
