Distinct stat is not written to parquet metadata
What happens?
When writing a table to a Parquet file, the distinct count statistic is not set in the file's metadata.
To Reproduce
CREATE TABLE TEST(A int);
INSERT INTO TEST VALUES(0);
COPY (SELECT * FROM TEST) TO 'test.parquet' (FORMAT 'parquet');
SELECT * FROM parquet_metadata('test.parquet');
Inspect the "stats_distinct_count" column.
Environment (please complete the following information):
- OS: Ubuntu 20.04.4 LTS
- DuckDB Version: v0.4.0 da9ee490d
- DuckDB Client: CLI
Identity Disclosure:
- Full Name: Halvor Linder Henriksen
- Affiliation: Huawei
In order to write the exact distinct count to parquet files we would need to run a count(distinct(col)) for every column, which is very expensive.
DuckDB only stores an approximate distinct count for base tables, which it potentially could write to the parquet metadata. However, this could have some bad consequences if the parquet file is then read by a parquet reader that assumes this distinct count to be accurate.
We could likely set it for dictionary-encoded string columns, since we know the distinct count from the dictionary.
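To illustrate the dictionary-encoding point, here is a minimal Python sketch of why the distinct count comes for free in that case. It is not DuckDB's writer code: `dictionary_encode` and the sample column are hypothetical, NULL handling is omitted, and it assumes the whole column chunk stays dictionary-encoded (a writer that falls back to plain encoding partway through would no longer have a complete dictionary).

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: return (dictionary, codes) for a string column."""
    index: Dict[str, int] = {}   # value -> position in the dictionary
    dictionary: List[str] = []   # each distinct value, stored once
    codes: List[int] = []        # one small integer per row
    for value in values:
        code = index.get(value)
        if code is None:
            # First time this value is seen: append it to the dictionary.
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


# Made-up sample data, standing in for one column chunk.
column = ["oslo", "bergen", "oslo", "oslo", "bergen", "tromso"]
dictionary, codes = dictionary_encode(column)

# The dictionary holds every distinct value exactly once, so its length
# is the exact distinct count. No separate count(distinct ...) pass is needed.
distinct_count = len(dictionary)
```

The count is a by-product of the encoding pass the writer performs anyway, which is why this case avoids the extra cost described above. It only covers the values seen in one dictionary, so a file-wide count across several row groups would still need additional work.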
It would be nice to have this. It is better to pay the price once at write time than on every read later :)
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.