greptimedb
Support `APPROX_COUNT_DISTINCT` in SQL
What problem does the new feature solve?
The current `COUNT` operator relies on loading all SST files and deduplicating rows to calculate an accurate row count. That is unnecessary when users only want to know an approximate row count, especially in time-series scenarios, so we can provide `APPROX_COUNT_DISTINCT` instead.
What does the feature do?
Calculate row count solely based on SST metadata.
Implementation challenges
Sum the row count of SST files
When building `ParquetRecordBatchStreamBuilder`:
https://github.com/GreptimeTeam/greptimedb/blob/d402f8344271996ae33fd166ca64df41358281b1/src/storage/src/sst/parquet.rs#L245
We can use `builder.metadata().file_metadata().num_rows()` to read the number of rows in an SST file directly from its file metadata.
The `APPROX_COUNT_DISTINCT` value is then "num_rows of all SST files + memtable size".
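The summation above can be sketched as follows. This is a minimal, standard-library-only illustration, not GreptimeDB's actual code: `SstFileMeta`, `approx_row_count`, and the `memtable_rows` parameter are hypothetical names, and "memtable size" is assumed here to mean the number of rows in the memtable. In a real implementation each `num_rows` value would come from `builder.metadata().file_metadata().num_rows()`.

```rust
/// Hypothetical stand-in for the metadata of one SST file.
struct SstFileMeta {
    /// Row count recorded in the file's metadata.
    num_rows: u64,
}

/// Approximate row count: rows of all SST files plus rows in the memtable.
/// No row data is read, so rows duplicated across files are counted more than once.
fn approx_row_count(ssts: &[SstFileMeta], memtable_rows: u64) -> u64 {
    let sst_rows: u64 = ssts.iter().map(|f| f.num_rows).sum();
    sst_rows + memtable_rows
}
```

For example, two SST files with 1,000 and 2,500 rows plus a memtable holding 42 rows give an estimate of 3,542.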
Caveats
This implementation does not merge duplicate rows that reside in different SST files, so it may overcount. For most time-series scenarios this is acceptable, since updates and deletes are rare.
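To make the caveat concrete, the sketch below compares a metadata-style estimate with an exact, deduplicated count. Everything here is an assumption for illustration: rows are modeled as a hypothetical `RowKey` of (series key, timestamp), each SST file is modeled as a plain `Vec` of such keys, and `approx_count` and `exact_count` are made-up helper names.

```rust
use std::collections::HashSet;

/// Hypothetical row identity: (series key, timestamp in milliseconds).
type RowKey = (String, i64);

/// Metadata-style estimate: add up per-file row counts without comparing rows.
fn approx_count(files: &[Vec<RowKey>]) -> usize {
    files.iter().map(|rows| rows.len()).sum()
}

/// Exact count: deduplicate rows across all files by their key.
fn exact_count(files: &[Vec<RowKey>]) -> usize {
    files.iter().flatten().collect::<HashSet<_>>().len()
}
```

If a row with the same key appears in two files, `approx_count` counts it twice while `exact_count` counts it once, so the estimate is never smaller than the exact count and the gap grows with the number of rewritten rows.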
@v0y4g3r I'd like to pick up this issue, but our codebase seems to have changed a lot. Could you revise the issue a bit to show where we can start implementing `APPROX_COUNT_DISTINCT`?