
Support `APPROX_COUNT_DISTINCT` in SQL

Open v0y4g3r opened this issue 1 year ago • 1 comments

What problem does the new feature solve?

The current COUNT operator relies on loading all SST files and deduplicating rows to calculate an accurate row count. That is unnecessary when users only want to know an approximate row count, especially in time-series scenarios. We can provide an APPROX_COUNT_DISTINCT instead.

What does the feature do?

Calculate an approximate row count based solely on SST metadata.

Implementation challenges

Sum the row count of SST files

When building ParquetRecordBatchStreamBuilder:

https://github.com/GreptimeTeam/greptimedb/blob/d402f8344271996ae33fd166ca64df41358281b1/src/storage/src/sst/parquet.rs#L245

We can use builder.metadata().file_metadata().num_rows() to read the number of rows in an SST file directly from the file metadata.

The APPROX_COUNT_DISTINCT value is then "num_rows of all SST files + memtable size".
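
A rough sketch of that sum, using only the standard library. `SstFileMeta` and `approx_row_count` are hypothetical names and not part of the codebase. The sketch assumes "memtable size" means the number of rows currently in the memtable, and that each SST's row count has already been read from its Parquet footer (where `num_rows()` returns an `i64`).

```rust
/// Hypothetical stand-in for the metadata of one SST file.
/// In the real reader, `num_rows` would be taken from
/// `builder.metadata().file_metadata().num_rows()`.
#[derive(Debug, Clone, Copy)]
pub struct SstFileMeta {
    pub num_rows: i64,
}

/// Approximate row count: rows recorded in SST metadata plus rows
/// still held in the memtable. No file data is read or deduplicated.
pub fn approx_row_count(sst_files: &[SstFileMeta], memtable_rows: u64) -> u64 {
    let sst_rows: u64 = sst_files
        .iter()
        // Clamp a negative (corrupt) count to zero instead of wrapping.
        .map(|f| f.num_rows.max(0) as u64)
        .sum();
    sst_rows.saturating_add(memtable_rows)
}
```

For example, two SST files with 100 and 50 rows plus 10 rows in the memtable would yield 160 from `approx_row_count`.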

Caveats

This implementation does not merge duplicate rows that reside in different SST files, so it can overcount. For most time-series scenarios this should be acceptable, since updates and deletes are rare.
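
To make the overcounting concrete, here is a small illustration that models each SST file as a list of row keys. The `RowKey` type and both function names are hypothetical, and the assumption that rows are deduplicated on a (series id, timestamp) pair is only for the sake of the example.

```rust
use std::collections::HashSet;

/// Hypothetical row key: (series id, timestamp).
pub type RowKey = (u32, i64);

/// Sum of per-file row counts, as SST metadata alone would report them.
pub fn metadata_row_count(sst_files: &[Vec<RowKey>]) -> usize {
    sst_files.iter().map(|file| file.len()).sum()
}

/// Exact count after merging rows that share a key across files.
pub fn deduplicated_row_count(sst_files: &[Vec<RowKey>]) -> usize {
    sst_files.iter().flatten().collect::<HashSet<_>>().len()
}
```

If the key `(1, 2000)` appears in two files because the row was rewritten, the metadata-based count includes it twice while the deduplicated count includes it once. The gap grows only with the number of such cross-file duplicates.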

v0y4g3r, Mar 24 '23 07:03

@v0y4g3r I'm trying to pick up this issue, but our codebase seems to have changed a lot since then. Can you revise the issue a bit to show where we can start implementing APPROX_COUNT_DISTINCT?

tisonkun, Apr 26 '24 14:04