ClickHouse

What if the inverted index recorded only the mark number for rows in the same granule?

Open ardenwick opened this issue 1 year ago • 4 comments

In the current implementation of the inverted index, the posting list stores the row IDs where a term appears. At query time, the row ID range of each index granule is matched against the posting list to check whether the granule contains any of the term's row IDs. For a matching granule, mayBeTrueOnGranuleInPart returns true.
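A minimal sketch of the range check described above, not the ClickHouse code itself. It assumes the posting list is a sorted Python list of row IDs and that a granule covers the half-open row range [row_begin, row_end); the function name `granule_may_match` and the granularity of 8192 rows are illustrative assumptions.

```python
from bisect import bisect_left


def granule_may_match(posting_list, row_begin, row_end):
    """Return True if any row ID in the sorted posting list lies in [row_begin, row_end)."""
    # Find the first posted row ID that is >= row_begin.
    i = bisect_left(posting_list, row_begin)
    # The granule matches only if that row ID also falls before row_end.
    return i < len(posting_list) and posting_list[i] < row_end


# Hypothetical posting list for one term: sorted row IDs where the term appears.
posting_list = [3, 8200, 8201, 40000]
granularity = 8192  # assumed number of rows per granule

# Check the first six granules one by one.
matches = [
    granule_may_match(posting_list, g * granularity, (g + 1) * granularity)
    for g in range(6)
]
```

Here the posting list holds one entry per matching row, so its size grows with the number of rows containing the term, even though the answer needed at query time is only one boolean per granule.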

If the posting list recorded only a granule ID (i.e. the mark number) for the rows in a granule, its cardinality could be greatly reduced.
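To illustrate the cardinality argument, here is a rough sketch that collapses row IDs into granule IDs by integer division. The helper name `to_granule_postings`, the granularity of 8192 rows, and the sample data are assumptions made for the example only.

```python
def to_granule_postings(row_ids, granularity):
    """Collapse row IDs into the sorted set of granule IDs (mark numbers) they fall in."""
    return sorted({row_id // granularity for row_id in row_ids})


# Hypothetical term that appears in every second row of the first 20000 rows.
row_ids = list(range(0, 20000, 2))
granule_ids = to_granule_postings(row_ids, 8192)  # assumed rows per granule

row_level_size = len(row_ids)          # entries in a row-ID posting list
granule_level_size = len(granule_ids)  # entries in a granule-ID posting list
```

In this made-up case the row-level posting list has 10000 entries while the granule-level one has 3, at the cost of no longer knowing which rows inside a granule match.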

ardenwick avatar Apr 14 '24 17:04 ardenwick

There is no problem with your idea. We have previously modified the index along these lines and tested it. On our machine, the performance of the inverted index was equivalent to that of a primary key query.

juppylm avatar May 11 '24 07:05 juppylm

When the inverted index is written, granule.mark_number is stored instead of the row ID. At query time, filtering is done by MarkRange.
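The query-side filtering can be sketched roughly as follows. This is only an illustration under stated assumptions: a mark range is modelled as a half-open `(begin, end)` tuple of mark numbers, the posting list is a Python set of mark numbers, and `filter_mark_ranges` is a hypothetical helper, not a ClickHouse function.

```python
def filter_mark_ranges(mark_ranges, granule_postings):
    """Keep only the marks found in granule_postings, merged into contiguous (begin, end) ranges."""
    result = []
    for begin, end in mark_ranges:
        start = None  # start of the current run of matching marks, if any
        for mark in range(begin, end):
            if mark in granule_postings:
                if start is None:
                    start = mark
            elif start is not None:
                # The run of matching marks ended just before this mark.
                result.append((start, mark))
                start = None
        if start is not None:
            # The run extends to the end of the input range.
            result.append((start, end))
    return result


# Hypothetical input: two candidate ranges and a term posted in marks 1, 2, 7, 9 and 12.
selected = filter_mark_ranges([(0, 4), (6, 10)], {1, 2, 7, 9, 12})
```

With these inputs, `selected` contains only the sub-ranges covering marks 1 to 2, 7, and 9; mark 12 is ignored because it lies outside the candidate ranges.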

juppylm avatar May 11 '24 07:05 juppylm

Please check #62706 to see whether the proposed improvement is appropriate.

ardenwick avatar May 11 '24 08:05 ardenwick

Mapping terms to granule IDs will make the inverted index behave like a bloom filter with a zero false-positive rate. So I switched to using a 'divisor' instead.

ardenwick avatar May 11 '24 08:05 ardenwick