
Provide true real-time indexing for Lucene based text index

Open itschrispeck opened this issue 8 months ago • 0 comments

Problem

Currently, Pinot's RealtimeLuceneTextIndex uses Lucene's near real-time (NRT) indexing functionality. Some effort has already been made to reduce the refresh delay. However, because NRT only makes docs searchable after a refresh, true real-time indexing is still missing.

This shows up in a few ways:

  1. text_match(col, '"abcd"') -> forward match misses the most recent docs
  2. NOT text_match(col, '"abcd"') -> inverse match fails to exclude the most recent docs, so users will see docs containing abcd
  3. Missing results for upsert, for example:
    t0: doc A ingested/doc A is the valid doc based on upsert latest docs
    t1: doc A text indexed, doc A searchable w/ text index
    t2: doc B ingested/doc B is the valid doc based on upsert latest docs
    <text_match query returns doc A, but upsert invalidated doc A, no results>
    t3: doc B text indexed, doc B searchable w/ text index
    <text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
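
To make the upsert timeline concrete, here is a minimal toy model in plain Java. It is not Pinot code: the class and method names are hypothetical, it tracks a single primary key, and it assumes the query result is simply "docs searchable in the text index AND valid under upsert".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of the upsert timeline; every name here is hypothetical, not Pinot code.
class UpsertLagModel {
    // Doc currently considered valid by upsert for the (single) primary key.
    private int validDocId = -1;
    // Docs the text index can already see, i.e. after an NRT refresh.
    private final TreeSet<Integer> searchable = new TreeSet<>();

    // Ingestion makes the new doc the valid one immediately.
    void ingest(int docId) {
        validDocId = docId;
    }

    // The text index catches up some time later.
    void refreshTextIndex(int docId) {
        searchable.add(docId);
    }

    // text_match result after upsert filtering: searchable AND valid.
    List<Integer> textMatch() {
        List<Integer> result = new ArrayList<>();
        for (int docId : searchable) {
            if (docId == validDocId) {
                result.add(docId);
            }
        }
        return result;
    }
}
```

Calling `ingest(0)`, `refreshTextIndex(0)`, `ingest(1)` and then `textMatch()` returns an empty list, which is the window between t2 and t3: doc A is searchable but no longer valid, and doc B is valid but not yet searchable.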
    

With the delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between what the NRT index has made searchable and the docs already ingested in Pinot.
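
A rough sketch of the bridging idea, in plain Java with no Lucene dependency. Everything here is an assumption made for illustration: the class name, the `refreshedUpTo` watermark, and the use of a substring scan as a stand-in for both the NRT index and the small in-memory index. The point is only the shape of the query path: results from the refreshed portion, plus results from the not-yet-refreshed tail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Pinot implementation.
class HybridTextIndexSketch {
    // All ingested docs; the doc ID is the position in this list.
    private final List<String> docs = new ArrayList<>();
    // Docs [0, refreshedUpTo) are visible to the simulated NRT reader.
    private int refreshedUpTo = 0;

    int add(String text) {
        docs.add(text);
        return docs.size() - 1;
    }

    // Simulates an NRT refresh: everything ingested so far becomes searchable.
    void refresh() {
        refreshedUpTo = docs.size();
    }

    // Substring scan standing in for a real text index lookup.
    private List<Integer> scan(String term, int from, int to) {
        List<Integer> result = new ArrayList<>();
        for (int docId = from; docId < to; docId++) {
            if (docs.get(docId).contains(term)) {
                result.add(docId);
            }
        }
        return result;
    }

    // What the NRT index alone returns: misses the most recent docs.
    List<Integer> matchNrtOnly(String term) {
        return scan(term, 0, refreshedUpTo);
    }

    // NRT results plus the small in-memory tail that has not been refreshed yet.
    List<Integer> matchHybrid(String term) {
        List<Integer> result = scan(term, 0, refreshedUpTo);
        result.addAll(scan(term, refreshedUpTo, docs.size()));
        return result;
    }
}
```

Because the tail only ever holds docs ingested since the last refresh, it stays small as long as the refresh delay stays small, which is why minimizing the delay comes first.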

Alternatives considered:

  • bound the most recent doc considered during query execution based on index refresh delay

    • For the V1 query engine, I think this can be done in FilterOperatorUtils by 'adjusting' numDocs if the data source has a text index.
    • This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
    • This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}
  • rewrite NOT text_match(col, '"abcd"') to text_match(col, '/.*/ AND NOT "abcd"')

    • this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
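
Both alternatives can be sketched in a few lines of plain Java. This is illustration only: the class and method names are hypothetical, `notMatch` is not how FilterOperatorUtils is written, and the regex rewrite assumes a single quoted term or phrase (a compound text query would need grouping before being negated).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helpers, not Pinot code.
class TextMatchAlternatives {
    // Alternative 1: the inverse match is the complement of the matched docs
    // over [0, numDocs). Passing a numDocs bounded by what the text index has
    // refreshed keeps unindexed docs out of the complement.
    static List<Integer> notMatch(Set<Integer> matched, int numDocs) {
        List<Integer> result = new ArrayList<>();
        for (int docId = 0; docId < numDocs; docId++) {
            if (!matched.contains(docId)) {
                result.add(docId);
            }
        }
        return result;
    }

    private static final Pattern NOT_TEXT_MATCH = Pattern.compile(
        "NOT\\s+text_match\\(\\s*(\\w+)\\s*,\\s*'([^']*)'\\s*\\)",
        Pattern.CASE_INSENSITIVE);

    // Alternative 2: push the negation into the text query so that only docs
    // the text index already knows about can be returned.
    static String rewriteNotTextMatch(String filter) {
        return NOT_TEXT_MATCH.matcher(filter)
            .replaceAll("text_match($1, '/.*/ AND NOT $2')");
    }
}
```

As an example, take six ingested docs where docs 1 and 4 contain abcd and the text index has only refreshed through doc 3, so it reports `{1}` as matched. `notMatch(Set.of(1), 6)` returns docs 0, 2, 3, 4, 5, which wrongly includes doc 4. `notMatch(Set.of(1), 4)` returns docs 0, 2, 3: nothing extraneous, but doc 5 is missing until the next refresh.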

itschrispeck, Jun 27 '24 21:06