
FSTOrdPostingsFormat could enable faster Tagger

Open dsmiley opened this issue 7 years ago • 1 comments

The Lucene FSTOrdPostingsFormat (Solr schema postingsFormat="FSTOrd50") is like FSTPostingsFormat but also exposes term ordinals. Most postings formats do not support ordinals, but this one does. In TermPrefixCursor.java I left a comment that it could be more efficient if we could use ordinals, and I think this might be true. Instead of eagerly reading and caching the postings (the list of docIDs), we could just capture the ordinal (an int). This would replace some of the "IntsRef" instances with the integer ordinal, and TermPrefixCursor (TPC) wouldn't need docIdsCache either. The actual work of reading the postings would be deferred until we resolve the ordinal in getDocIds(), and that is perhaps not expensive. Sometimes we are never even asked to do that, which saves some time: the tag may have been eliminated due to overlapping, or it may have effectively been cached at a higher level (TaggerRequestHandler transforms to the uniqueKey values and then caches that).
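To make the idea concrete, here is a minimal, self-contained sketch of deferring the postings read until after overlap reduction. It does not use Lucene or the SolrTextTagger code: `TagCandidate`, `CountingPostings`, and `dropOverlapped` are hypothetical names invented for illustration, the overlap rule is a simplified "longest wins" stand-in, and an in-memory `int[][]` plays the role of the postings keyed by ordinal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types exist in SolrTextTagger or Lucene.
final class DeferredPostingsSketch {

    /** A tag candidate that remembers only the term ordinal, not the docIDs. */
    static final class TagCandidate {
        final int startOffset;
        final int endOffset;
        final int termOrd; // assumed to come from a postings format that exposes ordinals

        TagCandidate(int startOffset, int endOffset, int termOrd) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.termOrd = termOrd;
        }

        int length() {
            return endOffset - startOffset;
        }
    }

    /** Stand-in for the postings; counts reads so the saving is visible. */
    static final class CountingPostings {
        private final int[][] docIdsByOrd;
        int reads = 0;

        CountingPostings(int[][] docIdsByOrd) {
            this.docIdsByOrd = docIdsByOrd;
        }

        /** The deferred, potentially expensive step. */
        int[] getDocIds(int termOrd) {
            reads++;
            return docIdsByOrd[termOrd].clone();
        }
    }

    /** Simplified overlap reduction: among overlapping candidates, keep the longest. */
    static List<TagCandidate> dropOverlapped(List<TagCandidate> sortedByStart) {
        List<TagCandidate> kept = new ArrayList<>();
        for (TagCandidate c : sortedByStart) {
            int last = kept.size() - 1;
            if (last >= 0 && c.startOffset < kept.get(last).endOffset) {
                if (c.length() > kept.get(last).length()) {
                    kept.set(last, c); // the longer candidate replaces the shorter one
                }
            } else {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        CountingPostings postings =
            new CountingPostings(new int[][] {{1, 2}, {3}, {4, 5, 6}});
        List<TagCandidate> candidates = Arrays.asList(
            new TagCandidate(0, 5, 0),
            new TagCandidate(0, 11, 2),
            new TagCandidate(20, 25, 1));

        // Postings are read only for candidates that survive overlap reduction.
        for (TagCandidate c : dropOverlapped(candidates)) {
            System.out.println(c.startOffset + "-" + c.endOffset + " -> "
                + Arrays.toString(postings.getDocIds(c.termOrd)));
        }
        System.out.println("postings reads: " + postings.reads
            + " for " + candidates.size() + " candidates");
    }
}
```

In this toy run the candidate with ordinal 0 is eliminated by the longer overlapping candidate, so its postings are never read. Whether that saving outweighs the cost of the later lookup in the real tagger is exactly the open question.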

I'm not sure how much benefit this would bring; it could even be a net loss. It's hard to be sure.

The downside is that we'd basically be limited to this PostingsFormat. At least the PostingsWriterBase aspect of it is pluggable (kind of), should we want future improvements that allow a fully in-memory option. To ameliorate this downside, we could support any PostingsFormat by grabbing the TermState instead; presumably the TermState of FSTOrdPostingsFormat is effectively the ordinal.
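A rough sketch of the format-independent variant: the tagger holds an opaque handle of type `S` and never assumes it is an int. In Lucene the analogous calls would presumably be `TermsEnum.termState()` to capture the handle and `TermsEnum.seekExact(BytesRef, TermState)` to return to the term later, but the code below deliberately avoids Lucene; `TermLookup`, `OrdLookup`, and `MapLookup` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these types are illustrative and not part of Lucene or SolrTextTagger.
final class OpaqueTermStateSketch {

    /** S is whatever cheap handle a given postings format can hand back. */
    interface TermLookup<S> {
        S captureState(String term); // cheap: no postings are read; null if the term is absent
        int[] resolve(S state);      // the deferred, potentially expensive step
    }

    /** A format with ordinals: the handle is simply the ordinal. */
    static final class OrdLookup implements TermLookup<Integer> {
        private final String[] terms;
        private final int[][] docIds;

        OrdLookup(String[] terms, int[][] docIds) {
            this.terms = terms;
            this.docIds = docIds;
        }

        @Override
        public Integer captureState(String term) {
            for (int ord = 0; ord < terms.length; ord++) {
                if (terms[ord].equals(term)) {
                    return ord;
                }
            }
            return null;
        }

        @Override
        public int[] resolve(Integer ord) {
            return docIds[ord].clone();
        }
    }

    /** A format without ordinals: the handle is some other opaque key. */
    static final class MapLookup implements TermLookup<String> {
        private final Map<String, int[]> docIdsByTerm = new HashMap<>();

        MapLookup put(String term, int... docIds) {
            docIdsByTerm.put(term, docIds);
            return this;
        }

        @Override
        public String captureState(String term) {
            return docIdsByTerm.containsKey(term) ? term : null;
        }

        @Override
        public int[] resolve(String key) {
            return docIdsByTerm.get(key).clone();
        }
    }

    /** Caller code is written once and works with either handle type. */
    static <S> int[] captureThenResolve(TermLookup<S> lookup, String term) {
        S state = lookup.captureState(term);
        return state == null ? new int[0] : lookup.resolve(state);
    }
}
```

The point of the generic parameter is that the tagging code stays the same whether the handle turns out to be an ordinal or something heavier.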

dsmiley avatar May 01 '18 15:05 dsmiley

Upon further inspection of FSTOrdPostingsFormat (actually FSTOrdTermsReader), it has TODOs for ord(), which is bizarre: why does this postingsFormat even exist if it doesn't yet support ords? I filed an issue: https://issues.apache.org/jira/browse/LUCENE-8285

dsmiley avatar May 01 '18 15:05 dsmiley