`SynonymGraphFilter` should read FSTs off-heap?
Description
[Spinoff from #13004]
Recently we added off-heap FST reading, but only switched to it in limited cases, starting with the terms index in BlockTree terms dictionary. Should we switch to off-heap for `SynonymGraphFilter` too?
I was looking into how to implement this and I think I've mostly got it -- essentially, I would write the `SynonymMap` to a file (which could be an offline operation, basically "precompile your `SynonymMap`" so you don't need to parse + compile it on startup).
What's got me stuck is that `OffHeapFSTStore` takes an `IndexInput`, which AFAIK should only be returned from a `Directory`. We don't want to write the `SynonymMap` to the index where it's used, right?
Huh... I guess we could use a separate, sidecar directory for the precompiled `SynonymMap`. That directory could optionally be passed to `SynonymGraphFilterFactory` to let it load a precompiled (off-heap) `SynonymMap`.
Does it sound like I am on the right path?
I have a (rough) PR to address this: https://github.com/apache/lucene/pull/13054.
I also moved the output word lookup off-heap, but it requires a random seek (within a hopefully MMapped file) before every lookup.
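To make the random-seek cost concrete, here is a minimal sketch of what an off-heap word lookup could look like. It uses only `java.nio`, not Lucene's `Directory`/`IndexInput` API, and it is not the code from the PR: the class `OffHeapWordTable`, its methods, and the file layout (word count, offset table, then UTF-8 bytes) are all made up for illustration. Each `word(ord)` call reads two offsets and then jumps to an arbitrary position in the mapped file, which is the per-lookup seek mentioned above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not a Lucene class and not taken from the PR. */
final class OffHeapWordTable {
  private final ByteBuffer buf; // memory-mapped, so the word bytes stay off the Java heap
  private final int count;

  private OffHeapWordTable(ByteBuffer buf) {
    this.buf = buf;
    this.count = buf.getInt(0); // the file starts with the number of words
  }

  /** "Precompile" step: write the count, count+1 offsets, then the UTF-8 bytes of every word. */
  static void write(Path file, List<String> words) throws IOException {
    byte[][] encoded = new byte[words.size()][];
    int dataLen = 0;
    for (int i = 0; i < encoded.length; i++) {
      encoded[i] = words.get(i).getBytes(StandardCharsets.UTF_8);
      dataLen += encoded[i].length;
    }
    ByteBuffer out = ByteBuffer.allocate(Integer.BYTES * (encoded.length + 2) + dataLen);
    out.putInt(encoded.length);
    int offset = 0;
    for (byte[] w : encoded) {
      out.putInt(offset); // start of this word, relative to the data region
      offset += w.length;
    }
    out.putInt(offset); // sentinel: end of the last word
    for (byte[] w : encoded) {
      out.put(w);
    }
    Files.write(file, out.array());
  }

  /** Maps the file read-only; the mapping stays valid after the channel is closed. */
  static OffHeapWordTable open(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      return new OffHeapWordTable(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
    }
  }

  int size() {
    return count;
  }

  /** Looks up a word by ordinal: two offset reads, then one jump into the data region. */
  String word(int ord) {
    if (ord < 0 || ord >= count) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    int dataStart = Integer.BYTES * (count + 2);
    int start = buf.getInt(Integer.BYTES * (1 + ord));
    int end = buf.getInt(Integer.BYTES * (2 + ord));
    byte[] bytes = new byte[end - start];
    ByteBuffer view = buf.duplicate(); // private position, shared mapped content
    view.position(dataStart + start);  // the random seek
    view.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("synonym-words", ".bin");
    write(file, Arrays.asList("fast", "quick", "speedy"));
    OffHeapWordTable table = open(file);
    for (int ord = 0; ord < table.size(); ord++) {
      System.out.println(ord + " -> " + table.word(ord));
    }
  }
}
```

The only data copied onto the heap is the handful of bytes for the word being returned. Whether each seek is cheap depends on the page already being resident in the OS cache, which is the trade-off flagged above.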