
`SynonymGraphFilter` should read FSTs off-heap?

Open mikemccand opened this issue 1 year ago • 2 comments

Description

[Spinoff from #13004]

Recently we added off-heap FST reading, but only switched to it in limited cases, starting with the terms index in the BlockTree terms dictionary. Should we switch to off-heap for `SynonymGraphFilter` too?

mikemccand, Jan 09 '24 19:01

I was looking into how to implement this and I think I've mostly got it -- essentially, I would write the SynonymMap to a file (which could be an offline operation, basically "precompile your SynonymMap" so you don't need to parse + compile it on startup).

What's got me stuck is that OffHeapFSTStore takes an IndexInput, which AFAIK should only come from a Directory. We don't want to write the SynonymMap into the index where it's used, right?

Huh... I guess we could use a separate, sidecar directory for the precompiled SynonymMap. That directory could optionally be passed to SynonymGraphFilterFactory to let it load a precompiled (off-heap) SynonymMap.

Does it sound like I am on the right path?

msfroh, Jan 16 '24 22:01

I have a (rough) PR to address this: https://github.com/apache/lucene/pull/13054.

I also moved the output word lookup off-heap, but it requires a random seek (within a file that is hopefully memory-mapped) before every lookup.

msfroh, Jan 30 '24 08:01
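
To make the trade-off in the last comment concrete, here is a minimal sketch of an off-heap word table: words are precompiled into a sidecar file once, the file is memory-mapped at load time, and each lookup performs one random seek into the mapping. This uses only the JDK and is not the code from the linked PR, nor Lucene's API; the class name `OffHeapWordTable`, its methods, and the file layout (a count, an offset table, then UTF-8 bytes) are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical sketch: output words kept in a memory-mapped sidecar file. */
public final class OffHeapWordTable {
    private final MappedByteBuffer buf;
    private final int count;
    private final int dataStart;

    private OffHeapWordTable(MappedByteBuffer buf) {
        this.buf = buf;
        this.count = buf.getInt(0);
        // Header = 1 int (count) + (count + 1) ints (offsets, with end sentinel).
        this.dataStart = Integer.BYTES * (count + 2);
    }

    /** "Precompile" step: write the words to a file once, offline. */
    public static void write(Path file, List<String> words) throws IOException {
        byte[][] encoded = new byte[words.size()][];
        int total = 0;
        for (int i = 0; i < encoded.length; i++) {
            encoded[i] = words.get(i).getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES * (encoded.length + 2) + total);
        out.putInt(encoded.length);
        int offset = 0;
        for (byte[] w : encoded) {
            out.putInt(offset); // start of each word, relative to dataStart
            offset += w.length;
        }
        out.putInt(offset); // sentinel: end of the last word
        for (byte[] w : encoded) {
            out.put(w);
        }
        Files.write(file, out.array());
    }

    /** Load step: map the file instead of parsing it onto the heap. */
    public static OffHeapWordTable open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return new OffHeapWordTable(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    /** One random seek into the mapped file per lookup. */
    public String lookup(int ord) {
        if (ord < 0 || ord >= count) {
            throw new IndexOutOfBoundsException("ord=" + ord);
        }
        int start = buf.getInt(Integer.BYTES * (1 + ord));
        int end = buf.getInt(Integer.BYTES * (2 + ord));
        byte[] bytes = new byte[end - start];
        ByteBuffer view = buf.duplicate(); // independent position, safe across threads
        view.position(dataStart + start);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In this sketch, only the small `byte[]` copied out per lookup lives on the heap; the table itself is paged in by the OS on demand. That is the same shape of cost described above: cheap when the pages are resident, a page fault when they are not.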