lucene
lucene copied to clipboard
Make FSTPostingFormat to build FST off-heap
Description
This is an attempt to make FSTPostingFormat to write the FST off-heap. Instead of write it on-heap then save to disk, we configure the compiler to write the FST off-heap right from the start.
Some additional changes:
- As we can't write the FST metadata and FST data on the same file, now we need to break the
tfpfile into 2 files:tfp.metaandtfp.data - We need to write the starting address of the FST data in the posting metadata file, then seek to that address when read
An alternative way is to copy the written FST back into metadata file, but that will slow down the writing.
I found another quite tricky issue:
If we write the FST directly to the IndexOutput, there might be a chance that there's no term accepted by the FST, in that case we still write the padding 0 byte. This padding byte is to ensure no node having the 0 address: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L172-L174
However, since we are writing the FST consecutively for each field, appending to the same file, that means there could be a case we still write that additional padding byte, which is mapped to no field: [ 0 ] [ FST_Field_1 ] [ 0 ] [ 0 ] [ FST_Field_3 ]
There is only 1 failed test left: TestFSTPostingFormat.testRandomException
> Caused by:
> java.lang.RuntimeException: unclosed IndexOutput: _s_FST50_0.tfp.meta
> at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783)
Seems like some file might not be closed correctly when there are exception
Fixed the above unclosed issue by moving openInput and createOutput to try-catch block
The test passed. I'll add a change log, some more comments
I added the change log, increased the FSTPostingsFormat version (isn't entirely related to this PR, but it seems the naming convention is outdated). The change for FSTCompiler can be merged as part of #12981 first. Will published the PR when the tests finished
I'm not sure why FSTPostingsFormat is different from the rest, that it write both the metadata and data to the same file. I think writing to separate files would be cleaner and more consistent.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
I'm not sure why FSTPostingsFormat is different from the rest, that it write both the metadata and data to the same file. I think writing to separate files would be cleaner and more consistent.
~Let's defer this for a future issue?~ (Edit: nevermind -- looks like you already tackled this in the PR, great!). This is a rarely used experimental postings format, and might be replaced by the new postings format in https://github.com/apache/lucene/pull/12688.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Thanks @mikemccand for the clarification!
Do you think we should still make this change? One benefit is that it can be used for reference. Otherwise I'll close this PR
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!