lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Make FSTPostingFormat to build FST off-heap

Open dungba88 opened this issue 1 year ago • 10 comments
trafficstars

Description

This is an attempt to make FSTPostingFormat to write the FST off-heap. Instead of write it on-heap then save to disk, we configure the compiler to write the FST off-heap right from the start.

Some additional changes:

  • As we can't write the FST metadata and FST data on the same file, now we need to break the tfp file into 2 files: tfp.meta and tfp.data
  • We need to write the starting address of the FST data in the posting metadata file, then seek to that address when read

An alternative way is to copy the written FST back into metadata file, but that will slow down the writing.

dungba88 avatar Dec 26 '23 06:12 dungba88

I found another quite tricky issue:

If we write the FST directly to the IndexOutput, there might be a chance that there's no term accepted by the FST, in that case we still write the padding 0 byte. This padding byte is to ensure no node having the 0 address: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/fst/FSTCompiler.java#L172-L174

However, since we are writing the FST consecutively for each field, appending to the same file, that means there could be a case we still write that additional padding byte, which is mapped to no field: [ 0 ] [ FST_Field_1 ] [ 0 ] [ 0 ] [ FST_Field_3 ]

dungba88 avatar Dec 26 '23 09:12 dungba88

There is only 1 failed test left: TestFSTPostingFormat.testRandomException

   >         Caused by:
   >         java.lang.RuntimeException: unclosed IndexOutput: _s_FST50_0.tfp.meta
   >             at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783)

Seems like some file might not be closed correctly when there are exception

dungba88 avatar Dec 27 '23 01:12 dungba88

Fixed the above unclosed issue by moving openInput and createOutput to try-catch block

The test passed. I'll add a change log, some more comments

dungba88 avatar Dec 27 '23 03:12 dungba88

I added the change log, increased the FSTPostingsFormat version (isn't entirely related to this PR, but it seems the naming convention is outdated). The change for FSTCompiler can be merged as part of #12981 first. Will published the PR when the tests finished

dungba88 avatar Dec 28 '23 08:12 dungba88

I'm not sure why FSTPostingsFormat is different from the rest, that it write both the metadata and data to the same file. I think writing to separate files would be cleaner and more consistent.

dungba88 avatar Jan 11 '24 13:01 dungba88

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jan 26 '24 00:01 github-actions[bot]

I'm not sure why FSTPostingsFormat is different from the rest, that it write both the metadata and data to the same file. I think writing to separate files would be cleaner and more consistent.

~Let's defer this for a future issue?~ (Edit: nevermind -- looks like you already tackled this in the PR, great!). This is a rarely used experimental postings format, and might be replaced by the new postings format in https://github.com/apache/lucene/pull/12688.

mikemccand avatar Feb 08 '24 12:02 mikemccand

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Feb 23 '24 00:02 github-actions[bot]

Thanks @mikemccand for the clarification!

Do you think we should still make this change? One benefit is that it can be used for reference. Otherwise I'll close this PR

dungba88 avatar Feb 28 '24 01:02 dungba88

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Mar 15 '24 00:03 github-actions[bot]