Restore fsync behavior in FSDirectory via P/Invoke, #933
- [X] You've read the Contributor Guide and Code of Conduct.
- [X] You've included unit or integration tests for your change, where applicable. (Existing unit tests)
- [X] You've included inline docs for your change, where applicable.
- [X] There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.
This restores the commented-out fsync behavior in FSDirectory to help mitigate a performance regression in .NET 8.
Fixes #933
Description
See 933 for more details. Perf benchmark before:
| Method | Job | Runtime | IterationCount | LaunchCount | WarmupCount | N | Mean | Error | StdDev | Median | Ratio | RatioSD |
|------------------------ |--------- |--------- |--------------- |------------ |------------ |--- |----------:|-----------:|---------:|----------:|------:|--------:|
| IndexDocumentsBenchmark | .NET 7.0 | .NET 7.0 | Default | Default | Default | 10 | 44.11 ms | 3.756 ms | 11.02 ms | 40.34 ms | 1.00 | 0.00 |
| IndexDocumentsBenchmark | .NET 8.0 | .NET 8.0 | Default | Default | Default | 10 | 843.36 ms | 16.485 ms | 24.16 ms | 843.93 ms | 19.11 | 4.86 |
After:
| Method | Job | Runtime | IterationCount | LaunchCount | WarmupCount | N | Mean | Error | StdDev | Median | Ratio | RatioSD |
|------------------------ |--------- |--------- |--------------- |------------ |------------ |--- |---------:|-----------:|----------:|---------:|------:|--------:|
| IndexDocumentsBenchmark | .NET 7.0 | .NET 7.0 | Default | Default | Default | 10 | 50.15 ms | 5.044 ms | 14.874 ms | 54.12 ms | 1.00 | 0.00 |
| IndexDocumentsBenchmark | .NET 8.0 | .NET 8.0 | Default | Default | Default | 10 | 42.88 ms | 4.736 ms | 13.964 ms | 38.17 ms | 0.95 | 0.50 |
I am going to do a check with the TestIndexWriterOnJRECrash test to try to confirm this is a fix for it. There is an issue with the test itself (something about the system property with the temp file name not being passed through) that I am working on a fix for now.
Looks like it still fails on Windows sometimes. I think it is more robust than it was, though. I had to try with [Repeat(25)] to get it to fail. I will try in Azure DevOps with the attribute to see whether it will fail on Linux or macOS.
TestNRTThreads_Mem
Source: TestIndexWriterOnJRECrash.cs line 51
Duration: 34.3 sec
Message:
Lucene.Net.Util.LuceneSystemException : CheckIndex failed
To reproduce this test result:
Option 1:
Apply the following assembly-level attributes:
[assembly: Lucene.Net.Util.RandomSeed("0x58551ebf9b5e10f6")]
[assembly: NUnit.Framework.SetCulture("en-AG")]
Option 2:
Use the following .runsettings file:
<RunSettings>
<TestRunParameters>
<Parameter name="tests:seed" value="0x58551ebf9b5e10f6" />
<Parameter name="tests:culture" value="en-AG" />
</TestRunParameters>
</RunSettings>
See the .runsettings documentation at: https://docs.microsoft.com/en-us/visualstudio/test/configure-unit-tests-by-using-a-dot-runsettings-file.
Stack Trace:
TestUtil.CheckIndex(Directory dir, Boolean crossCheckTermVectors) line 177
TestUtil.CheckIndex(Directory dir) line 162
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 322
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 329
TestIndexWriterOnJRECrash.CheckIndexes(FileSystemInfo file) line 329
TestIndexWriterOnJRECrash.TestNRTThreads_Mem() line 88
InvokeStub_TestIndexWriterOnJRECrash.TestNRTThreads_Mem(Object, Object, IntPtr*)
MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr)
Standard Output:
CheckIndex failed
Segments file=segments_2 numSegments=3 version=4.8 format=
1 of 3: name=_9 docCount=162
codec=Lucene46
compound=True
numFiles=4
size (MB)=0.34093189239501953
diagnostics = {source=merge, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:05 am, mergeMaxNumSegments=-1, mergeFactor=5}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [32 deleted docs]
test: fields..............OK [26 fields]
test: field norms.........OK [21 fields]
test: terms, freq, prox...OK [6532 terms; 9562 terms/docs pairs; 12115 tokens]
test (ignoring deletes): terms, freq, prox...OK [7799 terms; 11823 terms/docs pairs; 14862 tokens]
test: stored fields.......OK [660 total field count; avg 5.076923 fields per doc]
test: term vectors........OK [265 total vector count; avg 2.0384614 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
2 of 3: name=_8 docCount=56
codec=Lucene46
compound=False
numFiles=33
size (MB)=0.18136310577392578
diagnostics = {source=flush, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:06 am}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [2 deleted docs]
test: fields..............OK [13 fields]
test: field norms.........OK [8 fields]
test: terms, freq, prox...OK [3962 terms; 4916 terms/docs pairs; 7363 tokens]
test (ignoring deletes): terms, freq, prox...OK [4035 terms; 5024 terms/docs pairs; 7480 tokens]
test: stored fields.......OK [282 total field count; avg 5.2222223 fields per doc]
test: term vectors........OK [112 total vector count; avg 2.074074 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
3 of 3: name=_7 docCount=44
codec=Lucene46
compound=False
numFiles=31
size (MB)=0.5951318740844727
diagnostics = {source=flush, lucene.version=4.8.0, os=Microsoft Windows, os.arch=X64, os.version=10.0.19045, java.version=8.0.1, java.vendor=Microsoft, timestamp=15/05/2024 10:17:06 am}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [1 deleted docs]
test: fields..............OK [18 fields]
test: field norms.........OK [13 fields]
test: terms, freq, prox...ERROR: Lucene.Net.Diagnostics.AssertionException: buffer.length=190007630 count=2142165037
at Lucene.Net.Codecs.MockIntBlock.MockVariableInt32BlockPostingsFormat.MockInt32Factory.VariableInt32BlockIndexInputAnonymousClass.BlockReaderAnonymousClass.ReadBlock() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Codecs\MockIntBlock\MockVariableIntBlockPostingsFormat.cs:line 110
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.MaybeSeek() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 152
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.Next() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 160
at Lucene.Net.Codecs.Sep.SepPostingsReader.SepDocsAndPositionsEnum.NextDoc() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 561
at Lucene.Net.Index.CheckIndex.CheckFields(Fields fields, IBits liveDocs, Int32 maxDoc, FieldInfos fieldInfos, Boolean doPrint, Boolean isVectors, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1156
at Lucene.Net.Index.CheckIndex.TestPostings(AtomicReader reader, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1651
Lucene.Net.Diagnostics.AssertionException: buffer.length=190007630 count=2142165037
at Lucene.Net.Codecs.MockIntBlock.MockVariableInt32BlockPostingsFormat.MockInt32Factory.VariableInt32BlockIndexInputAnonymousClass.BlockReaderAnonymousClass.ReadBlock() in F:\Projects\lucenenet\src\Lucene.Net.TestFramework\Codecs\MockIntBlock\MockVariableIntBlockPostingsFormat.cs:line 110
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.MaybeSeek() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 152
at Lucene.Net.Codecs.IntBlock.VariableInt32BlockIndexInput.Reader.Next() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\VariableIntBlockIndexInput.cs:line 160
at Lucene.Net.Codecs.Sep.SepPostingsReader.SepDocsAndPositionsEnum.NextDoc() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 561
at Lucene.Net.Index.CheckIndex.CheckFields(Fields fields, IBits liveDocs, Int32 maxDoc, FieldInfos fieldInfos, Boolean doPrint, Boolean isVectors, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1156
at Lucene.Net.Index.CheckIndex.TestPostings(AtomicReader reader, TextWriter infoStream, Boolean verbose) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 1651
test: stored fields.......OK [221 total field count; avg 5.139535 fields per doc]
test: term vectors........ERROR [Exception of type 'Lucene.Net.Diagnostics.AssertionException' was thrown.]
Lucene.Net.Diagnostics.AssertionException: Exception of type 'Lucene.Net.Diagnostics.AssertionException' was thrown.
at Lucene.Net.Codecs.IntBlock.FixedInt32BlockIndexInput.Index.Read(DataInput indexIn, Boolean absolute) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\IntBlock\FixedIntBlockIndexInput.cs:line 177
at Lucene.Net.Codecs.Sep.SepPostingsReader.DecodeTerm(Int64[] empty, DataInput input, FieldInfo fieldInfo, BlockTermState termState, Boolean absolute) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\Sep\SepPostingsReader.cs:line 202
at Lucene.Net.Codecs.BlockTerms.BlockTermsReader.FieldReader.SegmentTermsEnum.DecodeMetaData() in F:\Projects\lucenenet\src\Lucene.Net.Codecs\BlockTerms\BlockTermsReader.cs:line 986
at Lucene.Net.Codecs.BlockTerms.BlockTermsReader.FieldReader.SegmentTermsEnum.DocsAndPositions(IBits liveDocs, DocsAndPositionsEnum reuse, DocsAndPositionsFlags flags) in F:\Projects\lucenenet\src\Lucene.Net.Codecs\BlockTerms\BlockTermsReader.cs:line 798
at Lucene.Net.Index.TermsEnum.DocsAndPositions(IBits liveDocs, DocsAndPositionsEnum reuse) in F:\Projects\lucenenet\src\Lucene.Net\Index\TermsEnum.cs:line 242
at Lucene.Net.Index.CheckIndex.TestTermVectors(AtomicReader reader, TextWriter infoStream, Boolean verbose, Boolean crossCheckTermVectors) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 2146
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 0 NUMERIC; 1 SORTED; 0 SORTED_SET]
FAILED
WARNING: fixIndex() would remove reference to this segment; full exception:
Lucene.Net.Util.LuceneSystemException: Term Index test failed
at Lucene.Net.Index.CheckIndex.DoCheckIndex(IList`1 onlySegments) in F:\Projects\lucenenet\src\Lucene.Net\Index\CheckIndex.cs:line 848
WARNING: 1 broken segments (containing 43 documents) detected
Well, the good news is that this seems to be 100% stable on .NET Framework.
The bad news is that it is not performing as well on .NET Core. I attempted to swap the m_staleFiles set with new JCG.HashSet<T>().AsConcurrent() and it seemed to make Windows worse, but Linux and macOS better (NOTE: I rolled that change back). They still all randomly fail, but it is on the order of:
- Windows: Once every ~150 power failures
- Linux: Once every ~350 power failures
- macOS: Once every ~500 power failures
This is still a huge improvement over 4.8.0-beta00016, which would get a corrupt index about once every ~30 power failures.
Here are the results of the runs with the swapped hashset implementation with [Repeat(25)] (which runs the test 250 times, since the original test runs it 10 times):
https://dev.azure.com/shad0962/Experiments/_build/results?buildId=2343&view=results https://dev.azure.com/LuceneNET-Temp/Lucene.NET/_build/results?buildId=2257&view=results https://dev.azure.com/lucene-net-temp2/Lucene.NET/_build/results?buildId=1014&view=results https://dev.azure.com/lucene-net-temp3/Lucene.NET/_build/results?buildId=850&view=results https://dev.azure.com/lucene-net-temp4/Main/_build/results?buildId=273&view=results
Note that I left the commit with the [Repeat(25)] attribute on this branch. It should be reverted before merging to master to restore the original behavior.
I also improved the TestIndexWriterOnJRECrash test by using a socket instead of a temp file to transfer the process id to kill from the forked test to back to the original process. The original Java test crashed the JRE from the inside using a low level function, but .NET doesn't allow killing the current process from the inside so we needed 2-way communication between the 2 test processes.
Very nice! Do you think that maybe the thread-unsafety of FSDirectory (see my comment above about the race conditions) could be causing that? And maybe if .NET 5+ performs better than .NET FX, that is exacerbating it? I'll look at improving the thread safety here next and see if that can have better results. Because one very real possibility in this code (and AFAICT Java Lucene 4.8 too) under heavy load is that an fsync/FlushFileBuffers gets missed because of thread races.
Very nice! Do you think that maybe the thread-unsafety of FSDirectory (see my comment above about the race conditions) could be causing that? And maybe if .NET 5+ performs better than .NET FX, that is exacerbating it? I'll look at improving the thread safety here next and see if that can have better results. Because one very real possibility in this code (and AFAICT Java Lucene 4.8 too) under heavy load is that an fsync/FlushFileBuffers gets missed because of thread races.
Now that you mention it, it is possible that is the cause, (or a contributor) since we could be Fsyncing a partially written buffer to disk. While building a production quality implementation may take some time, if we locked all access to write, flush, and sync on all files, it would make testing the scenario pretty simple. I will set that up and run the check.
@NightOwl888 I just pushed up a change that synchronizes access to the HashSet, let me know what you think.
The performance is even better with this new approach. Of course some heavily multi-threaded writes/syncs might see a hit, but at least it's more thread-safe now.
| Method | Job | Runtime | IterationCount | LaunchCount | WarmupCount | N | Mean | Error | StdDev | Ratio | RatioSD |
|------------------------ |--------- |--------- |--------------- |------------ |------------ |--- |---------:|----------:|---------:|------:|--------:|
| IndexDocumentsBenchmark | .NET 7.0 | .NET 7.0 | Default | Default | Default | 10 | 32.06 ms | 0.771 ms | 2.135 ms | 1.00 | 0.00 |
| IndexDocumentsBenchmark | .NET 8.0 | .NET 8.0 | Default | Default | Default | 10 | 31.57 ms | 0.781 ms | 2.189 ms | 0.99 | 0.09 |
@paulirwin @NightOwl888 Is this moving forward? Do we need to wait for this to get merged in order to use .NET 8 as per https://github.com/apache/lucenenet/issues/933?