
TokenStream.IncrementToken() is called after Dispose() is called

Open clambertus opened this issue 4 years ago • 6 comments

When overriding Dispose(bool) in a TokenStream subclass, Lucene.Net will call Dispose() before it is finished using the TokenStream, and will then call IncrementToken() again.

 

The behavior can be observed in the Lucene.Net.Collation.TestICUCollationKeyFilter.TestCollationKeySort() test when ICUCollationKeyFilter implements Dispose().
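A minimal way to reproduce this is a hypothetical pass-through filter (TrackingFilter below is illustrative, not part of the codebase) that records when Dispose(bool) runs:

using System;
using Lucene.Net.Analysis;

// Hypothetical filter used only to demonstrate the issue: it records when
// Dispose(bool) runs and fails loudly if the stream is used afterwards.
public sealed class TrackingFilter : TokenFilter
{
    private bool disposed;

    public TrackingFilter(TokenStream input)
        : base(input)
    {
    }

    public override bool IncrementToken()
    {
        // In the scenario reported above, this throws because Lucene.Net
        // calls IncrementToken() again after Dispose() has already run.
        if (disposed)
            throw new ObjectDisposedException(nameof(TrackingFilter));
        return m_input.IncrementToken();
    }

    protected override void Dispose(bool disposing)
    {
        disposed = true;
        base.Dispose(disposing);
    }
}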

JIRA link - [LUCENENET-611] created by nightowl888

clambertus avatar Jul 13 '19 02:07 clambertus

It is possible that this issue was due to a bug in Lucene.Net.TestFramework that has been addressed already. We need to confirm that it was the test framework that was errantly causing the double-call, not a feature of Lucene.NET.

Also, the ICUCollationKeyFilter no longer implements Dispose() - that was when we were referencing icu-dotnet (which had unmanaged resources) instead of ICU4N. However, the behavior can be observed in any existing disposable TokenStream or by overriding Dispose(bool) in one that isn't disposable.

by nightowl888

clambertus avatar Apr 09 '20 21:04 clambertus

I looked into this a little, and it appears that the bug that was fixed in the test framework didn't solve this. When adding a Dispose(bool) override to the ICUCollationKeyFilter, Dispose(true) is still being called followed by additional calls to IncrementToken().

This is occurring in DocFieldProcessor, which runs a tight loop over the document's fields. It is possible for the consumer to dispose the TokenFilter in one iteration, and for a later iteration to retrieve the same instance of that TokenFilter by calling Field.GetTokenStream(Analyzer) and call IncrementToken() on it after it is disposed.
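In rough pseudocode (simplified; the real code in DocFieldProcessor is more involved, and the fields collection is illustrative), the loop looks like this:

// Simplified sketch of the indexing loop described above.
foreach (var field in fields) // hypothetical field collection
{
    TokenStream stream = field.GetTokenStream(analyzer); // may return a reused instance
    stream.Reset();
    while (stream.IncrementToken()) { /* index the token */ }
    stream.End();
    stream.Dispose(); // a later iteration can receive this same instance
                      // back and call IncrementToken() on it again
}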

At this point I haven't confirmed this is how it works in Java, so it is still unclear whether this is expected behavior or a bug. If the former, we should consider making TokenStream's "close" a separate operation from "dispose" (see #265).

NightOwl888 avatar Jul 11 '20 21:07 NightOwl888

I started by trying to solve the two TestRandomChains failures (see #269), but it ended up unraveling back to this issue again.

I isolated a specific case that fails in .NET:

[Test]
public void TestWordDelimiterFilterChain()
{
    Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
    {
        Tokenizer tokenizer = new NGramTokenizer(LuceneVersion.LUCENE_48, reader, 70, 79);
        TokenStream filter = tokenizer;
        filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter, WordDelimiterFlags.CATENATE_NUMBERS | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE, new CharArraySet(LuceneVersion.LUCENE_48, new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" }, ignoreCase: true));
        filter = new ShingleFilter(filter, "EMAIL");
        return new TokenStreamComponents(tokenizer, filter);
    });

    string text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
    bool useCharFilter = false;
    bool offsetsAreCorrect = true;

    long seed = 0x21773254b5ac8fa0L;
    Random random = new J2N.Randomizer(seed);


    try
    {
        CheckAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
    }
    catch (Exception e) when (e.IsThrowable())
    {
        Console.WriteLine("Exception from random analyzer: " + a);
        throw;
    }
}

And after translating it to Java, it still fails:

public void testWordDelimiterFilterChain() throws Throwable {
  Analyzer a = new Analyzer() {
    public TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer  = new org.apache.lucene.analysis.ngram.NGramTokenizer(Version.LUCENE_48, reader, 70, 79);
      TokenStream filter = tokenizer;
      filter = new WordDelimiterFilter(Version.LUCENE_48, filter, WordDelimiterFilter.CATENATE_NUMBERS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE, new CharArraySet(Version.LUCENE_48, asSet("htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh"), /*ignoreCase:*/ true));
      filter = new org.apache.lucene.analysis.shingle.ShingleFilter(filter, "EMAIL");
      
      return new TokenStreamComponents(tokenizer, filter);
    }
  };
  
  String text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
  boolean useCharFilter = false;
  boolean offsetsAreCorrect = false;
  
  long seed = 0x21773254b5ac8fa0L;
  java.util.Random random = new java.util.Random(seed);
  
  
  try
  {
      checkAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
  }
  catch (Throwable e)
  {
      System.out.println("Exception from random analyzer: " + a);
      throw e;
  } 
 }

I discovered this is primarily due to the fact that some flags are not supported during indexing (https://stackoverflow.com/a/66478863/), and I confirmed that removing the WordDelimiterFlags.CATENATE_NUMBERS flag makes the test pass.
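For reference, the variant that passes changes only the flags argument in the chain above:

// Same chain as in the test above, minus CATENATE_NUMBERS; with this
// change the test passes.
filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter,
    WordDelimiterFlags.SPLIT_ON_CASE_CHANGE, // CATENATE_NUMBERS removed
    new CharArraySet(LuceneVersion.LUCENE_48,
        new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" },
        ignoreCase: true));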

In TestRandomChains, WordDelimiterFilter is added to the brokenOffsetsConstructors dictionary. This causes offsetsAreCorrect to be set to false, which BaseTokenStreamTestCase.CheckAnalysisConsistency is supposed to handle. BaseTokenStreamTestCase has been changed from its original implementation to include try-finally and using blocks (both in CheckAnalysisConsistency() and in AssertTokenStreamContents()), and that is where we are running into trouble.

Long story short, this circles back to the fact that we converted Java's Closeable interface on TokenStream to IDisposable in .NET. But TokenStream's close() method has preconditions that must be met, or it will throw an exception. That exception doesn't play well when you are trying to close open TextReaders in a finally block. Before these finally blocks were put into place to call Dispose(), one failing analysis test would cause nearly all of them to fail.
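The failure mode is the classic cleanup-exception problem; roughly:

TokenStream ts = analyzer.GetTokenStream("field", new StringReader(text));
try
{
    ts.Reset();
    // ... an assertion fails somewhere in here ...
}
finally
{
    // If Dispose() enforces workflow preconditions and throws here, the
    // new exception replaces the original assertion failure, and the
    // underlying TextReader may never get closed.
    ts.Dispose();
}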

Since most of Lucene's tests do not call close() in a finally block, I have long suspected that something automatic is calling these methods, and after a bit of research, I found that Closeable extends AutoCloseable, which allows instances to be closed implicitly (for example, via try-with-resources). But it is clear this is going to take more research to work out how it works in Lucene and the best way to translate it to .NET in a reasonable way.

Some Options to Consider

  1. Add a Close() method that has the precondition checks and use Dispose() to only do the cleanup.
  2. In Dispose(), only suppress the finalizer in the case where the preconditions pass, but rely on a finalizer in cases where it fails.
private bool isValid;

~TokenStream() // Only called if we don't meet our preconditions
{
    Dispose(false);
}

public void Dispose()
{
    Dispose(true);
    GC.SuppressFinalize(this);
}

protected virtual void Dispose(bool disposing)
{
    if (disposing)
    {
        // Validate(out string) is a hypothetical helper that checks the
        // TokenStream workflow preconditions.
        isValid = Validate(out string reason);
        if (!isValid)
            throw new InvalidOperationException(reason); // exception short-circuits the call to GC.SuppressFinalize

        DoDispose();
    }
    else if (!isValid)
    {
        // Finalizer path: clean up even though the preconditions failed.
        DoDispose();
    }
}

internal void DoDispose()
{
    // Dispose file handles here...
}

Of course, that means we need to wait for the finalizer to execute before beginning the next test. From the finalizer docs:

The programmer has no control over when the finalizer is called; the garbage collector decides when to call it. The garbage collector checks for objects that are no longer being used by the application. If it considers an object eligible for finalization, it calls the finalizer (if any) and reclaims the memory used to store the object. It's possible to force garbage collection by calling Collect, but most of the time, this call should be avoided because it may create performance issues.

Options for fixing the tests:

  1. Call GC.Collect() in a catch block to ensure the finalizer gets called first (sketched below).
  2. Call DoDispose() in a catch block to ensure we dispose everything right away, without waiting for the GC.
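A rough sketch of option 1, reusing the shape of the test above (GC.WaitForPendingFinalizers() is needed in addition to GC.Collect() so the finalizer actually finishes before the next test starts):

try
{
    CheckAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
}
catch (Exception e) when (e.IsThrowable())
{
    // Run pending TokenStream finalizers now so any open file handles
    // are released before the next test begins.
    GC.Collect();
    GC.WaitForPendingFinalizers();
    throw;
}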

Since the preconditions are only meant to prevent programming bugs that aren't likely to make it into a production application, it seems reasonable to rely on a finalizer as a fallback for this edge case.

Further Consideration

Since Lucene may call Close() on a TokenStream and then Reset() to start the process all over again, we should separate the concept of Close() from Dispose(). Whether we actually need a Dispose() method requires more analysis, to determine whether TokenStream is actually responsible for cleaning up file handles, and if not, where that responsibility lies. Since a TokenStream can effectively "reopen" after Close() is called, it makes sense to at least have a Close() method to signal that we are not actually disposing anything and can still use the TokenStream instance.

But we definitely need to revert BaseTokenStreamTestCase to be closer to its original form and remove all of the finally blocks that "correctly consume" the TokenStream, which are now causing other issues in the tests.

NightOwl888 avatar Nov 18 '21 16:11 NightOwl888

Throwing exceptions in Dispose() is bad form. See https://docs.microsoft.com/en-us/dotnet/fundamentals/code-analysis/quality-rules/ca1065 or https://docs.microsoft.com/en-us/dotnet/api/system.idisposable.dispose?view=net-6.0:

If an object's Dispose method is called more than once, the object must ignore all calls after the first one. The object must not throw an exception if its Dispose method is called multiple times. Instance methods other than Dispose can throw an ObjectDisposedException when resources are already disposed.
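For reference, the standard way to satisfy that part of the contract is an idempotence guard, something like:

private bool disposed;

protected override void Dispose(bool disposing)
{
    if (disposed)
        return; // ignore every call after the first, per the contract quoted above

    if (disposing)
    {
        // release managed resources here
    }

    disposed = true;
    base.Dispose(disposing);
}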

vvdb-architecture avatar Aug 05 '22 06:08 vvdb-architecture

@vvdb-architecture

Thanks. So that leaves the question - how do we enforce the TokenStream workflow contract in .NET?

The exception is meant as a development-time guide to ensure all of the requirements of the contract are followed. Once the contract is being followed, this exception will never occur at runtime. So the part the guidance is worried about - having to add an exception handler inside a finally block - will never actually need to happen.

One option I have considered is building a Roslyn code analyzer to enforce this contract at design time, but I am not sure how complex it would be to analyze user code, however it is structured, and verify that all of the methods are called in the right sequence and the right number of times. Users could build one or more abstractions that perform some of the TokenStream operations while leaving the others elsewhere, or the calls may sit in conditional parts of the code, for example.

Some of the rules, such as enforcing calls to base.Reset() and base.End() when they are overridden, would be easy to enforce, while others would take more work. The exception in the Dispose() method is definitely the simpler approach, but if we had a Roslyn code analyzer that worked reliably, it would speed up development by catching this sort of thing at design time instead of at runtime.

However, this still doesn't get us out of our current situation: in Lucene it is legal to call close() and then reset() to start the TokenStream over from the beginning, whereas Dispose() is meant to be the final call, after which nothing else may happen to the object. That is a violation of the Dispose() contract in .NET, so we need to dig into whether Close() makes sense here, if only to signal to consumers that it is not actually a Dispose() operation. Or maybe End() is supposed to signify that the end of the stream has been reached (after which Reset() is allowed) and Dispose() is supposed to signify the end of the whole operation. In that case, the tight loop should be refactored to never call Dispose() inside the loop, and instead run a final loop at the end that calls Dispose() on all of the TokenStreams, so that no method is ever called after Dispose(), as sketched below.
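A rough sketch of that refactoring (the fields collection and the bookkeeping are hypothetical, for illustration only):

var seen = new HashSet<TokenStream>(); // hypothetical bookkeeping for reused instances
try
{
    foreach (var field in fields) // hypothetical field collection
    {
        TokenStream ts = field.GetTokenStream(analyzer);
        seen.Add(ts); // a reused instance is only recorded once
        ts.Reset();
        while (ts.IncrementToken()) { /* index the token */ }
        ts.End();
        // no Dispose() here; the same instance may be handed out again
    }
}
finally
{
    foreach (var ts in seen)
        ts.Dispose(); // final pass: nothing runs after Dispose()
}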

According to this error message, it is supposed to be invalid to call reset() multiple times. If that is really the case, then the tight loop can be considered a bug that needs to be fixed, and we don't need to change Dispose() back to Close(). Although the bug may simply boil down to the fact that we added an extra call to Dispose() in .NET that shouldn't actually be there.

NightOwl888 avatar Nov 07 '22 05:11 NightOwl888

After some research, I have determined the following:

  1. Dispose() doesn't actually throw anything. The failure happens when attempting to set up the next field using the same TokenStream instance.
  2. The reason why we get cascade failures in the tests when enforcing the dispose call is that the current Dispose() implementation both calls TextReader.Dispose() (which is good) and sets the state to an ILLEGAL_STATE_READER (which is bad). Ideally, we only close the open file handles if there is an exception somewhere in the middle that is interfering with our calls and leave the other state alone so the next test can pass.
  3. The design of the close() method being called multiple times per TokenStream instance is intentional. They are designed to be reused according to the ReuseStrategy used by the Analyzer. There is some discussion of this in LUCENE-8651.
  4. The bit that loops through the fields and calls the TokenStream methods Reset(), IncrementToken(), End(), and Dispose() is all in DocInverterPerField (see the workflow sketch after this list).
  5. While Analyzer is IDisposable, it doesn't actually dispose any of its children (the state of the ReuseStrategy or any of its TokenStreamComponents), they are simply left to go out of scope.
  6. If we change Dispose() to Close() it only partially fixes the issue. The design doesn't allow for any long-lived or unmanaged state that is supposed to last the lifetime of the TokenStream instance. This makes it difficult to use anything that is expensive to set up in the implementation of a TokenStream. Something like that would need to be done in the Analyzer and passed into the TokenStream in the constructor in the call to CreateComponents(), which is not a clean separation of concerns.
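For reference, the per-field consumption in DocInverterPerField (item 4 above) follows the standard TokenStream workflow contract; roughly:

TokenStream ts = field.GetTokenStream(analyzer);
ts.Reset();                     // 1. prepare the (possibly reused) stream
while (ts.IncrementToken())     // 2. consume the tokens
{
    // read attributes here, e.g. ts.GetAttribute<ICharTermAttribute>()
}
ts.End();                       // 3. record end-of-stream state (final offset)
ts.Dispose();                   // 4. currently called per field in .NET, which
                                //    is the crux of this issue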

The proposed solution to fix these issues is:

  • [ ] Add a new interface ICloseable with a single method Close(). We clearly need to separate the concept of Close() (after which the instance can be reused) from Dispose() (where we are getting rid of the instance for good).
  • [ ] Add back the IOUtils.Close... methods using the ICloseable interface. These are used in some places in a finally block to close a TokenStream and reset its state to ILLEGAL_STATE_READER.
  • [ ] Change all current Tokenizer, TokenStream, and TokenFilter implementations from using Dispose(bool) to Close().
  • [ ] Add a Dispose(bool) method to the base Tokenizer, TokenStream, and TokenFilter implementation and possibly anything that needs to override it. The base implementation of Dispose() will differ from Close() only in that it will not use ILLEGAL_STATE_READER.
  • [ ] Track the disposed state of Tokenizer, TokenStream, and TokenFilter and throw ObjectDisposedException in all methods after Dispose() is called. This functionality will need to be disabled for the token stream tests, but we should have tests to verify it works in other cases. (needs more consideration)
  • [ ] Make ReuseStrategy implement IDisposable and have Analyzer dispose it in its Dispose(bool) method.
  • [ ] Make TokenStreamComponents implement IDisposable and have each ReuseStrategy implementation call each instance of it in its Dispose(bool) method. It can do this by calling GetStoredValue(Analyzer), casting to the appropriate type, and calling Dispose() on all instances of TokenStreamComponents in the returned value.
  • [ ] Make TokenStreamComponents.Dispose(bool) cascade the call to its Tokenizer and TokenStream.
  • [ ] Change BaseTokenStreamTestCase to only call TokenStream.Dispose() in its finally block (a block that was added for .NET) rather than Reset(), IncrementToken(), End(), and Close(). This allows the test to dispose the open file handles without mutating the other state of the TokenStream that causes test failures.
  • Leave the existing runtime exceptions for misusing the TokenStream workflow contract or other rules in place. This ensures we don't need to make any changes to the tests.
  • [ ] Create Roslyn code analyzers to ensure that the basics, such as cascading calls to base.Reset(), base.End(), base.Close() and base.Dispose(bool), are enforced in subclasses. These should be set up to produce compile errors, not warnings. Ideally, they would be smart enough to identify when the user is delegating this responsibility to another class or method. Any gaps that we leave here will be caught by the existing Lucene runtime exceptions.

This neatly puts the lifetime of Tokenizer, TokenStream, and TokenFilter instances in sync with the Analyzer that creates them. So, if the user decides to create an Analyzer that has unmanaged resources to clean up, they simply need to change:

Analyzer analyzer = new MyAnalyzer();

to

using Analyzer analyzer = new MyAnalyzer();

Analyzer won't need to be disposed during normal usage, only for cases where there are expensive long-lived components to dispose.

I am toying with the idea of creating a DoClose() method that is shared between Close() and Dispose(), simply because cascading the call from Close() to Dispose(), or from Dispose() to Close(), violates the rules in both directions, yet we generally have common components to clean up in both methods. However, only those who need to implement their own Dispose() for long-lived components will have to deal with this duplication. In all other cases, the code will look almost exactly like it does in Lucene.
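A minimal sketch of how the shared DoClose() idea could look on TokenStream (the ICloseable interface and the exact split are proposals from this thread, not final API):

public interface ICloseable
{
    void Close();
}

public abstract class TokenStream : Lucene.Net.Util.AttributeSource, ICloseable, IDisposable
{
    public abstract bool IncrementToken(); // existing member; other members elided

    // Close() ends one use of the stream; Reset() may be called afterwards
    // to reuse the instance (per the Analyzer's ReuseStrategy).
    public virtual void Close()
    {
        DoClose();
        // per-use state (e.g. swapping in ILLEGAL_STATE_READER) would go here
    }

    // Dispose() ends the lifetime of the instance for good.
    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposing)
            DoClose(); // shared cleanup, but no ILLEGAL_STATE_READER swap
    }

    // Cleanup common to both paths, so neither method has to call the other.
    protected virtual void DoClose()
    {
        // close open TextReaders / file handles here
    }
}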

NightOwl888 avatar Nov 08 '22 09:11 NightOwl888