lucene-solr icon indicating copy to clipboard operation
lucene-solr copied to clipboard

LUCENE-8985: SynonymGraphFilter cannot handle input stream with tokens filtered.

Open chenkovsky opened this issue 6 years ago • 15 comments

Description

SynonymGraphFilter cannot handle input stream with tokens filtered.

Solution

Original implementation BufferedInputToken doesn't track startNode, endNode. I replace BufferedInputToken & BufferedOutputToken with BufferedToken, and track startNode & endNode in every Buffered Token. use new variables to track previous input startNode, previous output StartNode, and delta of StartNode (If we add synonyms, the next token's startNode will change.).

Tests

TestSynonymGraphFilter.testWithStopwordGaps

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I am authorized to contribute this code to the ASF and have removed any code I do not have a license to distribute.
  • [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the master branch.
  • [x] I have run ant precommit and the appropriate test suite.
  • [x] I have added tests for my changes.
  • [x] I have added documentation for the Ref Guide (for Solr changes only).

chenkovsky avatar Sep 21 '19 15:09 chenkovsky

wait a minute, I add more testcases. It seems my patch is not correct now. I will spend more time to solve it.

chenkovsky avatar Sep 22 '19 07:09 chenkovsky

Waiting for you to update the PR with more test cases, is that correct @chenkovsky ?

janhoy avatar Sep 23 '19 16:09 janhoy

@janhoy yes.

chenkovsky avatar Sep 24 '19 02:09 chenkovsky

Would like to get this into 8.3, and release process starts in one week. @chenkovsky, do you have time to wrap it up?

janhoy avatar Oct 07 '19 18:10 janhoy

Also, please add an entry in lucene/CHANGES.txt to this PR

janhoy avatar Oct 07 '19 18:10 janhoy

@mikemccand or @msokolov are you available for quick review?

janhoy avatar Oct 07 '19 18:10 janhoy

@janhoy OK. I know where's my mistake now. I will submmit new patch today.

chenkovsky avatar Oct 08 '19 06:10 chenkovsky

@janhoy I have submitted the code, and passed all testcases. A small difference compared to original implementation is that I will capture all tokens(original implementation won't capture token if the first token is not matched. I think it will make code lengthy and more fragile, but get little performance improvement. so I haven't implemented it. If you think it's important, I can change my code to implement it).

chenkovsky avatar Oct 09 '19 09:10 chenkovsky

Hmm, your commit 867cbeff010ff0febbbd258368a58ddecc770b93 labeled "bug fix" is changing a ton of code and changing this PR from a quite simple low-risk patch into a more complex one.

And you disabled a test, I don't know the consequences of that, are you saying the new patch has a performance hit?

I have not attempted to review the patch (I don't even know the code) but invited two committers to review, so far no response. Until then it would be helpful if you could clean up any whitespace-only edits, and other things that would make it easier to understand your changes.

I'll leave further review to the experts :)

janhoy avatar Oct 10 '19 21:10 janhoy

I have changed the code, and enabled the capture num test. now the capture behavior is same as the original implementation. actually no performance impact.

chenkovsky avatar Oct 11 '19 02:10 chenkovsky

@mikemccand I agree that separating refactoring code into another pull request is better. actually I tried it at the beginning. But I find that I have to pay much more effort to reach this aim. I find that the old code is trying to maintain the state of looking ahead tokens in a complex way. the bug hides in this complex way(a lot of liveToken if else statement), I cannot tell if I fixed the bug, are there any more bugs. So I simplify the implementation.

chenkovsky avatar Dec 12 '19 08:12 chenkovsky

is SynonymGraphFilter also broken if a synonym contains a stopword? I think (not sure) the old SynonymFilter could handle this case?

@mikemccand I tested some cases, the new code behaves same to old code. could you please give me an example, I can add it to testcase.

chenkovsky avatar Dec 12 '19 08:12 chenkovsky

To move this forward, perhaps adding more unit test examples to the PR is a good way. I think what Mike meant was a test that have a synonym with a stopword, e.g. RTFM => read the fine manual where "the" is a stopword and should leave a hole.

janhoy avatar Jan 22 '20 09:01 janhoy

@chenkovsky I'd like to help move this forward, and I think the next step is to add 5-10 more unit tests that prove that behavior is not broken in unpredictable ways.

janhoy avatar Apr 01 '20 12:04 janhoy

@janhoy I added more tests. I can add more, if you think it's necessary. I find that FlattenGraphFilter also removes holes. for example, the sentence 'the the usa', we filter 'the'. for word 'usa', the posIncr of SynonymGraphFilter is 3. but posIncr of FlattenGraphFilter is 2. does it match expectation? if not, I think we should also fix FlattenGraphFilter.

chenkovsky avatar Apr 04 '20 03:04 chenkovsky