lucene-solr
lucene-solr copied to clipboard
LUCENE-8985: SynonymGraphFilter cannot handle input stream with tokens filtered.
Description
SynonymGraphFilter cannot handle input stream with tokens filtered.
Solution
Original implementation BufferedInputToken doesn't track startNode, endNode. I replace BufferedInputToken & BufferedOutputToken with BufferedToken, and track startNode & endNode in every Buffered Token. use new variables to track previous input startNode, previous output StartNode, and delta of StartNode (If we add synonyms, the next token's startNode will change.).
Tests
TestSynonymGraphFilter.testWithStopwordGaps
Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I am authorized to contribute this code to the ASF and have removed any code I do not have a license to distribute.
- [x] I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the
masterbranch. - [x] I have run
ant precommitand the appropriate test suite. - [x] I have added tests for my changes.
- [x] I have added documentation for the Ref Guide (for Solr changes only).
wait a minute, I add more testcases. It seems my patch is not correct now. I will spend more time to solve it.
Waiting for you to update the PR with more test cases, is that correct @chenkovsky ?
@janhoy yes.
Would like to get this into 8.3, and release process starts in one week. @chenkovsky, do you have time to wrap it up?
Also, please add an entry in lucene/CHANGES.txt to this PR
@mikemccand or @msokolov are you available for quick review?
@janhoy OK. I know where's my mistake now. I will submmit new patch today.
@janhoy I have submitted the code, and passed all testcases. A small difference compared to original implementation is that I will capture all tokens(original implementation won't capture token if the first token is not matched. I think it will make code lengthy and more fragile, but get little performance improvement. so I haven't implemented it. If you think it's important, I can change my code to implement it).
Hmm, your commit 867cbeff010ff0febbbd258368a58ddecc770b93 labeled "bug fix" is changing a ton of code and changing this PR from a quite simple low-risk patch into a more complex one.
And you disabled a test, I don't know the consequences of that, are you saying the new patch has a performance hit?
I have not attempted to review the patch (I don't even know the code) but invited two committers to review, so far no response. Until then it would be helpful if you could clean up any whitespace-only edits, and other things that would make it easier to understand your changes.
I'll leave further review to the experts :)
I have changed the code, and enabled the capture num test. now the capture behavior is same as the original implementation. actually no performance impact.
@mikemccand I agree that separating refactoring code into another pull request is better. actually I tried it at the beginning. But I find that I have to pay much more effort to reach this aim. I find that the old code is trying to maintain the state of looking ahead tokens in a complex way. the bug hides in this complex way(a lot of liveToken if else statement), I cannot tell if I fixed the bug, are there any more bugs. So I simplify the implementation.
is SynonymGraphFilter also broken if a synonym contains a stopword? I think (not sure) the old SynonymFilter could handle this case?
@mikemccand I tested some cases, the new code behaves same to old code. could you please give me an example, I can add it to testcase.
To move this forward, perhaps adding more unit test examples to the PR is a good way. I think what Mike meant was a test that have a synonym with a stopword, e.g. RTFM => read the fine manual where "the" is a stopword and should leave a hole.
@chenkovsky I'd like to help move this forward, and I think the next step is to add 5-10 more unit tests that prove that behavior is not broken in unpredictable ways.
@janhoy I added more tests. I can add more, if you think it's necessary. I find that FlattenGraphFilter also removes holes. for example, the sentence 'the the usa', we filter 'the'. for word 'usa', the posIncr of SynonymGraphFilter is 3. but posIncr of FlattenGraphFilter is 2. does it match expectation? if not, I think we should also fix FlattenGraphFilter.