aho-corasick icon indicating copy to clipboard operation
aho-corasick copied to clipboard

Fix leftMostLongestMatch returning all overlapping matches

Open semvis123 opened this issue 1 year ago • 2 comments

https://github.com/petar-dambovaliev/aho-corasick/pull/13 Caused overlapping patterns to return multiple matches. This is fixed by only using the position new calculation for matches that are being cancelled due to MatchOnlyWholeWords.

Edit: Just realized that this only accounts for the cases where the begin position is different. (so, this is not a complete fix) Example case where it will still fail:

func TestOverlappingPatterns4(t *testing.T) {
	trieBuilder := NewAhoCorasickBuilder(Opts{
		MatchOnlyWholeWords: true,
		MatchKind:           LeftMostLongestMatch,
		DFA:                 false,
	})

	patterns := []string{"testing", "testing 123"}

	trie := trieBuilder.Build(patterns)
	result := trie.FindAll("testing 12345")
	if len(result) != 1 {
		t.Logf("%v", result)
		t.Error("Did not find match in string")
		t.FailNow()
	}
}

semvis123 avatar Jul 27 '23 20:07 semvis123

After some thoughts I think that this isn't the way to fix the issue. I suspect that the issue is that the leftmostFindAtNoStateImp overrides the lastmatch with an invalid match (one that is not a whole word for example). This makes it impossible for the iterator to get the match before the invalid match.

Any ideas on how to fix this properly @petar-dambovaliev ?

semvis123 avatar Jul 31 '23 13:07 semvis123

After some thoughts I think that this isn't the way to fix the issue. I suspect that the issue is that the leftmostFindAtNoStateImp overrides the lastmatch with an invalid match (one that is not a whole word for example). This makes it impossible for the iterator to get the match before the invalid match.

Any ideas on how to fix this properly @petar-dambovaliev ?

I will take a look one of these weekends. I need to remember what i did there because i haven't touched this codebase since i wrote it 2 years ago.

peter7891 avatar Aug 01 '23 17:08 peter7891

@semvis123 it seems, the MatchOnlyWholeWords option is incompatible with the way LeftMostLongestMatch works. I'll probably need to disable them being used together or make it clear how it works. MatchOnlyWholeWords is not part of the matching algorithm, it is built on top and things are filtered after. I am tempted to remove MatchOnlyWholeWords

petar-dambovaliev avatar Apr 11 '24 09:04 petar-dambovaliev

I've added some docs explaining this. I'll need to think about a way forward. Thanks for bringing this up.

petar-dambovaliev avatar Apr 11 '24 10:04 petar-dambovaliev