
Searcher treats `^` as literal

Open OrangeDog opened this issue 9 years ago • 7 comments

Because a searcher is constructed by prefixing each pattern with `.*`, a `^` at the start of a pattern is treated as a literal character instead of a start-of-line anchor.

OrangeDog avatar Apr 18 '16 11:04 OrangeDog

I've added a fix to my fork (https://github.com/neilireson/multiregexp): if the pattern starts with one of the specified exceptions (i.e. ".*" or "^"), the prefix is not added.
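As a rough illustration of that rule (not the fork's actual code), the prefixing step could look like the sketch below. The class and method names are hypothetical, and stripping the leading `^` rests on the assumption that the underlying automaton always matches from the start of the input.

```java
public class SearchPrefix {
    /**
     * Hypothetical helper: turn a user pattern into a "match anywhere" pattern,
     * skipping the ".*" prefix for the two exceptions named above.
     */
    static String toSearchPattern(String pattern) {
        if (pattern.startsWith("^")) {
            // Assumption: the automaton matches from the start of the input,
            // so dropping the "^" leaves the pattern anchored there.
            return pattern.substring(1);
        }
        if (pattern.startsWith(".*")) {
            // Already matches anywhere; adding another ".*" would be redundant.
            return pattern;
        }
        return ".*" + pattern;
    }

    public static void main(String[] args) {
        System.out.println(toSearchPattern("foo"));   // .*foo
        System.out.println(toSearchPattern("^foo"));  // foo
        System.out.println(toSearchPattern(".*foo")); // .*foo
    }
}
```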

However, the fork also contains a raft of other changes. These are mainly optimisations, as I'm trying to get multiregexp to work with 20,000+ patterns. The base functionality is (or should be) the same, since all the existing methods should default to their previous behaviour. The only exception is that I'm using a multithreaded make to build the MultiPatternAutomaton.

neilireson avatar Apr 19 '16 11:04 neilireson

@neilireson Interesting.

When I had to deal with many patterns, I just grouped them into packs of 50 or so. 20,000+ sounds like a gigantic DFA after the powerset operation! Is it working all right? Also, if you have patterns that are really just plain strings rather than regular expressions, it might be worth treating them separately with an Aho-Corasick implementation.
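A minimal sketch of both ideas, grouping patterns into fixed-size packs and picking out the ones that are plain strings. The names are hypothetical, and the metacharacter list is an assumption that would need adjusting to whatever regex syntax the automaton library actually accepts.

```java
import java.util.ArrayList;
import java.util.List;

public class PatternBatches {
    // Assumed set of regex metacharacters; adjust for the syntax actually in use.
    private static final String META = "\\.[]{}()*+?^$|";

    /** Split patterns into consecutive groups of at most batchSize elements. */
    static List<List<String>> batches(List<String> patterns, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < patterns.size(); start += batchSize) {
            int end = Math.min(start + batchSize, patterns.size());
            result.add(new ArrayList<>(patterns.subList(start, end)));
        }
        return result;
    }

    /** Rough check: true if the pattern contains none of the assumed metacharacters. */
    static boolean isPlainString(String pattern) {
        return pattern.chars().noneMatch(c -> META.indexOf(c) >= 0);
    }

    public static void main(String[] args) {
        List<String> patterns = List.of("a", "b", "c", "d", "e");
        System.out.println(batches(patterns, 2));          // [[a, b], [c, d], [e]]
        System.out.println(isPlainString("John Smith"));   // true
        System.out.println(isPlainString("word[a-z]*"));   // false
    }
}
```

Each pack would then get its own automaton, and the plain strings could go to a separate string-matching structure.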

fulmicoton avatar Apr 19 '16 13:04 fulmicoton

@OrangeDog Oh yes, this is a valid point. If you guys have working code for this, I'd welcome a pull request.

fulmicoton avatar Apr 19 '16 13:04 fulmicoton

Firstly, thanks very much for providing this code, it's very cool.

OK, I've added all my current commits to the pull request. To be honest, I've been using SVN for years, so I'm new to the Git world. Let me know if I need to do anything else.

neilireson avatar Apr 19 '16 14:04 neilireson

I could use Aho-Corasick, do you think it would be faster?

Multiregexp offers some advantages. One use case is person names, where I use the patterns " Smith ", " John Smith ", " David Smith ", ... I then have a disambiguation process where everyone would match "Smith", but "David Smith" only matches the Davids. I also have generic patterns such as " .* Smith", which enable me to check for names outside my dictionary (e.g. "Fred Smith").

If Aho-Corasick would be faster, I could use a combination of the two approaches.

I also use it for words with multiple suffixes, e.g. "word[a-z]* ", but I could probably just enumerate all the possibilities.

neilireson avatar Apr 19 '16 14:04 neilireson

TBH I'm just using java.util.regex.Pattern now, as there are fewer surprises.

```java
Pattern.compile(patterns.stream()
    .map(s -> "(?:" + s + ")")
    .collect(Collectors.joining("|")))
```
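For reference, a self-contained version of that snippet might look like the following. The class name, the `combine` helper and the sample patterns are made up for illustration. It shows that `java.util.regex` keeps `^` as an anchor inside the alternation.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class CombinedPattern {
    /** Join all patterns into one alternation, each wrapped in a non-capturing group. */
    static Pattern combine(List<String> patterns) {
        return Pattern.compile(patterns.stream()
            .map(s -> "(?:" + s + ")")
            .collect(Collectors.joining("|")));
    }

    public static void main(String[] args) {
        Pattern p = combine(List.of("^foo", "bar"));
        System.out.println(p.pattern());               // (?:^foo)|(?:bar)
        System.out.println(p.matcher("foo x").find()); // true: "foo" is at the start
        System.out.println(p.matcher("x foo").find()); // false: "^" still anchors
        System.out.println(p.matcher("x bar").find()); // true: "bar" may appear anywhere
    }
}
```

Unlike multiregexp, a single alternation like this does not report which of the original patterns matched.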

OrangeDog avatar Apr 19 '16 15:04 OrangeDog

@OrangeDog thanks for reporting the issue anyway :) This is very helpful.

fulmicoton avatar Apr 19 '16 23:04 fulmicoton