multiregexp
Searcher treats `^` as literal
Because a searcher is constructed by prefixing each pattern with `.*`, a leading `^` in a pattern is treated as a literal character instead of a start-of-line anchor.
I've added a fix to my fork (https://github.com/neilireson/multiregexp): if the pattern starts with one of the specified exceptions (i.e. `.*` or `^`), the prefix is not added.
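For illustration, the prefix rule described above could be sketched roughly as follows. The names `SearchPrefix` and `withSearchPrefix` are made up for this example and are not taken from the fork. The sketch only shows the string-level check; how the automaton then handles a leading `^` is not covered here.

```java
import java.util.List;

class SearchPrefix {
    // Assumption: the "specified exceptions" are exactly these two prefixes.
    static final List<String> EXCEPTIONS = List.of(".*", "^");

    // Hypothetical helper: add the ".*" search prefix unless the pattern
    // already starts with one of the exceptions.
    static String withSearchPrefix(String pattern) {
        for (String exception : EXCEPTIONS) {
            if (pattern.startsWith(exception)) {
                return pattern; // leave the pattern untouched
            }
        }
        return ".*" + pattern;
    }

    public static void main(String[] args) {
        System.out.println(withSearchPrefix("abc"));   // prints .*abc
        System.out.println(withSearchPrefix("^abc"));  // prints ^abc
        System.out.println(withSearchPrefix(".*abc")); // prints .*abc
    }
}
```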
However, the fork also contains a raft of other changes. These are mainly optimisations, as I'm trying to get multiregexp to work with 20,000+ patterns. The base functionality is (or should be) the same, since all the existing methods default to their previous behaviour. The only exception is that I'm using a multithreaded make to build the MultiPatternAutomaton.
@neilireson Interesting.
When I had to deal with many patterns, I just grouped them in packs of 50 or so. 20,000+ sounds like a gigantic DFA after the powerset construction! Is it working alright? Also, if some of your patterns are really just plain strings rather than regexes, it might be worth handling them separately with an Aho-Corasick implementation.
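A rough sketch of both ideas, in case it helps: splitting the pattern list into fixed-size batches, and a simple heuristic for spotting patterns that are really plain strings. `PatternBatches`, `partition` and `isPlainString` are hypothetical names, and the metacharacter set is an assumption; the automaton library's regex syntax may reserve more characters than the ones listed here.

```java
import java.util.ArrayList;
import java.util.List;

class PatternBatches {
    // Split items into consecutive batches of at most `size` elements.
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // Heuristic (assumption): a pattern with none of these characters is a
    // plain string and could go to a string matcher instead of the automaton.
    static boolean isPlainString(String pattern) {
        return pattern.chars().noneMatch(c -> "\\[](){}.*+?^$|".indexOf(c) >= 0);
    }

    public static void main(String[] args) {
        List<String> patterns = List.of(" John Smith ", " .* Smith", "word[a-z]* ");
        System.out.println(partition(patterns, 2).size()); // prints 2
        for (String p : patterns) {
            System.out.println("'" + p + "' plain string: " + isPlainString(p));
        }
    }
}
```

Each batch would then be compiled into its own automaton, and the plain strings collected for a separate matcher.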
@OrangeDog Oh yes, this is a valid point. If you guys have working code for this, I welcome a pull request.
Firstly, thanks very much for providing this code, it's very cool.
OK, I've added all my current commits to the pull request. To be honest, I've been using SVN for years, so I'm new to the Git world. Let me know if I need to do anything else.
I could use Aho-Corasick; do you think it would be faster?
Multiregexp offers some advantages. One use case is person names, where I use the patterns " Smith ", " John Smith ", " David Smith ", ... I then have a disambiguation process where everyone would match "Smith", but "David Smith" only matches the Davids. I also have generic patterns such as " .* Smith", which enable me to check for names outside my dictionary (e.g. "Fred Smith").
If Aho-Corasick would be faster, I could use a combination of the two approaches.
I also use it for words with multiple suffixes, e.g. "word[a-z]* ", but I could probably just enumerate all the possibilities.
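To make the name use case concrete, here is a small sketch that reports which patterns match a given text. It uses `java.util.regex` as a stand-in rather than multiregexp, so it checks the patterns one at a time instead of in a single pass; `NameMatches` and `matchingIndexes` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NameMatches {
    // Return the indexes of all patterns that occur somewhere in the text.
    static List<Integer> matchingIndexes(List<String> patterns, String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (Pattern.compile(patterns.get(i)).matcher(text).find()) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> patterns =
                List.of(" Smith ", " John Smith ", " David Smith ", " .* Smith");
        // A name in the dictionary: surname, full name and generic pattern match.
        System.out.println(matchingIndexes(patterns, " David Smith ")); // prints [0, 2, 3]
        // A name outside the dictionary: only surname and generic pattern match.
        System.out.println(matchingIndexes(patterns, " Fred Smith "));  // prints [0, 3]
    }
}
```

The disambiguation step would then work on the returned indexes, for example preferring the full-name pattern over the bare surname.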
TBH I'm just using java.util.regex.Pattern now, as there are fewer surprises.
```java
Pattern.compile(patterns.stream()
        .map(s -> "(?:" + s + ")")
        .collect(Collectors.joining("|")))
```
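For anyone wanting to try this, a self-contained version of the same idea might look like the sketch below. `CombinedPattern` and `combine` are names invented for the example, and the two sample patterns are made up. With `java.util.regex`, a leading `^` inside an alternative keeps its anchor meaning.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CombinedPattern {
    // Wrap each pattern in a non-capturing group and join them with "|".
    static Pattern combine(List<String> patterns) {
        return Pattern.compile(patterns.stream()
                .map(s -> "(?:" + s + ")")
                .collect(Collectors.joining("|")));
    }

    public static void main(String[] args) {
        Pattern combined = combine(List.of("^foo", "bar\\d+"));

        // "^foo" is anchored, so "foo" in the middle of the text is skipped.
        Matcher m = combined.matcher("xfoo bar12");
        while (m.find()) {
            System.out.println(m.group()); // prints bar12
        }

        System.out.println(combined.matcher("foo").find());  // prints true
        System.out.println(combined.matcher("xfoo").find()); // prints false
    }
}
```

Note that a single combined pattern reports that something matched, but not which of the original patterns it was, unless each alternative is given its own capturing group.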
@OrangeDog thanks for reporting the issue anyway :) This is very helpful.