regex-tdfa
regex-tdfa copied to clipboard
Fix the front anchor optimization in multiline mode
This pull request fixes issue #12.
replicate 1000000 'f' =~ "\\`g" :: Bool -- returns quickly
replicate 1000000 'f' =~ "^g" :: Bool -- returns slowly as it needs to scan for newlines
('g':replicate 1000000 'f') =~ "^g" :: Bool -- returns quickly, even though it could match multiple things, we only ask for a bool
('g':replicate 1000000 'f') =~ "^g" :: [[String]] -- returns slowly, as it needs to gather all possible matches
All of the standard test cases pass, and I did some ad-hoc testing at the repl. multiline/singleline modes, new/old syntax, and `/^ all seem to be working correctly together. Let me know if you want me to add them to the test cases formally before you accept the pull request.
It seemed like what happened was that originally tdfa had only both single-line and multi-line modes, with only the "^" anchor. Since in multi-line mode the entire string needed to be scanned for all newlines, the front anchored optimization was disabled in multi-line mode. But when the "`" anchor was added, the logic to set the front-anchor optimization wasn't updated.
Hopefully the code change itself is self-explanatory. Instead of "&&", I pass (multiline co) into isDFAFrontAnchored, and in there if it is in multi-line mode it only treats "`" as an anchor, otherwise, it treats both "`" and "^" as anchors.
Thanks for the PR!
Let me know if you want me to add them to the test cases formally
Yes, please add test cases that are not already subsumed by the existing test cases. Can you also add a test that demonstrates the performance improvement?
It seemed like what happened was that originally tdfa had only both single-line and multi-line modes, with only the "^" anchor. Since in multi-line mode the entire string needed to be scanned for all newlines, the front anchored optimization was disabled in multi-line mode. But when the "`" anchor was added, the logic to set the front-anchor optimization wasn't updated.
Hopefully the code change itself is self-explanatory. Instead of "&&", I pass (multiline co) into isDFAFrontAnchored, and in there if it is in multi-line mode it only treats "
" as an anchor, otherwise, it treats both "" and "^" as anchors.
Thanks for the explanation! My take is to add such explanations also as comments to the source code, so that intentions are documented and can be matched against the actual implementation.
I made a commit to the branch to show I am still working on this, but it is very much a work in progress. It seems that a lot of the newer features don't have corresponding tests. (Testing multi-line versus single line mode, new syntax versus old syntax, as well as all of the anchor and boundary characters. ) So I am also adding all of those tests, as well as the ones that strictly cover my change. (Right now the tests I added just sit there and use up memory.)
The reason I added this comment is because I am about to go on vacation for a week with limited internet, and wanted to give a quick update before I left. Hopefully I'll have a final version of the tests ready in two weeks.