Greedy and Reluctant Quantifiers should not produce same match
This is behaving according to the defined spec ("Greediness affects only submatches and never changes the whole match..."), but it doesn't match the "standard" behavior:
(= (re-find #".+" "AAA") "AAA") ;; same in seqexp
(= (re-find #".+?" "AAA") "A") ;; seqexp _+? would say "AAA"
Reluctant and non-reluctant return the same thing.
(se/exec (se/cat (se/+ 3)) [3 3 3 3]) ;; -> {:match (3 3 3 3), :rest nil}
(se/exec (se/cat (se/+? 3)) [3 3 3 3]) ;; -> {:match (3 3 3 3), :rest nil}
;; expected/wanted -> {:match (3), :rest (3 3 3)}
Maybe there is a small modification to support "traditional" reluctant quantifiers?
Hi Luke,
You're right about this behavior. It left me unsatisfied, but it was the most expedient way to get longest match while still being as lazy as possible. Maybe I should rename the current exec to lazy-eager-exec and add a non-lazy exec that behaves in a less surprising manner.
(defn strict-exec [re coll & opts]
  ;; Wrap re in a named group and append a greedy (* _): the overall match
  ;; stays the longest, while :match is taken from re's own submatch, so
  ;; re's greediness/reluctance is honored.
  (let [k (gensym :match)]
    (when-some [m (apply exec (cat (as k re) (* _)) coll opts)]
      (-> m (dissoc k) (assoc :match (get m k))))))
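For instance, on the example at the top of the thread, this should give (untested, with seqexp's operators referred in):
(strict-exec (cat (+? 3)) [3 3 3 3]) ;; -> presumably {:match (3), :rest nil}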
I think that doing that in a lazy fashion is a more involved change.
That would be great if you added strict-exec (maybe eager-exec?) with the expected behavior to a released version of this package. I find the name lazy-eager-exec confusing because 'lazy' and 'eager' are opposites (though I understand they refer to two different things in this context: the way the rule treats its input and the nature of the match it produces).
I didn't realize you'd just given me a wrapper function around your library :) (I'm new to Clojure.) Having tried it, though, I find that it seems to break the longest-match semantics exec had; this test now fails:
(deftest simple
  (are [se s m]
    (= (:match (strict-exec se s)) m)
    (se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
    ))
Easy fix for that?
It's normal: (re-find #"1.|133" "133337") yields "13" too.
I had immediately done a similar test and reached the same conclusion. I even found http://stackoverflow.com/a/4515435/689119, which explains that this is how NFA engines work, and I had already guessed this was an NFA engine from the references you provided, but since the following passed I hoped I'd misunderstood :smile: Bummer; I guess I need a DFA engine, since I need longest-match semantics.
(deftest simple
  (are [se s m]
    (= (:match (exec se s)) m)
    (se/| (se/cat 1 se/_) (se/cat 1 3 3)) [1 3 3 7 7 9] [1 3 3] ; longest match
    ))
The DFA/NFA discussion is slightly wrong: most NFA engines are backtracking engines, but mine is not. In some ways, the algorithm I use builds the DFA on demand. The problem is that you want longest match except when the regex is reluctant, and the notion of a reluctant regex is ill-defined: what if one branch is reluctant and the other eager? Besides, this notion is lost when the regex is transformed into an NFA or DFA. Could we discuss why you need this behavior (longest or shortest match depending on the regex itself) and see how to deal with that?
I'd like a generic "rule engine" supporting regular-expression syntax over sequences with generic function token predicates, where the operators behave in as familiar/standard a way as possible. More importantly, I'd like to be able to define a collection of rules and have the engine match the leftmost longest among them, because otherwise maintenance of a rule set is complicated by a dependence on the order the rules are defined in. I thought the collection could be encoded simply as an or ('|') of "top-level" rules whose matched content could be identified by named groups ('as'). Based on this description, clearly longest match is more important (to support a maintainable rule collection) than shortest match (so that reluctant operator behavior is familiar/standard).
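Roughly what I have in mind for the encoding (illustrative only: the rule names are made up, and the items reuse the failing test above):
(def rules
  (se/|
    (se/as :rule-short (se/cat 1 se/_))
    (se/as :rule-long (se/cat 1 3 3))))
;; Under the leftmost-longest semantics I'm after, matching [1 3 3 7 7 9]
;; would select :rule-long ([1 3 3]), and reordering the two rules above
;; would not change that.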
About familiarity, to recap: strict-exec behaves like re-find (except that the match is anchored at the start), which implies that it doesn't always yield the longest match.
=> (strict-exec (* 1) [1 1 1 1 2])
{:match (1 1 1 1), :rest nil}
=> (re-find #"1*" "11112")
"1111"
=> (strict-exec (*? 1) [1 1 1 1 2])
{:match (), :rest nil}
=> (re-find #"1*?" "11112")
""
However, if you go for the standard exec, inner reluctant quantifiers (that is, those battling with greedy ones) still behave as expected. Only unrestricted reluctant quantifiers go wild!
=> (exec (cat (as :1+? (+? 1)) (as :1+ (+ 1))) [1 1 1 1 2])
{:1+ (1 1 1), :1+? (1), :match (1 1 1 1), :rest (2)}
=> (strict-exec (cat (as :1+? (+? 1)) (as :1+ (+ 1))) [1 1 1 1 2])
{:1+ (1 1 1), :1+? (1), :match (1 1 1 1), :rest (2)}
=> (re-find #"(1+?)(1+)" "11112")
["1111" "1" "111"]
; it does not depend on their relative ordering: they just have to compete
=> (exec (cat (as :1+ (+ 1)) (as :1+? (+? 1))) [1 1 1 1 2])
{:1+? (1), :1+ (1 1 1), :match (1 1 1 1), :rest (2)}
=> (strict-exec (cat (as :1+ (+ 1)) (as :1+? (+? 1))) [1 1 1 1 2])
{:1+? (1), :1+ (1 1 1), :match (1 1 1 1), :rest (2)}
=> (re-find #"(1+)(1+?)" "11112")
["1111" "111" "1"]
By the way, the strict-exec I gave you earlier always reported nil for :rest; here is a more correct version:
(defn strict-exec [re coll & opts]
  ;; Also capture the trailing (* _) padding in a named group so :rest can
  ;; be reported from its submatch.
  (let [k (gensym :match)
        r (gensym :rest)]
    (when-some [m (apply exec (cat (as k re) (as r (* _))) coll opts)]
      (-> m
          (dissoc k r)
          (assoc :match (get m k) :rest (get m r))))))
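Re-running the example from the top of the thread with this version should then (untested) yield the originally wanted result:
=> (strict-exec (cat (+? 3)) [3 3 3 3])
{:match (3), :rest (3 3 3)}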
I didn't mean that relative rule ordering would cause problems with the reluctant operators; I meant that encoding an evolving collection of "top-level" rules with a root alternation, without leftmost-longest behavior, would be problematic to maintain (e.g., reordering rules could easily break things).
After more thought, I think that for my application both the reluctant operators within non-top-level rules and the leftmost-longest match among top-level rules are important. Could this combination be implemented efficiently with an additional "leftmost longest match alternation operator"?
(disclaimer: caffeine hasn't kicked in yet ;-)) I think I need examples (inputs & expected output) to really get the behavior you're after.
For example "leftmost longest" usually means "the longest match amongst the leftmost matches" (and it implies that the leftmost matches all start at the same position, otherwise they wouldn't be all leftmost). However reading you I'm starting to wonder whether you mean "leftmost" as in in "leftmost alternation in the pattern".
exec yields the longest prefix match (if any), so would adding a find function that yields the leftmost (instead of the prefix) longest match be OK? Or am I missing a case?
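Roughly, an untested sketch of what I mean by find (hypothetical name find-exec, not necessarily how it would end up in the library):
(defn find-exec [re coll & opts]
  ;; Try exec at each successive start position; the first success is the
  ;; leftmost match, and exec already makes it the longest one starting there.
  (loop [s coll]
    (or (apply exec re s opts)
        (when (seq s)
          (recur (rest s))))))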
I only mean leftmost longest in the standard among-the-matches sense.
Here's an example of what I'm thinking of, with a fictitious operator longest:
(longest
  (as :shorter (cat (+? 1)))
  (as :longer (cat (as :chunk1 (+? 1)) (as :skipped (*? _)) (as :chunk2 (+ 2)))))
that given [1 1 1 3 3 2 2] would return
{:longer [1 1 1 3 3 2 2] :chunk1 [1] :skipped [1 1 3 3] :chunk2 [2 2]}
I realize I could just run each "top-level rule" argument of longest in turn with strict-exec and retain the one with the longest :match, but I assume running them together as a single FSA would be more efficient.
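For reference, a naive version of that fallback might look like this (hypothetical helper, assuming the corrected strict-exec above; untested):
(defn longest-exec
  "Runs each top-level rule separately with strict-exec and keeps the result
  whose :match is longest. Naive fallback, not a single FSA; on ties the
  later rule wins."
  [coll & rules]
  (when-some [ms (seq (keep #(strict-exec % coll) rules))]
    (apply max-key #(count (:match %)) ms)))
;; e.g. (longest-exec [1 1 1 3 3 2 2]
;;        (as :shorter (cat (+? 1)))
;;        (as :longer (cat (as :chunk1 (+? 1)) (as :skipped (*? _)) (as :chunk2 (+ 2)))))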