z3 icon indicating copy to clipboard operation
z3 copied to clipboard

Reasoning with Regex

Open pjljvandelaar opened this issue 5 years ago • 18 comments

The following smt input

(set-info :smt-lib-version 2.5)
(assert (forall ((x String)) 
                (=  (str.in.re x (re.++ (str.to.re "a") (re.* (str.to.re "a"))))
                    (str.in.re x (re.+ (str.to.re "a")))
                )
        )
)
(check-sat)

produces sat. Similar

(assert (forall ((x String))
                (=  (str.in.re x (re.union (re.++ (str.to.re "a") (str.to.re "b")) 
                                           (re.++ (str.to.re "a") (str.to.re "c"))
                                 )
                    )                                 
                    (str.in.re x (re.++ (str.to.re "a") (re.union (str.to.re "b")
                                                                  (str.to.re "c")
                                                        )
                                 )
                    )
                )
        )
)
(check-sat)

and

(assert (forall ((x String)) 
                (=  (str.in.re x (re.loop (str.to.re "a") 5 5))
                    (str.in.re x (str.to.re "aaaaa"))
                )
        )
)
(check-sat)

both produce sat. So there seems to be some sort of symbolic rewriting of Regexes to canical form, such that these cases are solvable. However some more experimenting with

(assert (forall ((x String)(l1 Int)(l2 Int))
                (=  (str.in.re x (re.++ (re.loop (str.to.re "a") l1) (re.loop (str.to.re "a") l2)))
                    (str.in.re x (re.loop (str.to.re "a") (+ l1 l2)))
                )
        )
)
(check-sat)
(assert (forall ((x String)(l1 Int)(l2 Int)(u1 Int)(u2 Int))
                (=  (str.in.re x (re.++ (re.loop (str.to.re "a") l1 u1) (re.loop (str.to.re "a") l2 u2)))
                    (str.in.re x (re.loop (str.to.re "a") (+ l1 l2) (+ u1 u2)))
                )
        )
)
(check-sat)

and

(assert (forall ((x String)(l1 Int)(l2 Int)(u1 Int)(u2 Int))
                (=  (str.in.re x (re.loop (re.loop (str.to.re "a") l1 u1) l2 u2))
                    (str.in.re x (re.loop (str.to.re "a") (* l1 l2) (* u1 u2)))
                )
        )
)
(check-sat)

all produce unknown. Are these symbolic rewrite rules missing? Are these regular expressions not simplified before solving?

Can this functionality be added since it helps in checking correctness of my programs?

pjljvandelaar avatar Sep 20 '19 08:09 pjljvandelaar

Note that with Z3str3 (i.e. z3 smt.string_solver=z3str3) I get two crashes instead of unknown in the last two smt examples!

pjljvandelaar avatar Sep 20 '19 08:09 pjljvandelaar

Not sure these rules should be valid. what is (loop a -1 -1) ++ (loop a 1 1) ? is it the empty RE or the singleton RE comprising of a.

NikolajBjorner avatar Sep 20 '19 09:09 NikolajBjorner

According to http://cvc4.cs.stanford.edu/wiki/Strings

Regular Expression Loop: 
 (re.loop r l u)
where r is a regular expression, l is a non-negative constant integer, and u is a non-negative constant integer. It returns a regular expression that contains at least l repetitions of r and at most u repetitions of r. If l >= u, it returns exactly l repetitions of r. 
Regular Expression Loop-2: 
 (re.loop r l)
where r is a regular expression, and l is a non-negative constant integer. It returns a regular expression that contains at least l repetitions of r. 

negative values are invalid, so your example should be invalid. And my examples are also wrong: they should have these constraints as well. Thus for example it should be:

(assert (forall ((x String)(l1 Int)(l2 Int))
                (or
                    (not (and (>= l1 0) (>= l2 0)))
                    (=  (str.in.re x (re.++ (re.loop (str.to.re "a") l1) (re.loop (str.to.re "a") l2)))
                        (str.in.re x (re.loop (str.to.re "a") (+ l1 l2)))
                    )
                )
        )
)
(check-sat)

Of course, loop x 0 0 == str.to.re "" for all regular expressions x. just as loop x 1 1 == x.

pjljvandelaar avatar Sep 20 '19 09:09 pjljvandelaar

In your commit, you miss the mixed terms: concat (loop a l1, loop a l2 u2) == loop a (l1+l2) concat (loop a l1 u1, loop a l2) == loop a (l1+l2) and similiar for loop (loop a l1) l2 u2 == loop a (l1 *l2) loop (loop a l1 u1) l2 == loop a (l1 *l2)

For these rewrite rule (and more) see https://github.com/TorXakis/TorXakis/blob/1530c59c56e1081946fb891ca3b3289b9b092e76/sys/txs-regex/src/TorXakis/Regex/Regex.hs#L149

pjljvandelaar avatar Sep 20 '19 09:09 pjljvandelaar

loop bounds should really be constants and never symbolic for this spec to make sense. Functions are all total so have some meaning. If they are under-specified then a rewrite rule that removes negative bounds is almost for sure going to introduce inconsistencies. I can add rules for constant bounds, but will likely just remove the rules for variables. The standard definition for regular loop is also based on constant bounds. Symbolic bounds goes beyond basic regular expressions.

NikolajBjorner avatar Sep 20 '19 15:09 NikolajBjorner

The provided rules are incorrect as pointed out in https://github.com/CVC4/CVC4/issues/3302#issuecomment-533673395 I missed and QuickCheck did not find the cornercase of same lower and upperbound in the inner loop.

pjljvandelaar avatar Sep 22 '19 08:09 pjljvandelaar

I made a remark on the mixed terms In particular that loop (loop a l1) l2 u2 == loop a (l1 *l2) That statement is only correct when not ( (l2 == 0) and (u2 == 0)) For example "(a{20,}){0,0}"is not equal to "a{0,}" but to "a{0,0}" or just "()" (the empty regex).

pjljvandelaar avatar Sep 23 '19 09:09 pjljvandelaar

even that seems too weak. (loop (loop a 7) 5 5), then length of accepted word has to be divisible by 5. So I am just removing these rewrites alltogether

NikolajBjorner avatar Sep 23 '19 09:09 NikolajBjorner

Your statement is incorrect imho. to satisfy loop a 7, one can choose any number of a's larger than or equal to 7. So if one makes the following five choices 8,7,7,7,7 as required by the outer loop, one get 8+7+7+7+7=36 a's which is NOT divisible by 5, yet adheres to the regex. It is not obligatory to make the same choices within a loop, otherwise ABCDEF would not satisfy [A-Z]+.

Of course, for identifical bound in the innerloop your are right (loop (loop a 5 5 ) 7) has a length divisible by 5 and larger than or equal to 35.

pjljvandelaar avatar Sep 23 '19 11:09 pjljvandelaar

probably incorrect, but there are examples where divisibility constraints are required and the rewrite introduces errors.

(loop (loop "a" 1 200) 5 7)

accepts strings that are divisible by 5, 6, 7, but not by 5 6 7+1. Thus, it cannot be the same as (loop a 5 1400), which includes the number 211.

So for now I prefer holding off with introducing unmotivated unsoundness bugs

NikolajBjorner avatar Sep 23 '19 16:09 NikolajBjorner

Outer loop choose 5. Inner loops choose 200, 1,1,1, 8. So (loop (loop "a" 1 200) 5 7) should accept 211 "a"s. Since in the inner loop, one can choose any value between lower and upperbound, reasoning with divisible by seems extremely difficult to me.

pjljvandelaar avatar Sep 23 '19 20:09 pjljvandelaar

right, my example is bogus but there are other examples (loop (loop "a" 10 11) 2 3) the possible strings have 20, 21, 22, 30, 31, 32, 33 repetitions of "a" while (loop "a" 20 33) admits 25 repetitions of "a".

NikolajBjorner avatar Oct 02 '19 02:10 NikolajBjorner

You are right, when the innerloop is limited, the rewrite is NOT equal So only forall x>=0, y>=0, z > 0: (a{x,}){y,z} == a{x*y,} is valid. Also the concat of loops is valid.

pjljvandelaar avatar Oct 02 '19 06:10 pjljvandelaar

If my mathematics are not failing me. When a>=0 && b>=a && c>=0 && d>=c && (c==d || (b-a)*c >= a-1) then (x{a,b}){c,d} == x{a*c,b*d}

To verify myself, I am now running

prop_Equivalent_LoopNestedCondition :: Bool
prop_Equivalent_LoopNestedCondition =
    all okLower [low .. high]
  where
    low = 0
    high = 25
    
    okLower :: Integer -> Bool
    okLower a = all (lower a) [a..high]
    
    lower :: Integer -> Integer -> Bool
    lower a b = all (okUpper a b) [low..high]

    okUpper :: Integer -> Integer -> Integer -> Bool
    okUpper a b c = all (upper a b c) [c..high]
    
    upper :: Integer -> Integer -> Integer -> Integer -> Bool
    upper a b c d = let l = a*c
                        u = b*d
                       in   
                            if c==d || (b-a)*c >= a-1
                            then all agree [l .. u]         || trace ("(x{" ++ show a ++ "," ++ show b ++ "}){" ++ show c ++ "," ++ show d ++ "} fails on [" ++ show l ++ "," ++ show u ++ "]") False
                            else any (not . agree) [l .. u] || trace ("(x{" ++ show a ++ "," ++ show b ++ "}){" ++ show c ++ "," ++ show d ++ "} succeeds on [" ++ show l ++ "," ++ show u ++ "]") False
            where
                agree :: Integer -> Bool
                agree i = let string = genericReplicate i 'x'
                              regex = "\\`(x{" ++ show a ++ "," ++ show b ++ "}){" ++ show c ++ "," ++ show d ++ "}\\'"
                            in
                              string =~ regex

pjljvandelaar avatar Oct 02 '19 14:10 pjljvandelaar

I just performed some experiments with the nightly build of z3. And found that we have introduced a bug that was not present in version 4.8.6. For example, the regular expression (re.loop (re.loop " ++ innerRegex ++ " 50) 0) incorrectly accepts all strings adhering to (re.loop "++ innerRegex ++" 0). The rewritting is incorrect since although b and d are infinte, c is equal to 0. So filling these values for b,c,d reduces c == d || (b-a)*c>= a-1 to 0 >= a -1. However a is equal to 50, so the rewrite is NOT valid.

I am now starting measurements that could justify the rewrites... Will report on this shortly.

pjljvandelaar avatar Oct 03 '19 11:10 pjljvandelaar

For eleven regexes that can be rewritten (by merging loops and/or making union with empty regex), I compared the performance of the original and rewritten regex. I used two test cases

  • with regex usage via function and
  • direct usage of regex.

I also used two problem solvers

  • Z3 4.8.6 and
  • CVC4 version 1.8-prerelease (Sep 16 2019 11:19:28)

I obtained the following measurements (in ticks)

regex z3 Regex Original z3 Regex Rewritten z3 Function Original z3 Function Rewritten cvc4 Regex Original cvc4 Regex Rewritten cvc4 Function Original cvc4 Function Rewritten
1 3637218 17153202 3246453 18039262 10243106 6123423 11484059 454230
2 7545942 10627769 3905763 10125212 7246320581 6336074 7705778487 6953326
3 12141647 4099657 8335936 4199509 741646 580046 672300 619140
4 7937231 4075041 8152684 3827550 5964467 579418 567861 533432
5 25026158 3946963 21416687 4077249 6650777 580063 1204472 874737
6 26628270 4116598 23193488 7481562 9551972 575448 4048273 544215
7 2670789 4130744 2515477 4021755 166925447 6150023 159349005 5929955
8 7550278 4277873 4202410 4014248 1011020 570116 648623 518884
9 9633485 15990640 9302433 18902759 6189088 504620 782520 445602
10 16116795 3913543 16070033 7307239 1044343 551087 1089125 652364
11 3219156 3953240 3384628 3964950 7438464 494364 1999952 523646

Note that, for CVC4, I stopped the problem solver after approximaltely 15 minutes on the second original regex in both test cases.

Analysis of performance improvement yields

regex Delta z3 Regex Delta z3 Function Delta cvc4 Regex Delta cvc4 Function
1 -3,716022521 -4,556606549 0,4021908 0,960446912
2 -0,408408519 -1,592377469 0,999125615 0,999097648
3 0,662347538 0,496216262 0,217893712 0,079071843
4 0,486591609 0,530516576 0,902855025 0,060629274
5 0,842286499 0,809622796 0,912782672 0,273758958
6 0,845404977 0,677428337 0,939756105 0,865568602
7 -0,546638091 -0,598804123 0,963157067 0,962786369
8 0,433415167 0,044774784 0,436098198 0,200022201
9 -0,659901894 -1,032023128 0,918466178 0,43055513
10 0,757176101 0,545287866 0,472312258 0,401020085
11 -0,228036168 -0,171458134 0,933539505 0,738170716

So surprising to me, not all rewrites improve the performance of all problem solvers. Whereas CVC4 always profit from regex rewritting, z3 does NOT. Any insights on why Z3 might not benefit from rewritting is highly appreciated!

For the haskell file, used to generate the test cases see Lib.hs.txt For the smt2 input files, see regexes.zip For the measurement script, see run.ps1.txt

pjljvandelaar avatar Oct 03 '19 14:10 pjljvandelaar

z3 translates regexes to symbolic automata. The automata are simplified further using heuristic simplification. The regex reasoning on top of those isn't particularly efficient, but rewriting regexes is only complementing the operations at the automaton level. CVC4, to my knowledge, uses Brzozowski style unfolding of the regex directly, so preprocessing the regex would have measurable effect.

NikolajBjorner avatar Oct 08 '19 01:10 NikolajBjorner

FYI @veanes

NikolajBjorner avatar May 23 '21 21:05 NikolajBjorner