z3 Reasoning with Regex

The following smt input

(set-info :smt-lib-version 2.5)
(assert (forall ((x String)) 
                (=  (str.in.re x (re.++ (str.to.re "a") (re.* (str.to.re "a"))))
                    (str.in.re x (re.+ (str.to.re "a")))
                )
        )
)
(check-sat)

produces sat. Similar

(assert (forall ((x String))
                (=  (str.in.re x (re.union (re.++ (str.to.re "a") (str.to.re "b")) 
                                           (re.++ (str.to.re "a") (str.to.re "c"))
                                 )
                    )                                 
                    (str.in.re x (re.++ (str.to.re "a") (re.union (str.to.re "b")
                                                                  (str.to.re "c")
                                                        )
                                 )
                    )
                )
        )
)
(check-sat)

and

(assert (forall ((x String)) 
                (=  (str.in.re x (re.loop (str.to.re "a") 5 5))
                    (str.in.re x (str.to.re "aaaaa"))
                )
        )
)
(check-sat)

both produce sat. So there seems to be some sort of symbolic rewriting of Regexes to canical form, such that these cases are solvable. However some more experimenting with

(assert (forall ((x String)(l1 Int)(l2 Int))
                (=  (str.in.re x (re.++ (re.loop (str.to.re "a") l1) (re.loop (str.to.re "a") l2)))
                    (str.in.re x (re.loop (str.to.re "a") (+ l1 l2)))
                )
        )
)
(check-sat)

(assert (forall ((x String)(l1 Int)(l2 Int)(u1 Int)(u2 Int))
                (=  (str.in.re x (re.++ (re.loop (str.to.re "a") l1 u1) (re.loop (str.to.re "a") l2 u2)))
                    (str.in.re x (re.loop (str.to.re "a") (+ l1 l2) (+ u1 u2)))
                )
        )
)
(check-sat)

and

(assert (forall ((x String)(l1 Int)(l2 Int)(u1 Int)(u2 Int))
                (=  (str.in.re x (re.loop (re.loop (str.to.re "a") l1 u1) l2 u2))
                    (str.in.re x (re.loop (str.to.re "a") (* l1 l2) (* u1 u2)))
                )
        )
)
(check-sat)

all produce unknown. Are these symbolic rewrite rules missing? Are these regular expressions not simplified before solving?

Can this functionality be added since it helps in checking correctness of my programs?

Sep 20 '19 08:09 pjljvandelaar

Note that with Z3str3 (i.e. z3 smt.string_solver=z3str3) I get two crashes instead of unknown in the last two smt examples!

Sep 20 '19 08:09 pjljvandelaar

Not sure these rules should be valid. what is (loop a -1 -1) ++ (loop a 1 1) ? is it the empty RE or the singleton RE comprising of a.

Sep 20 '19 09:09 NikolajBjorner

According to http://cvc4.cs.stanford.edu/wiki/Strings

Regular Expression Loop: 
 (re.loop r l u)
where r is a regular expression, l is a non-negative constant integer, and u is a non-negative constant integer. It returns a regular expression that contains at least l repetitions of r and at most u repetitions of r. If l >= u, it returns exactly l repetitions of r. 
Regular Expression Loop-2: 
 (re.loop r l)
where r is a regular expression, and l is a non-negative constant integer. It returns a regular expression that contains at least l repetitions of r.

negative values are invalid, so your example should be invalid. And my examples are also wrong: they should have these constraints as well. Thus for example it should be:

(assert (forall ((x String)(l1 Int)(l2 Int))
                (or
                    (not (and (>= l1 0) (>= l2 0)))
                    (=  (str.in.re x (re.++ (re.loop (str.to.re "a") l1) (re.loop (str.to.re "a") l2)))
                        (str.in.re x (re.loop (str.to.re "a") (+ l1 l2)))
                    )
                )
        )
)
(check-sat)

Of course, loop x 0 0 == str.to.re "" for all regular expressions x. just as loop x 1 1 == x.

Sep 20 '19 09:09 pjljvandelaar

In your commit, you miss the mixed terms: concat (loop a l1, loop a l2 u2) == loop a (l1+l2) concat (loop a l1 u1, loop a l2) == loop a (l1+l2) and similiar for loop (loop a l1) l2 u2 == loop a (l1 *l2) loop (loop a l1 u1) l2 == loop a (l1 *l2)

For these rewrite rule (and more) see https://github.com/TorXakis/TorXakis/blob/1530c59c56e1081946fb891ca3b3289b9b092e76/sys/txs-regex/src/TorXakis/Regex/Regex.hs#L149

Sep 20 '19 09:09 pjljvandelaar

loop bounds should really be constants and never symbolic for this spec to make sense. Functions are all total so have some meaning. If they are under-specified then a rewrite rule that removes negative bounds is almost for sure going to introduce inconsistencies. I can add rules for constant bounds, but will likely just remove the rules for variables. The standard definition for regular loop is also based on constant bounds. Symbolic bounds goes beyond basic regular expressions.

Sep 20 '19 15:09 NikolajBjorner

The provided rules are incorrect as pointed out in https://github.com/CVC4/CVC4/issues/3302#issuecomment-533673395 I missed and QuickCheck did not find the cornercase of same lower and upperbound in the inner loop.

Sep 22 '19 08:09 pjljvandelaar

I made a remark on the mixed terms In particular that loop (loop a l1) l2 u2 == loop a (l1 *l2) That statement is only correct when not ( (l2 == 0) and (u2 == 0)) For example "(a{20,}){0,0}"is not equal to "a{0,}" but to "a{0,0}" or just "()" (the empty regex).

Sep 23 '19 09:09 pjljvandelaar

even that seems too weak. (loop (loop a 7) 5 5), then length of accepted word has to be divisible by 5. So I am just removing these rewrites alltogether

Sep 23 '19 09:09 NikolajBjorner

Your statement is incorrect imho. to satisfy loop a 7, one can choose any number of a's larger than or equal to 7. So if one makes the following five choices 8,7,7,7,7 as required by the outer loop, one get 8+7+7+7+7=36 a's which is NOT divisible by 5, yet adheres to the regex. It is not obligatory to make the same choices within a loop, otherwise ABCDEF would not satisfy [A-Z]+.

Of course, for identifical bound in the innerloop your are right (loop (loop a 5 5 ) 7) has a length divisible by 5 and larger than or equal to 35.

Sep 23 '19 11:09 pjljvandelaar

probably incorrect, but there are examples where divisibility constraints are required and the rewrite introduces errors.

(loop (loop "a" 1 200) 5 7)

accepts strings that are divisible by 5, 6, 7, but not by 5 6 7+1. Thus, it cannot be the same as (loop a 5 1400), which includes the number 211.

So for now I prefer holding off with introducing unmotivated unsoundness bugs

Sep 23 '19 16:09 NikolajBjorner

Outer loop choose 5. Inner loops choose 200, 1,1,1, 8. So (loop (loop "a" 1 200) 5 7) should accept 211 "a"s. Since in the inner loop, one can choose any value between lower and upperbound, reasoning with divisible by seems extremely difficult to me.

Sep 23 '19 20:09 pjljvandelaar

right, my example is bogus but there are other examples (loop (loop "a" 10 11) 2 3) the possible strings have 20, 21, 22, 30, 31, 32, 33 repetitions of "a" while (loop "a" 20 33) admits 25 repetitions of "a".

Oct 02 '19 02:10 NikolajBjorner

You are right, when the innerloop is limited, the rewrite is NOT equal So only forall x>=0, y>=0, z > 0: (a{x,}){y,z} == a{x*y,} is valid. Also the concat of loops is valid.

Oct 02 '19 06:10 pjljvandelaar

If my mathematics are not failing me. When a>=0 && b>=a && c>=0 && d>=c && (c==d || (b-a)*c >= a-1) then (x{a,b}){c,d} == x{a*c,b*d}

To verify myself, I am now running

prop_Equivalent_LoopNestedCondition :: Bool
prop_Equivalent_LoopNestedCondition =
    all okLower [low .. high]
  where
    low = 0
    high = 25
    
    okLower :: Integer -> Bool
    okLower a = all (lower a) [a..high]
    
    lower :: Integer -> Integer -> Bool
    lower a b = all (okUpper a b) [low..high]

    okUpper :: Integer -> Integer -> Integer -> Bool
    okUpper a b c = all (upper a b c) [c..high]
    
    upper :: Integer -> Integer -> Integer -> Integer -> Bool
    upper a b c d = let l = a*c
                        u = b*d
                       in   
                            if c==d || (b-a)*c >= a-1
                            then all agree [l .. u]         || trace ("(x{" ++ show a ++ "," ++ show b ++ "}){" ++ show c ++ "," ++ show d ++ "} fails on [" ++ show l ++ "," ++ show u ++ "]") False
                            else any (not . agree) [l .. u] || trace ("(x{" ++ show a ++ "," ++ show b ++ "}){" ++ show c ++ "," ++ show d ++ "} succeeds on [" ++ show l ++ "," ++ show u ++ "]") False
            where
                agree :: Integer -> Bool
                agree i = let string = genericReplicate i 'x'
                              regex = "\\`(x{" ++ show a ++ "," ++ show b ++ "}){" ++ show c ++ "," ++ show d ++ "}\\'"
                            in
                              string =~ regex

Oct 02 '19 14:10 pjljvandelaar

I just performed some experiments with the nightly build of z3. And found that we have introduced a bug that was not present in version 4.8.6. For example, the regular expression (re.loop (re.loop " ++ innerRegex ++ " 50) 0) incorrectly accepts all strings adhering to (re.loop "++ innerRegex ++" 0). The rewritting is incorrect since although b and d are infinte, c is equal to 0. So filling these values for b,c,d reduces c == d || (b-a)*c>= a-1 to 0 >= a -1. However a is equal to 50, so the rewrite is NOT valid.

I am now starting measurements that could justify the rewrites... Will report on this shortly.

Oct 03 '19 11:10 pjljvandelaar

For eleven regexes that can be rewritten (by merging loops and/or making union with empty regex), I compared the performance of the original and rewritten regex. I used two test cases

with regex usage via function and
direct usage of regex.

I also used two problem solvers

Z3 4.8.6 and
CVC4 version 1.8-prerelease (Sep 16 2019 11:19:28)

I obtained the following measurements (in ticks)

regex	z3 Regex Original	z3 Regex Rewritten	z3 Function Original	z3 Function Rewritten	cvc4 Regex Original	cvc4 Regex Rewritten	cvc4 Function Original	cvc4 Function Rewritten
1	3637218	17153202	3246453	18039262	10243106	6123423	11484059	454230
2	7545942	10627769	3905763	10125212	7246320581	6336074	7705778487	6953326
3	12141647	4099657	8335936	4199509	741646	580046	672300	619140
4	7937231	4075041	8152684	3827550	5964467	579418	567861	533432
5	25026158	3946963	21416687	4077249	6650777	580063	1204472	874737
6	26628270	4116598	23193488	7481562	9551972	575448	4048273	544215
7	2670789	4130744	2515477	4021755	166925447	6150023	159349005	5929955
8	7550278	4277873	4202410	4014248	1011020	570116	648623	518884
9	9633485	15990640	9302433	18902759	6189088	504620	782520	445602
10	16116795	3913543	16070033	7307239	1044343	551087	1089125	652364
11	3219156	3953240	3384628	3964950	7438464	494364	1999952	523646

Note that, for CVC4, I stopped the problem solver after approximaltely 15 minutes on the second original regex in both test cases.

Analysis of performance improvement yields

regex	Delta z3 Regex	Delta z3 Function	Delta cvc4 Regex	Delta cvc4 Function
1	-3,716022521	-4,556606549	0,4021908	0,960446912
2	-0,408408519	-1,592377469	0,999125615	0,999097648
3	0,662347538	0,496216262	0,217893712	0,079071843
4	0,486591609	0,530516576	0,902855025	0,060629274
5	0,842286499	0,809622796	0,912782672	0,273758958
6	0,845404977	0,677428337	0,939756105	0,865568602
7	-0,546638091	-0,598804123	0,963157067	0,962786369
8	0,433415167	0,044774784	0,436098198	0,200022201
9	-0,659901894	-1,032023128	0,918466178	0,43055513
10	0,757176101	0,545287866	0,472312258	0,401020085
11	-0,228036168	-0,171458134	0,933539505	0,738170716

So surprising to me, not all rewrites improve the performance of all problem solvers. Whereas CVC4 always profit from regex rewritting, z3 does NOT. Any insights on why Z3 might not benefit from rewritting is highly appreciated!

For the haskell file, used to generate the test cases see Lib.hs.txt For the smt2 input files, see regexes.zip For the measurement script, see run.ps1.txt

Oct 03 '19 14:10 pjljvandelaar

z3 translates regexes to symbolic automata. The automata are simplified further using heuristic simplification. The regex reasoning on top of those isn't particularly efficient, but rewriting regexes is only complementing the operations at the automaton level. CVC4, to my knowledge, uses Brzozowski style unfolding of the regex directly, so preprocessing the regex would have measurable effect.

Oct 08 '19 01:10 NikolajBjorner

FYI @veanes

May 23 '21 21:05 NikolajBjorner

z3 z3 copied to clipboard

Reasoning with Regex

z3
z3 copied to clipboard