ostrich icon indicating copy to clipboard operation
ostrich copied to clipboard

Three problems about regular expressions

Open x5g opened this issue 3 years ago • 4 comments

Hi, I have three questions about regular expressions,

  1. What is the difference between re. from_ecma2020 and re.from.str ?

  2. Does OSTRICH currently support regular expressions with backreferences? E.g., (.)\1.

  3. I found that (?=a).|c -> (model (define-fun w () String "a")), but ((?=a).|c) -> (model (define-fun w () String "\u{0}")), but \u0000 is not accepted by ((?=a).|c).

x5g avatar Apr 21 '22 09:04 x5g

Hi, I have three questions about regular expressions,

Hi! Since nobody's responded to you I'll do my best.

  1. What is the difference between re. from_ecma2020 and re.from.str ?

re.from_str should either convert to/from textbook regular expressions (i.e. not the full ECMA standard), or it's a different kind of parser. Essentially what you want is from_ecma2020, IIRC (@uuverifiers can fill you in!), it should be strictly better.

  1. Does OSTRICH currently support regular expressions with backreferences? E.g., (.)\1.

Not in general since they in the general case make the language not regular and therefore cannot be handled in Ostrich, but @RiccardoDeMasellis and @uuverifiers are working on accommodating some practical use cases. I don't remember how much of that is in the main tree yet though.

  1. I found that (?=a).|c -> (model (define-fun w () String "a")), but ((?=a).|c) -> (model (define-fun w () String "\u{0}")), but \u0000 is not accepted by ((?=a).|c).

That sounds like a bug to me. I know we have had some problems with look-arounds in general so this might be another one.

amandasystems avatar Apr 26 '22 11:04 amandasystems

Sorry for the delay!

Concerning 2), handling back-references is not something we are currently working on; adding support for back-references is one of our long-term goals though.

Concerning 3), I get a correct model for this pattern when using the function re.from_ecma2020. It is true, however, that we are still working on full support for anchors and lookarounds, currently there are still some cases of ECMAScript regular expressions not handled correctly.

pruemmer avatar Apr 26 '22 12:04 pruemmer

Thank you for your answers. @amandasystems @pruemmer

Supporting backreferences is really difficult since it makes the regex language not regular.

For question 3, I wonder if, assuming that the regex is parsed well, i.e., L(r) is converted to an equivalent SMT constraint S. When I solve S using OSTRICH, is its model always consistent with L(r)?

Again, thank you for your works. :)

x5g avatar Apr 27 '22 03:04 x5g

There are still some cases involving anchors or lookarounds that are not handled correctly, even with the re.from_ecma2020 function. This is something we are currently working on, and will hopefully be able to resolve soon!

pruemmer avatar Apr 27 '22 12:04 pruemmer