Non-Greedy *? in SRFI-115 Matches Greedily
The non-greedy *? operator in SRFI-115 does not behave as expected. According to SRFI-115's specification, non-greedy patterns should follow leftmost-shortest semantics.
Steps to reproduce:
(import (chibi regexp))
(regexp-extract '(: "a" (*? any) "z") "a-z-z-a")
Expected output:
("a-z")
Actual output:
("a-z-z")
Additional Context
The issue was originally observed in Chez Scheme’s SRFI-115 implementation.
Discussion on the #scheme IRC channel suggested this may be a broader issue with SRFI-115's reference implementation: https://paste.jrvieira.com/1743421156171
Relevant parts:
[22:11:04] <Zipheir> Also (regexp-extract (rx "a" (*? any) "-") "a-z-a") => ("a-z-"), which is definitely not what I'd expect.
[22:12:32] <Zipheir> CHICKEN's irregex returns ("a-"). I guess there's something going on with the SRFI 115 implementation.
[22:14:39] <Zipheir> chibi's (srfi 115) is also affected.
...
[22:50:15] <Zipheir> zzz: With cond-expand from (srfi :0) I get (cond-expand (regexp-non-greedy #t) (else #f)) => #f
Thanks for the report! There are passing tests for non-greedy matching, so I'm not sure why this case fails. I think it's because the regexp as a whole is still greedy: since the pattern as a whole can match "a-z-z", it must match all of that, which forces the non-greedy component to match more than the minimum. So although unexpected, this may actually be the correct semantics. You can see how this works in the other test cases.
There's currently no way to set the regexp itself to non-greedy, although this could be added (non-portably).
Note also that non-greedy matching is only optionally supported by SRFI 115, so it is best avoided. There's usually an alternative way to achieve the match that you want.
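For the pattern in this report, one possible alternative is to replace the non-greedy `(*? any)` with a greedy repetition over a negated character set, so the match cannot run past the first "z". This is only a sketch: it assumes the intent is "everything up to the first z", and it has not been checked against every SRFI 115 implementation.

```scheme
(import (scheme base)
        (scheme write)
        (chibi regexp))

;; Greedy repetition over "any character except z".
;; Because the repeated set excludes #\z, the match must stop
;; at the first "z" no matter how greedy the regexp is overall.
(define up-to-first-z
  '(: "a" (* (~ #\z)) "z"))

;; Same input as in the report; under the assumption above this
;; should yield ("a-z") rather than ("a-z-z").
(write (regexp-extract up-to-first-z "a-z-z-a"))
(newline)
```

This uses only the `~` char-set complement and the plain `*` operator, neither of which depends on the optional `regexp-non-greedy` feature.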