regex Matcher: using region might fail to match at start of said region
If I have this:
/** foo
* bar
baz */
with Pattern.compile("^( *)([*] *)?.*$", Pattern.MULTILINE), I have two options (both work in JVM):
- drop 2 characters on either side of comment and use Matcher successfully
- set Matcher region to start at 2 and end 2 short of end, and Matcher skips first line
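For anyone who wants to try the two options side by side, here is a minimal sketch. The input string is an assumption: the leading spaces on the second and third lines are a guess, since the comment as rendered above may have lost whitespace. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionVsSubstring {
    static final Pattern P =
        Pattern.compile("^( *)([*] *)?.*$", Pattern.MULTILINE);

    // Assumed input; the exact whitespace is a guess, not taken from the issue.
    static final String INPUT = "/** foo\n * bar\n baz */";

    // Option 1: physically drop two characters from each side, then match.
    static String firstMatchViaSubstring(String s) {
        Matcher m = P.matcher(s.substring(2, s.length() - 2));
        return m.find() ? m.group() : null;
    }

    // Option 2: keep the string whole and restrict the matcher to a region.
    static String firstMatchViaRegion(String s) {
        Matcher m = P.matcher(s).region(2, s.length() - 2);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // On the JVM both calls are expected to return the first line, "* foo".
        System.out.println(firstMatchViaSubstring(INPUT));
        System.out.println(firstMatchViaRegion(INPUT));
    }
}
```

On the JVM the only difference between the two should be the offsets: indices reported by the region variant are relative to the full string, so they are shifted by 2.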
A couple of notes for future readers, including myself in a month.
- For me, studying the changes to `MatcherTest.scala` was essential in understanding both the regex and the desired results.
- On the two devices I have used to view this Issue it is hard to determine the exact contents of the multi-line input string. From the aforesaid `MatcherTest` I believe the initial characters of the input string are "/** foo" and the last character is a newline. (That is on my Linux machine; beware possible Windows differences with the line separator.)
- I want to run `MatcherTest` with Scala Native and document the defect/failure results here. Everything in due time.
I used a personal variant of the MatcherTest.scala kindly provided in PR #4199 to document the Scala Native 0.5.6 failure:
[error] Test org.scalanative.testsuite.javalib.util.regex.Re2MatcherTest.LeeTmultilineWithRegionLF failed: java.lang.AssertionError: first line, region: YES, expected:<(2,7,2,2,2,4)> but was:<(8,14,8,9,9,11)>, took 0.310 sec
The leftmost two numbers are the start (inclusive) and end (exclusive) of the overall match. The middle two numbers are the start and end of the first group (group(1)). The rightmost two numbers are the start and end of the second group (group(2)).
Since the region starts at 2 and the first group is empty at that point, 2 is correct for both the start of the overall match and the start of the first group. Since the end of that group is also 2, the first group is empty.
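To make the six numbers concrete, here is a small sketch that builds the same kind of tuple from a `java.util.regex.Matcher`. The helper names and the input string are assumptions for illustration (the input's exact whitespace is a guess). This is not the code from the test in the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTuple {
    // (start, end, start(1), end(1), start(2), end(2)) of the current match.
    // A group that did not participate in the match reports -1 for both.
    static String describe(Matcher m) {
        return "(" + m.start() + "," + m.end() + ","
                   + m.start(1) + "," + m.end(1) + ","
                   + m.start(2) + "," + m.end(2) + ")";
    }

    // Tuple for the first match inside the region [start, end), or null.
    static String firstTuple(String input, int start, int end) {
        Pattern p = Pattern.compile("^( *)([*] *)?.*$", Pattern.MULTILINE);
        Matcher m = p.matcher(input).region(start, end);
        return m.find() ? describe(m) : null;
    }

    public static void main(String[] args) {
        String input = "/** foo\n * bar\n baz */"; // assumed input
        // On the JVM this should print (2,7,2,2,2,4), the expected value above.
        System.out.println(firstTuple(input, 2, input.length() - 2));
    }
}
```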
Scala Native seems to ignore the possible match at the start of its region, where the first group is empty and the second group matches. Instead it advances to the beginning of the second line, where it finds a non-empty first group.
Once the SN sequence is off, it is all downhill from there. Like Killington in July; rough skiing.
I am currently investigating the role of the initial caret (^) in the pattern. It should mean "beginning of current region", where the default current region starts at index 0. SN seems to treat it as "beginning of current input", i.e. always index 0.
Then again, that is what the base Issue said.
My current hypothesis is something like: Mumble, mumble, anchoring flags, mumble, mumble.
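On the JVM, the two readings of the caret can be compared directly with `Matcher.useAnchoringBounds`, which is on by default. The sketch below is only an illustration of that JVM switch under an assumed input string (whitespace guessed). It says nothing about how Scala Native is implemented internally, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoringBoundsDemo {
    static final Pattern P =
        Pattern.compile("^( *)([*] *)?.*$", Pattern.MULTILINE);

    // Tuple (start, end, start(1), end(1), start(2), end(2)) for the first
    // match in region [2, length - 2), with anchoring bounds on or off.
    static String firstTuple(String input, boolean anchoring) {
        Matcher m = P.matcher(input)
                     .region(2, input.length() - 2)
                     .useAnchoringBounds(anchoring);
        if (!m.find()) return null;
        return "(" + m.start() + "," + m.end() + ","
                   + m.start(1) + "," + m.end(1) + ","
                   + m.start(2) + "," + m.end(2) + ")";
    }

    public static void main(String[] args) {
        String input = "/** foo\n * bar\n baz */"; // assumed input
        // Anchoring bounds (the JVM default): ^ may match at the region start.
        System.out.println(firstTuple(input, true));
        // Non-anchoring bounds: ^ only matches at a real line start, so the
        // first match moves to the second line.
        System.out.println(firstTuple(input, false));
    }
}
```

With this assumed input, turning anchoring bounds off on the JVM gives the same first-match numbers as the SN failure above, which fits the "beginning of current input" reading of the caret.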