$ doesn't match CRLF
I created a regex with multi_line set to true, but after debugging why the regex was matching in a unit test but not in a file, I found out that $ isn't matching the end of a line in the file. I'm using Windows, so the newlines are \r\n.
Could you please provide a test case that doesn't act as you expect?
Sure, here is a program that I would expect to print "Matched: true", but it prints "Matched: false":
extern crate regex;
use regex::RegexBuilder;

fn main() {
    let regex = RegexBuilder::new("^apple$").multi_line(true).compile().unwrap();
    let text = "\r\napple\r\nbanana";
    let mut matched = false;
    for _ in regex.captures_iter(text) {
        matched = true;
    }
    println!("Matched: {}", matched);
}
That is indeed expected behavior. In multi-line mode, $ only matches at the end of the input or immediately before \n. Unfortunately, it's not clear to me whether supporting \r\n is feasible.
@BurntSushi Would treating \r as an end-of-line character, and \n as non-EOL if preceded by \r, be an acceptable change? This should be doable in a DFA, I think, though it does mean two EOL states. It might be somewhat surprising behavior when \r is embedded in a line, but that seems like a much rarer case than EOL \r, and it's not even clear to me that treating carriage return as EOL is actually wrong.
@BatmanAoD That sounds feasible from an implementation perspective, but I'm not a fan of implementing something that's wrong. (Essentially no systems use \r for line endings any more, and on Windows, it's \r\n, not \r.)
But the existing implementation is more wrong, so I'm not sure I understand that as an objection.
(Also, just last week I actually did run into something that uses \r on its own as EOL by default: Putty in serial mode! I was shocked.)
I just ran into this when trying to use it to parse some data coming in via HTTP. This is incredibly confusing at the very least, as this is the only regex engine I know of with this behavior.
I believe RE2 also has this behavior. \r?$ may be a not-ideal work around.
I'm going to re-open this, but I don't have any immediate plans to work on it.
Just wanted to throw in that I was wrong. It's indeed the same behavior in Python and Go as well. I never noticed in the former because similar code I was working with stripped the EOL chars, whereas the API I used in Rust only split on \n and left a stray \r hanging around.
So when parts of the data were recombined into a string later, the \r characters were left in there.
So the regex behavior here does match other engines, despite what I said earlier.
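For anyone else who hits the stray \r described above, here is a rough sketch of the difference using only the Rust standard library. The function names split_on_lf and split_lines are made up for illustration; the only real APIs used are str::split and str::lines.

```rust
/// Splits on '\n' only, so every CRLF-terminated line keeps its trailing '\r'.
fn split_on_lf(text: &str) -> Vec<&str> {
    text.split('\n').collect()
}

/// Uses `str::lines`, which treats both "\n" and "\r\n" as line endings
/// and strips them from the returned slices.
fn split_lines(text: &str) -> Vec<&str> {
    text.lines().collect()
}
```

With the input "apple\r\nbanana", the first function hands back "apple\r" as its first element, which is exactly the kind of leftover that later stops $ from matching right after "apple".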
@mitsuhiko Oh interesting. I should have known for Go, but it's interesting to see that Python doesn't do it either:
>>> import re
>>> re.match('foo$', 'foo\n', re.MULTILINE) is not None
True
>>> re.match('foo$', 'foo\r\n', re.MULTILINE) is not None
False
So I guess we're in good company?
@BurntSushi That's odd. I've used re in Python 2.7 on Windows pretty extensively, and I'm sure I've used $, and I know the files I was working with had carriage returns in them. I would have thought I'd notice this! I suppose I must have stripped all my input lines before searching for patterns.
@BatmanAoD In Python, I believe if you open your files with U (universal newlines mode), then Python will do something clever automatically. Splitting by line and then searching will probably also do it.
/foo$/m in Javascript does match foo\n and foo\r\n.
Java's regex also matches \r\n by default:
Pattern p = Pattern.compile("foo$", Pattern.MULTILINE);
System.out.println(p.matcher("foo\r\n").find());
prints true.
For Javascript and Java, does \r on its own match as a newline anchor?
\r on its own is apparently considered a newline character by Unicode.
Is there any objection to simply treating every Unicode line terminator character sequence as a match for $? @BurntSushi, I know you said earlier that treating \r on its own as a line terminator would be "wrong", but I'm still not quite sure I see why you'd consider that to be incorrect behavior, even if it's not the norm for regex engines.
JavaScript matches them alone:
> '1\n2\r3\r\n4'.split(/$/m)
[ '1', '\n2', '\r3', '\r', '\n4' ]
@mitsuhiko Hmm. If the interpreter here is correct, JavaScript also returns a four-element array here: '1\r\n\r\n4'.split(/$/m) This is obviously not correct on Windows (there are only two line-endings there).
@BatmanAoD which browser are you using? Chrome, Firefox and Safari gives me 5 elements. Same with JavaScript Core and V8 in node.
Sorry, I meant 5 elements, splitting on each of the \r's and \n's (so 4 matches, which is what I was thinking when I typed that).
There should only be 3 elements (2 matches).
I'm not sure. The above behavior makes perfect sense if you go by Unicode's classification of newline characters. I find that behavior quite good because it means that matching with $ works in all newline environments. I know some people say files ending with \r are not common any more, but anyone who has ever worked with OS X knows that \r newlines are a thing of the present. I come across such files regularly.
I don't believe you're interpreting that list of Unicode newline character-sequences correctly, because they list \r\n as a separate entry from \r. I.e., Unicode considers \r\n to be one single newline (as they must, since Windows is so widely used).
A common pattern is to use ^$ to find blank lines; this would give 100% false positives on Windows using the JavaScript behavior.
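To make the false-positive concern concrete, here is a small sketch that models the rule "every \r and every \n is its own line terminator" without any regex engine. It is only a model of that rule as described in this thread, not an implementation taken from any real engine, and blank_lines_per_char is a hypothetical name.

```rust
/// Byte offsets where `^$` would match if '\r' and '\n' were each treated
/// as an independent line terminator.
fn blank_lines_per_char(text: &str) -> Vec<usize> {
    let b = text.as_bytes();
    let is_term = |c: u8| c == b'\r' || c == b'\n';
    (0..=b.len())
        .filter(|&i| {
            // `^`: start of input, or directly after a terminator character.
            let at_start = i == 0 || is_term(b[i - 1]);
            // `$`: end of input, or directly before a terminator character.
            let at_end = i == b.len() || is_term(b[i]);
            at_start && at_end
        })
        .collect()
}
```

For "a\r\nb", which contains no blank line, this reports a match at offset 2, i.e. between the \r and the \n. The same text with plain \n line endings reports nothing.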
Java does exactly the right thing:
Pattern p = Pattern.compile("^$", Pattern.MULTILINE);
Matcher m = p.matcher("1\r\r\n2\r\n\r\n3");
while (m.find()) { System.out.println("match at " + m.start() + ": " + m.group()); }
This prints:
match at 2:
match at 7:
I.e., the first \r is treated as a newline, after which each group of \r\n together is a single newline.
(To be more precise about what I mean by "the right thing": Java's behavior is, as far as I can tell, exactly equivalent to the behavior we'd get from implementing my proposal of treating \r like a newline all the time and \n like a newline except when preceded by \r. This behavior also matches what I would expect a regex engine to do, though as I've learned today, clearly many do not behave this way!)
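Here is a minimal sketch of what I mean, again as a plain byte scan rather than a change to the regex crate. It assumes my reading of the proposal (\r always ends a line, \n ends a line unless it directly follows \r), and blank_lines_crlf_aware is a made-up name.

```rust
/// Byte offsets where `^$` would match under the proposed rule:
/// '\r' always terminates a line, and '\n' does too unless it directly
/// follows '\r', so "\r\n" counts as a single terminator.
fn blank_lines_crlf_aware(text: &str) -> Vec<usize> {
    let b = text.as_bytes();
    let n = b.len();
    (0..=n)
        .filter(|&i| {
            // The position between '\r' and '\n' is never a line boundary.
            let inside_crlf = i > 0 && i < n && b[i - 1] == b'\r' && b[i] == b'\n';
            // `^`: start of input, or just after a terminator.
            let at_start = i == 0 || b[i - 1] == b'\n' || b[i - 1] == b'\r';
            // `$`: end of input, or just before a terminator.
            let at_end = i == n || b[i] == b'\r' || b[i] == b'\n';
            !inside_crlf && at_start && at_end
        })
        .collect()
}
```

On the Java input above, "1\r\r\n2\r\n\r\n3", this yields offsets 2 and 7, the same positions Java printed. On "a\r\nb" it yields nothing, so the blank-line pattern has no false positive there.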
Unicode does not define control characters. Unicode only has recommendations on newline handling, and those amount to "convert from platform newline characters to LS or PS" and then back, which I think nobody does. So I think Unicode in itself is unclear on it. However, Unicode has guidelines on regular expressions:
These two things apply:
Line Boundaries To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).
as well as
It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.
(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])
Note: For some implementations, there may be a performance impact in recognizing CRLF as a single entity, such as with an arbitrary pattern character ("."). To account for that, an implementation may also satisfy R1.6 if there is a mechanism available for converting the sequence CRLF to a single line boundary character before regex processing.
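As a rough illustration of what that expression accepts, here is a hand-written equivalent in plain Rust. It is a sketch of the quoted character set, not code from the regex crate or from TR18 itself, and newline_len is a hypothetical helper name.

```rust
/// Returns the byte length of the line-ending sequence at the start of `s`,
/// or `None` if `s` does not begin with one. CRLF is checked first so that
/// it is consumed as a single unit rather than as '\r' followed by '\n'.
fn newline_len(s: &str) -> Option<usize> {
    if s.starts_with("\r\n") {
        return Some(2);
    }
    match s.chars().next()? {
        // U+000A..U+000D (LF, VT, FF, CR), NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
        c @ ('\u{A}'..='\u{D}' | '\u{85}' | '\u{2028}' | '\u{2029}') => Some(c.len_utf8()),
        _ => None,
    }
}
```

The lengths are in UTF-8 bytes, so NEL counts as 2 and LINE SEPARATOR as 3, while CRLF is reported as one 2-byte sequence.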
WRT the Java behavior from above, it yields this (in pseudocode):
> '1\n2\r3\r\n4'.split(/$/m)
['1', '\n2', '\r3', '\r\n4']
Also something to add to this CRLF business:
Arbitrary character pattern (often ".") Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.) Note that ^$ (an empty line pattern) should not match the empty string within the sequence \u{D A}, but should match the empty string within the reversed sequence \u{A D}.
All of this is from here: http://unicode.org/reports/tr18/
I guess my recommendation would be to change the behavior to handle \r and \n as newline characters in their own right, to leave CRLF essentially unhandled, and to recommend that people normalize CRLF to LF before processing if they need to match it as a single character.
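The normalization step could look roughly like this. It is a sketch using only str::replace from the standard library; normalize_crlf and has_exact_line are hypothetical helpers, with has_exact_line standing in for a multi-line ^apple$ search so the example does not depend on the regex crate.

```rust
/// Rewrites every "\r\n" to "\n". A lone '\r' is left untouched.
fn normalize_crlf(text: &str) -> String {
    text.replace("\r\n", "\n")
}

/// Stand-in for a multi-line `^wanted$` search where only '\n' ends a line.
fn has_exact_line(text: &str, wanted: &str) -> bool {
    text.split('\n').any(|line| line == wanted)
}
```

On the text from the original report, "\r\napple\r\nbanana", has_exact_line finds no "apple" line until the input has gone through normalize_crlf. The cost is one extra allocation and copy of the input.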
Also, one more thing and then I'll shut up: it turns out that .NET, which is probably one of the highest authorities on Windows newline handling, has the same behavior as Python, this crate, and Go: https://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Multiline
By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.
I'm not sure what you mean by "Unicode does not define control characters", since it certainly specifies control character definitions (inheriting some from ASCII and introducing some others, such as the text-direction characters).
By treating \r as a newline but not matching \n as a newline when it immediately follows \r, it looks like we'd meet all the Unicode recommendations except causing . to match \r\n (which would be in a completely separate part of the input logic anyway, since . isn't a zero-width match).
Again, your recommendation of letting \r\n match two separate newlines (i.e. not including special handling for \n following \r) would have the wrong behavior for the ^$ pattern, which is explicitly called out in the text you quoted.
.......but the fact that Microsoft can't be bothered to get this right in .NET just blows me away. Ugh.