regex $ doesn't match CRLF

I created a regex with multi_line set to true, but after debugging why the regex was matching in a unittest but not in a file, I found out that $ isn't matching the end of a line in the file. I'm using Windows so the newlines are \r\n.

Jun 07 '16 22:06 jminer

Could you please provide a test case that doesn't act as you expect?

Jun 07 '16 23:06 BurntSushi

Sure, here is a program that I would expect to print "Matched: true", but it prints "Matched: false":

extern crate regex;

use regex::RegexBuilder;

fn main() {
    let regex = RegexBuilder::new("^apple$").multi_line(true).compile().unwrap();
    let text = "\r\napple\r\nbanana";
    let mut matched = false;
    for _ in regex.captures_iter(text) {
        matched = true;
    }
    println!("Matched: {}", matched);
}

Jun 07 '16 23:06 jminer

That is indeed expected behavior. $ will only match \n in multi line mode. It's not clear to me whether supporting \r\n is feasible unfortunately.

Jun 07 '16 23:06 BurntSushi

@BurntSushi Would treating \r as an end-of-line character, and \n as non-EOL if preceded by \r, be an acceptable change? This should be doable in a DFA, I think, though it does mean two EOL states. It might be somewhat surprising behavior when \r is embedded in a line, but that seems like a much rarer case than EOL \r, and it's not even clear to me that treating carriage return as EOL is actually wrong.

Mar 21 '17 19:03 BatmanAoD

@BatmanAoD That sounds feasible from an implementation perspective, but I'm not a fan of implementing something that's wrong. (Essentially no systems use \r for line endings any more, and on Windows, it's \r\n, not \r.)

Mar 21 '17 20:03 BurntSushi

But the existing implementation is more wrong, so I'm not sure I understand that as an objection.

Mar 21 '17 20:03 BatmanAoD

(Also, just last week I actually did run into something that uses \r on its own as EOL by default: Putty in serial mode! I was shocked.)

Mar 21 '17 20:03 BatmanAoD

I just ran into this when trying to use it to parse some data coming in via HTTP. This is incredibly confusing, at the very least, as this is the only regex engine I know with this behavior.

Mar 22 '17 19:03 mitsuhiko

I believe RE2 also has this behavior. \r?$ may be a less-than-ideal workaround.

Mar 22 '17 19:03 BurntSushi

I'm going to re-open this, but I don't have any immediate plans to work on it.

Mar 22 '17 19:03 BurntSushi

Just wanted to throw in that I was wrong. It's indeed the same behavior in Python and Go as well. I never noticed it in Python because similar code I was working with stripped the EOL characters, whereas the API I used in Rust only split on \n and left a stray \r hanging around.

So when parts of the data were recombined into a string later the \r were left in there.
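A minimal sketch of that difference, using only the Rust standard library. The helper names split_on_lf and split_with_lines are made up for illustration; the point is that splitting on \n alone keeps the \r, while str::lines treats both \n and \r\n as line endings:

```rust
/// Splits on '\n' only, so each line of CRLF input keeps a trailing '\r'.
/// (Hypothetical helper, for illustration.)
fn split_on_lf(text: &str) -> Vec<&str> {
    text.split('\n').collect()
}

/// Uses `str::lines`, which accepts both "\n" and "\r\n" as line endings
/// and does not yield a trailing empty line.
/// (Hypothetical helper, for illustration.)
fn split_with_lines(text: &str) -> Vec<&str> {
    text.lines().collect()
}

fn main() {
    let text = "apple\r\nbanana\r\n";
    println!("{:?}", split_on_lf(text)); // prints ["apple\r", "banana\r", ""]
    println!("{:?}", split_with_lines(text)); // prints ["apple", "banana"]
}
```

Note that str::lines does not treat a bare \r as a line ending, so it only helps with \n and \r\n input.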

Mar 22 '17 20:03 mitsuhiko

So the regex behavior here does match other engines, despite what I said earlier.

Mar 22 '17 20:03 mitsuhiko

@mitsuhiko Oh interesting. I should have known for Go, but it's interesting to see that Python doesn't do it either:

>>> import re
>>> re.match('foo$', 'foo\n', re.MULTILINE) is not None
True
>>> re.match('foo$', 'foo\r\n', re.MULTILINE) is not None
False

So I guess we're in good company?

Mar 22 '17 20:03 BurntSushi

@BurntSushi That's odd. I've used re in Python 2.7 on Windows pretty extensively, and I'm sure I've used $, and I know the files I was working with had carriage returns in them. I would have thought I'd notice this! I suppose I must have stripped all my input lines before searching for patterns.

Mar 22 '17 21:03 BatmanAoD

@BatmanAoD In Python, I believe if you open your files with U (universal line mode?), then Python will do something clever automatically. Splitting by line and then searching will probably also do it.

Mar 22 '17 21:03 BurntSushi

/foo$/m in JavaScript does match foo\n and foo\r\n.

Mar 22 '17 21:03 BurntSushi

Java's regex also matches \r\n by default:

Pattern p = Pattern.compile("foo$", Pattern.MULTILINE);
System.out.println(p.matcher("foo\r\n").find());

prints true.

Mar 23 '17 02:03 jminer

For JavaScript and Java, does \r on its own match as a newline anchor?

\r on its own is apparently considered a newline character by Unicode.

Is there any objection to simply treating every Unicode line terminator character sequence as a match for $? @BurntSushi, I know you said earlier that treating \r on its own as a line terminator would be "wrong", but I'm still not quite sure I see why you'd consider that to be incorrect behavior, even if it's not the norm for regex engines.

Mar 23 '17 07:03 BatmanAoD

JavaScript matches them alone:

> '1\n2\r3\r\n4'.split(/$/m)
[ '1', '\n2', '\r3', '\r', '\n4' ]

Mar 23 '17 08:03 mitsuhiko

@mitsuhiko Hmm. If the interpreter here is correct, JavaScript also returns a four-element array for '1\r\n\r\n4'.split(/$/m). This is obviously not correct on Windows (there are only two line-endings there).

Mar 23 '17 08:03 BatmanAoD

@BatmanAoD Which browser are you using? Chrome, Firefox and Safari give me 5 elements. Same with JavaScriptCore and V8 in node.

Mar 23 '17 08:03 mitsuhiko

Sorry, I meant 5 elements, splitting on each of the \r's and \n's (so 4 matches, which is what I was thinking when I typed that).

There should only be 3 elements (2 matches).

Mar 23 '17 08:03 BatmanAoD

I'm not sure. The above behavior makes perfect sense if you go by Unicode's classification of newline characters. I find that behavior quite good because it means that matching with $ works in all newline environments. I know some people say files ending with \r are not common any more, but anyone who has worked with OS X knows that \r newlines are a thing of the present. I come across such files regularly.

Mar 23 '17 08:03 mitsuhiko

I don't believe you're interpreting that list of Unicode newline character-sequences correctly, because they list \r\n as a separate entry from \r. I.e., Unicode considers \r\n to be one single newline (as they must, since Windows is so widely used).

A common pattern is to use ^$ to find blank lines; this would give 100% false positives on Windows using the JavaScript behavior.

Mar 23 '17 08:03 BatmanAoD

Java does exactly the right thing:

Pattern p = Pattern.compile("^$", Pattern.MULTILINE);
Matcher m = p.matcher("1\r\r\n2\r\n\r\n3");
while (m.find()) { System.out.println("match at " + m.start() + ": " + m.group()); }

This prints:

match at 2: 
match at 7:

I.e., the first \r is treated as a newline, after which each group of \r\n together is a single newline.

Mar 23 '17 08:03 BatmanAoD

(To be more precise about what I mean by "the right thing": Java's behavior is, as far as I can tell, exactly equivalent to the behavior we'd get from implementing my proposal of treating \r like a newline all the time and \n like a newline except when preceded by \r. This behavior also matches what I would expect a regex engine to do, though as I've learned today, clearly many do not behave this way!)
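A rough sketch of that proposal, written against plain byte offsets rather than inside any regex engine. All three function names are hypothetical, and this is not how the regex crate implements anchors; it only spells out the rule "\r always ends a line, \n ends a line unless preceded by \r" and applies it to the ^$ case:

```rust
/// True if a line starts at byte offset `pos` under the proposed rule.
/// (Hypothetical helper; not part of any regex library.)
fn is_line_start(text: &[u8], pos: usize) -> bool {
    if pos == 0 {
        return true;
    }
    match text[pos - 1] {
        b'\n' => true,
        // A '\r' terminates a line unless it is the first half of "\r\n".
        b'\r' => text.get(pos).copied() != Some(b'\n'),
        _ => false,
    }
}

/// True if a line ends at byte offset `pos` under the proposed rule.
fn is_line_end(text: &[u8], pos: usize) -> bool {
    match text.get(pos).copied() {
        None => true,        // end of input
        Some(b'\r') => true, // '\r' is always treated as a line ending
        // '\n' is a line ending unless it completes a "\r\n" pair.
        Some(b'\n') => pos == 0 || text[pos - 1] != b'\r',
        _ => false,
    }
}

/// Offsets where an empty-line pattern like `^$` would match.
fn empty_line_offsets(text: &str) -> Vec<usize> {
    let bytes = text.as_bytes();
    (0..=bytes.len())
        .filter(|&p| is_line_start(bytes, p) && is_line_end(bytes, p))
        .collect()
}

fn main() {
    // Same input as the Java example above.
    println!("{:?}", empty_line_offsets("1\r\r\n2\r\n\r\n3")); // prints [2, 7]
}
```

Under this rule no offset inside a \r\n pair is ever both a line start and a line end, so ^$ cannot match between the two characters.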

Mar 23 '17 08:03 BatmanAoD

Unicode does not define control characters. Unicode only has recommendations on newline handling, and those recommendations would be "convert from platform newline characters to LS or PS" and then back, which I think nobody does. So I think Unicode in itself is unclear on it. However, Unicode has guidelines on regular expressions:

These two things apply:

Line Boundaries To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).

as well as

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

Note: For some implementations, there may be a performance impact in recognizing CRLF as a single entity, such as with an arbitrary pattern character ("."). To account for that, an implementation may also satisfy R1.6 if there is a mechanism available for converting the sequence CRLF to a single line boundary character before regex processing.

With regard to the Java behavior from above, it yields this (in pseudocode):

> '1\n2\r3\r\n4'.split(/$/m)
['1', '\n2', '\r3', '\r\n4']

Mar 23 '17 08:03 mitsuhiko

Also something to add to this CRLF business:

Arbitrary character pattern (often ".") Where the 'arbitrary character pattern' matches a newline sequence, it must match all of the newline sequences, and \u{D A} (CRLF) should match as if it were a single character. (The recommendation that CRLF match as a single character is, however, not required for conformance to RL1.6.) Note that ^$ (an empty line pattern) should not match the empty string within the sequence \u{D A}, but should match the empty string within the reversed sequence \u{A D}.

All of this is from here: http://unicode.org/reports/tr18/

I guess my recommendation would be to change the behavior to handle \r and \n as newline characters in their own right, to leave CRLF essentially unhandled, and to recommend that people normalize CRLF to LF before processing if they need to match them as a single character.
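For the normalization step alone, a minimal sketch using only the standard library might look like the following. The function name normalize_crlf is made up, and the line check at the end stands in for a multi-line ^apple$ search rather than calling the regex crate:

```rust
/// Replaces every "\r\n" with "\n"; a lone '\r' is left untouched.
/// (Hypothetical helper, for illustration.)
fn normalize_crlf(text: &str) -> String {
    text.replace("\r\n", "\n")
}

fn main() {
    // Same input as the test case at the top of this issue.
    let text = "\r\napple\r\nbanana";
    let normalized = normalize_crlf(text);

    // After normalization every line is delimited by '\n' alone, so a
    // whole-line comparison no longer trips over a trailing '\r'.
    let has_apple_line = normalized.split('\n').any(|line| line == "apple");
    println!("{}", has_apple_line); // prints true
}
```

The cost is an extra allocation and a pass over the input, and any match offsets then refer to the normalized string rather than the original.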

Mar 23 '17 08:03 mitsuhiko

Also one more thing and then I'll shut up: it turns out that .NET, which is probably one of the highest authorities on Windows newline handling, has the same behavior as Python, this crate and Go: https://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Multiline

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

Mar 23 '17 08:03 mitsuhiko

I'm not sure what you mean by "Unicode does not define control characters", since it certainly specifies control character definitions (inheriting some from ASCII and introducing some others, such as the text-direction characters).

By treating \r as a newline but not matching \n as a newline when it immediately follows \r, it looks like we'd meet all the Unicode recommendations except causing . to match \r\n (which would be in a completely separate part of the input logic anyway, since . isn't a zero-width match).

Again, your recommendation of letting \r\n match two separate newlines (i.e. not including special handling for \n following \r) would have the wrong behavior for the ^$ pattern, which is explicitly called out in the text you quoted.

.......but the fact that Microsoft can't be bothered to get this right in .NET just blows me away. Ugh.

Mar 23 '17 08:03 BatmanAoD