regex
regex copied to clipboard
incorrect case for word boundaries
Here's a reproduction:
use regex::bytes::Regex;
fn main() {
let hay = "I have 12, he has 2!";
let re = Regex::new(r"\b..\b").unwrap();
for m in re.find_iter(hay.as_bytes()) {
println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
}
}
Actual output:
"I "
"12"
Expected output:
"I "
"12"
", "
"he"
" 2"
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=55914c890dfb6a68fc72b9c6fd986298
The same bug is present even if we use ASCII word boundaries: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eef23f309c9f608eb683aac982648301
Here's a smaller reproduction:
use regex::bytes::Regex;
fn main() {
let hay = "az,,b";
let re = Regex::new(r"\b..\b").unwrap();
for m in re.find_iter(hay.as_bytes()) {
println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
}
}
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c7507e4d095141004909f9deb1c6cdd7
Originally reported against ripgrep: https://github.com/BurntSushi/ripgrep/issues/1275
I figured out how this bug occurs.
When a match is found on a forward DFA, reverse DFA matching is done on &text[start..]
https://github.com/rust-lang/regex/blob/9921922a02affbf1f7cb5f114826d068d4448de7/src/exec.rs#L704-L716
However, a word boundary check cannot be done only with the substring. For example, when text = "a,"
and start = 1
then &text[start..] = ","
but its index 0 (EOF for reverse matching) should be treated as a word boundary.
@pcpthm, so how should we proceed?
@sergeevabc I was not completely sure how to fix it and this is why I just commented on this issue and didn't write a fix.
An idea is to modify the Byte
struct https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/dfa.rs#L1725 so that there are two special eof
Byte
s where one returns true
to .is_ascii_word()
and other one returns false
. Then, here https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/dfa.rs#L834, we have to use the appropriate type of eof
depending on whether "text[-1]
" is a word-byte or not (of course a slice cannot index at negative so we have to pass it as an additional argument). We have to account for the two kinds of EOF bytes for DFA table layouts e.g. https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/dfa.rs#L1522-L1525 and it might lead to slight performance degradation in some case. I'm not sure it can be improved.
This will be fixed as part of #656.
I found a problem with \b
- is this same issue as discussed here?
This returns None
even though there are two possible matches at locations 3 and 4.
fn main() {
let re = regex::Regex::new(r"\b.").unwrap();
println!("{:?}", re.find_at("foo bar", 3));
}
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3892e7b72507f13b32a1110e94aee066