regex icon indicating copy to clipboard operation
regex copied to clipboard

incorrect case for word boundaries

Open BurntSushi opened this issue 5 years ago • 5 comments

Here's a reproduction:

use regex::bytes::Regex;

fn main() {
    let hay = "I have 12, he has 2!";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}

Actual output:

"I "
"12"

Expected output:

"I "
"12"
", "
"he"
" 2"

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=55914c890dfb6a68fc72b9c6fd986298

The same bug is present even if we use ASCII word boundaries: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eef23f309c9f608eb683aac982648301

Here's a smaller reproduction:

use regex::bytes::Regex;

fn main() {
    let hay = "az,,b";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c7507e4d095141004909f9deb1c6cdd7

Originally reported against ripgrep: https://github.com/BurntSushi/ripgrep/issues/1275

BurntSushi avatar May 08 '19 13:05 BurntSushi

I figured out how this bug occurs. When a match is found on a forward DFA, reverse DFA matching is done on &text[start..] https://github.com/rust-lang/regex/blob/9921922a02affbf1f7cb5f114826d068d4448de7/src/exec.rs#L704-L716 However, a word boundary check cannot be done only with the substring. For example, when text = "a," and start = 1 then &text[start..] = "," but its index 0 (EOF for reverse matching) should be treated as a word boundary.

pcpthm avatar Jun 10 '19 16:06 pcpthm

@pcpthm, so how should we proceed?

sergeevabc avatar Feb 27 '20 11:02 sergeevabc

@sergeevabc I was not completely sure how to fix it and this is why I just commented on this issue and didn't write a fix. An idea is to modify the Byte struct https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/dfa.rs#L1725 so that there are two special eof Bytes where one returns true to .is_ascii_word() and other one returns false. Then, here https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/dfa.rs#L834, we have to use the appropriate type of eof depending on whether "text[-1]" is a word-byte or not (of course a slice cannot index at negative so we have to pass it as an additional argument). We have to account for the two kinds of EOF bytes for DFA table layouts e.g. https://github.com/rust-lang/regex/blob/a0f541bd707a39094d839c1ffd0141d27fe40681/src/dfa.rs#L1522-L1525 and it might lead to slight performance degradation in some case. I'm not sure it can be improved.

pcpthm avatar Feb 27 '20 16:02 pcpthm

This will be fixed as part of #656.

BurntSushi avatar Mar 29 '20 03:03 BurntSushi

I found a problem with \b - is this same issue as discussed here?

This returns None even though there are two possible matches at locations 3 and 4.

fn main() {
    let re = regex::Regex::new(r"\b.").unwrap();
    println!("{:?}", re.find_at("foo bar", 3));
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3892e7b72507f13b32a1110e94aee066

malaire avatar Jan 04 '21 16:01 malaire