U+200D (zero-width joiner) breaks the parsing

Open: mexus opened this issue 6 months ago • 1 comment

Long story short:

fn main() {
    assert_eq!(voca_rs::strip::strip_tags("<p>\u{200D}</p>after"), "after");
}

This leads to:

thread 'main' panicked at src/main.rs:2:5:
assertion `left == right` failed
  left: ""
 right: "after"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

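I believe it is caused by how the input is split into graphemes: the zero-width joiner gets attached to the preceding ">", so the tag's closing ">" does not appear as a grapheme of its own: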

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let graphemes = "<p>\u{200D}</p>".graphemes(true).collect::<Vec<_>>();
    assert_eq!(graphemes, ["<", "p", ">\u{200d}", "<", "/", "p", ">"]);
}

It is very hard to work correctly with Unicode, and it is even harder to make non-trivial assumptions that actually hold (like "a grapheme is a character or something like that", or "nothing would be attached to a normal character in a grapheme") :cry:
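For illustration, here is a minimal sketch of one way to sidestep the grapheme problem: scan by `char` instead of by grapheme cluster, so that ">" is always seen on its own even when U+200D follows it. The function `strip_tags_by_char` is a hypothetical helper written for this issue, not voca_rs's actual implementation, and it only handles plain `<...>` runs (no comments, attributes containing ">", and so on).

```rust
/// Hypothetical helper (not part of voca_rs): removes `<...>` runs by
/// scanning `char`s, so a combining or joining character after `>`
/// cannot hide the end of a tag.
fn strip_tags_by_char(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            // Any '<' starts a tag in this simplified sketch.
            '<' => in_tag = true,
            // '>' closes a tag only if one is open.
            '>' if in_tag => in_tag = false,
            // Everything outside a tag is kept, including a stray '>'.
            _ if !in_tag => out.push(c),
            // Characters inside a tag are dropped.
            _ => {}
        }
    }
    out
}

fn main() {
    let stripped = strip_tags_by_char("<p>\u{200D}</p>after");
    // The text after the tags survives.
    assert!(stripped.ends_with("after"));
    // The joiner is kept because it is content between the tags.
    assert_eq!(stripped, "\u{200D}after");
}
```

Note that in this sketch the zero-width joiner itself stays in the output, since it sits between the tags as text; whether it should also be dropped (as the assertion at the top of this issue expects) is a separate decision.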

mexus, Dec 26 '23 21:12