voca_rs
U+200D (zero-width joiner) breaks the parsing
Long story short:
fn main() {
    assert_eq!(voca_rs::strip::strip_tags("<p>\u{200D}</p>after"), "after");
}
This leads to:
thread 'main' panicked at src/main.rs:2:5:
assertion `left == right` failed
left: ""
right: "after"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I believe this is caused by the way the input is split into grapheme clusters:
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let graphemes = "<p>\u{200D}</p>".graphemes(true).collect::<Vec<_>>();
    assert_eq!(graphemes, ["<", "p", ">\u{200d}", "<", "/", "p", ">"]);
}
It is very hard to work correctly with Unicode, and it is even harder to make non-trivial assumptions that actually hold (like "a grapheme is a character or something like that", or "nothing would be attached to a normal character in a grapheme") :cry:
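For illustration only, here is a minimal sketch of a tag stripper that scans Unicode scalar values (`char`s) rather than grapheme clusters. `strip_tags_by_char` is a hypothetical helper, not part of voca_rs, and it makes no attempt to match what `strip_tags` does with attributes, comments, or malformed markup. It only shows that at the `char` level the zero-width joiner is a separate element, so the closing `>` is still recognized. Note that under this approach the joiner itself is kept in the output, so the result is `"\u{200D}after"` rather than `"after"`.

```rust
/// Hypothetical helper, not part of voca_rs: strips `<...>` tags by scanning
/// Unicode scalar values (`char`s) instead of grapheme clusters.
fn strip_tags_by_char(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            // Entering a tag: skip everything until the matching `>`.
            '<' => in_tag = true,
            // Leaving a tag.
            '>' if in_tag => in_tag = false,
            // Text outside of a tag is copied through unchanged.
            _ if !in_tag => out.push(c),
            // Characters inside a tag are dropped.
            _ => {}
        }
    }
    out
}

fn main() {
    // U+200D is its own `char`, so the `>` before it still closes the tag.
    let stripped = strip_tags_by_char("<p>\u{200D}</p>after");
    assert_eq!(stripped, "\u{200D}after");

    // Removing the joiner afterwards leaves only the visible text.
    assert_eq!(stripped.replace('\u{200D}', ""), "after");
}
```

Whether the joiner should be preserved or removed is a separate design decision; the sketch keeps it because it is text content, not markup.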