matroska
Should String Elements and UTF-8 Elements be handled differently?
EBML differentiates between String Elements and UTF-8 Elements.
They are almost exactly the same, except for this:
- "String" only allows ASCII and terminator bytes.
- "UTF-8" only allows UTF-8 and terminator bytes.
Since UTF-8 is backward-compatible with ASCII, we can just treat both Elements as UTF-8 without any parsing failures. This is what the library currently does: https://github.com/rust-av/matroska/blob/7f15b7c9c6e7d2ad0749da053b77f051c5e4e146/src/ebml/parse.rs#L80-L84
However, this is technically not spec-compliant: for a String Element we should reject any byte that is neither ASCII nor a terminator byte, while such bytes may be allowed in a UTF-8 Element.
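To make the current lenient behavior concrete, here is a small sketch (not the library's actual code) of parsing that treats both element types as UTF-8, after stripping EBML's trailing zero-padding:

```rust
// Sketch only: lenient parsing that treats both String and UTF-8 Elements
// as UTF-8. EBML allows zero-byte padding at the end of string elements,
// so we cut the data off at the first NUL before validating.
fn parse_string_lenient(data: &[u8]) -> Option<String> {
    let end = data.iter().position(|&b| b == 0).unwrap_or(data.len());
    std::str::from_utf8(&data[..end]).ok().map(str::to_owned)
}

fn main() {
    // ASCII input with terminator padding parses fine...
    assert_eq!(parse_string_lenient(b"webm\0\0"), Some("webm".to_string()));
    // ...and so does non-ASCII UTF-8, even for a String Element (the
    // spec-compliance gap discussed above).
    assert_eq!(parse_string_lenient("caf\u{e9}".as_bytes()), Some("café".to_string()));
    // Invalid UTF-8 is still rejected.
    assert_eq!(parse_string_lenient(&[0xFF, 0xFE]), None);
}
```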
A small overview:
| | String = UTF-8 | String != UTF-8 |
|---|---|---|
| Types | only `String` | `String` + new type |
| Parsing | `from_utf8` (std) | `from_utf8` (std) + `from_ascii` (custom, maybe faster?) |
| Compliance | mostly compliant | compliant |
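The hypothetical custom `from_ascii` mentioned in the table could look something like this sketch. ASCII is a subset of UTF-8, so once the bytes are validated as ASCII, reinterpreting them as a `&str` is sound:

```rust
// Hypothetical `from_ascii` (not a std function): validate that a byte
// slice is pure ASCII and reinterpret it as &str.
fn from_ascii(data: &[u8]) -> Option<&str> {
    if data.is_ascii() {
        // Every ASCII byte sequence is also valid UTF-8, so the
        // unchecked conversion cannot produce an invalid &str here.
        Some(unsafe { std::str::from_utf8_unchecked(data) })
    } else {
        None
    }
}

fn main() {
    assert_eq!(from_ascii(b"EBML"), Some("EBML"));
    // Non-ASCII bytes (here, UTF-8-encoded 'é') are rejected.
    assert_eq!(from_ascii("\u{e9}".as_bytes()), None);
}
```

Whether this is actually faster than `from_utf8` would need benchmarking; `<[u8]>::is_ascii` is a simple linear scan, so the difference may be small.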
Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory because Rust strings are all UTF-8 internally (so a String made up of only ASCII will only use one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?
Accepting such input is mostly OK; writing it, on the other hand, is what should be avoided, at least if we had a strict mode.
@robUx4 do you have opinions? What do you do in the reference implementation?
Just my opinion, but I prefer being compliant with the specification in order to have a clear and unambiguous implementation.
Is converting bytes into UTF-8 or ASCII a big effort in terms of memory and time consumption?
Perhaps I'm wrong, but we need these conversions to write a compliant Matroska file. Since we have to do that for muxing, can't we apply the same ones for parsing?
You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.
> You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.
Is this spec-compliant? If we find an ASCII string, I suppose we should save it somewhere. If it is non-ASCII where there should be an ASCII string, doesn't that mean we have corrupted input?
Just posing questions to see if I've understood the problem correctly.
Given that ASCII and UTF-8 have the same representation for what's in ASCII, in order to always produce valid files we have to either reject non-ASCII-for-ASCII or substitute a placeholder for the UTF-8 character, depending on how lenient we want to be.
On consumption, I would only reject non-UTF-8, no matter the source.
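The lenient option described above could be sketched like this (the `'?'` placeholder is an arbitrary assumption, not anything the spec or the library prescribes):

```rust
// Sketch of the lenient option: keep ASCII bytes and substitute a
// placeholder ('?' here, an arbitrary choice) for any non-ASCII byte,
// instead of rejecting the whole string.
fn ascii_lossy(data: &[u8]) -> String {
    data.iter()
        .map(|&b| if b.is_ascii() { b as char } else { '?' })
        .collect()
}

fn main() {
    // The two bytes of UTF-8-encoded 'é' each become a placeholder.
    assert_eq!(ascii_lossy(b"caf\xC3\xA9"), "caf??");
}
```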
You should not be able to write UTF-8 data into an ASCII string. When reading, you may also want to check that the string is really ASCII, although I'm not aware of any reader doing so. However, since this is Rust, it may not be nice to pass around UTF-8 data pretending it is ASCII.
As a note to the person implementing this (maybe me soon):
- ASCII strings could be represented by `Vec<u8>` or a type like this, probably not worth the extra complexity (at least for now). This would require validation on the parsing side, but (given some prerequisites) not on the serialization side.
- Negligible memory-efficiency improvement (nothing for `String`/`&str`; a bit when iterating over `chars`). Maybe more CPU-efficient?
- Serialization of an EBML ASCII Element must check validity if Rust string types are used (maybe `char::is_ascii`? Or alternatively, iterate over `.bytes()` and check manually).
- Rust's UTF-8 types don't need to be validated for EBML's UTF-8 type. Strings can only be invalid UTF-8 if the caller "messes up" (unsafe code + broken prerequisite). std assumes valid UTF-8 in any string, so we should be able to assume that too.
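The serialization-side asymmetry from the notes above could be sketched like this (function names are placeholders, not the library's API): the ASCII element needs a validity check, while the UTF-8 element can trust Rust's `&str` invariant.

```rust
// Sketch only: serializing string data for the two EBML element types.
// An ASCII Element must reject non-ASCII content; a UTF-8 Element needs
// no check because a Rust &str is valid UTF-8 by construction.
fn serialize_ascii(s: &str) -> Result<Vec<u8>, &'static str> {
    if s.is_ascii() {
        Ok(s.as_bytes().to_vec())
    } else {
        Err("non-ASCII data in an EBML String (ASCII) Element")
    }
}

fn serialize_utf8(s: &str) -> Vec<u8> {
    // No validation needed: std guarantees &str holds valid UTF-8.
    s.as_bytes().to_vec()
}

fn main() {
    assert!(serialize_ascii("TRAK").is_ok());
    assert!(serialize_ascii("Tr\u{fc}ck").is_err());
    assert_eq!(serialize_utf8("Tr\u{fc}ck"), "Trück".as_bytes().to_vec());
}
```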
@FreezyLemon I agree with your analysis