matroska icon indicating copy to clipboard operation
matroska copied to clipboard

Should String Elements and UTF-8 Elements be handled differently?

Open FreezyLemon opened this issue 2 years ago • 8 comments

EBML differentiates between String Elements and UTF-8 Elements.

They are almost exactly the same, except for this:

  • "String" only allows ASCII and terminator bytes.
  • "UTF-8" only allows UTF-8 and terminator bytes.

Since UTF-8 is compatible with ASCII, we can just treat both Elements like UTF-8 without any parsing failures. And it is what the library currently does: https://github.com/rust-av/matroska/blob/7f15b7c9c6e7d2ad0749da053b77f051c5e4e146/src/ebml/parse.rs#L80-L84

However, this is technically not spec-compliant as we should reject non-ASCII (or terminator) values if we have a String Element, while they may be allowed for UTF-8 Elements.

A small overview:

String = UTF-8 String != UTF-8
only String String + new type
from_utf8 (std) from_utf8 (std) + from_ascii (custom, maybe faster?)
mostly compliant compliant

Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory because Rust strings are all UTF-8 internally (so a String made up of only ASCII will only use one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?

FreezyLemon avatar Apr 27 '23 18:04 FreezyLemon

Accepting is nearly ok, writing on the other hand is to be avoided if we had a strict mode.

@robUx4 do you have opinions? What do you do in the reference implementation?

lu-zero avatar Apr 27 '23 18:04 lu-zero

Just my opinion, but I prefer being compliant with the specification in order to have a clear and unanmbiguous implementation.

Is a big effort converting bytes into UTF-8 or ASCII in terms of memory and time consumption?

Perhaps I'm wrong, but we need these conversions to write a compliant matroska file. Since we have to do that for muxing can't we apply the same ones for parsing?

Luni-4 avatar Apr 27 '23 21:04 Luni-4

You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.

lu-zero avatar Apr 27 '23 23:04 lu-zero

You do not have to convert anything, you have to reject non-ascii if your target is an ascii string.

Is this spec compliant? If we find an ascii string I suppose we should save it somewhere. If it is non-ascii, when there should be an ascii string, doesn't it mean that we have a corrupted input?

Just posing questions to see if I've understood the problem correctly

Luni-4 avatar Apr 28 '23 06:04 Luni-4

Given that ascii and utf8 have the same representation for what's in ascii, in order to produce always valid files we have to reject non-ascii-for-ascii or put a placeholder instead of an utf-8 character depending on how lenient we want to be.

On consumption, I would only reject non-utf8 no matter the source.

lu-zero avatar Apr 28 '23 08:04 lu-zero

You should not be able to write UTF-8 data in an ASCII string. When reading you may also want to check the string is really ASCII, although I'm not aware of any reader doing so. However since this is Rust it may not be nice to pass around UTF-8 data pretending to the ASCII.

robUx4 avatar May 02 '23 05:05 robUx4

As a note to the person implementing this (maybe me soon):

  • ASCII strings could be represented by Vec<u8> or a type like this, probably not worth the extra complexity (at least for now)
    • this would require validation on the parsing side, but (given some prerequisites) not on the serialization side
    • negligible memory efficiency improvement (nothing when String/&str; a bit when iterating over chars)
    • maybe more CPU-efficient?
  • Serialization of EBML ASCII Element must check validity if Rust string types are used (maybe char::is_ascii(self)? or alternatively, iterate over .bytes() and check manually)
  • Rusts UTF-8 types don't need to be validated for EBMLs UTF-8 type. Strings can only be invalid UTF-8 if the caller "messes up" (unsafe code + broken prerequisite). std assumes valid UTF-8 in any string, so we should be able to assume that too.

FreezyLemon avatar Jun 10 '23 16:06 FreezyLemon

@FreezyLemon I agree with your analysis

Luni-4 avatar Jun 11 '23 10:06 Luni-4