Signature for WebM algorithm is unclear

Open rtarnaud opened this issue 5 years ago • 7 comments

Hi,

This algorithm requires parsing a vint (it would be nice if you could insert a link describing what a vint actually is), for which you provide a sub-algorithm. However, at step 4.1, it is unclear what 'sequence[index]' refers to. What is 'index' in this context?

Thanks

rtarnaud avatar Nov 18 '18 15:11 rtarnaud

There appear to be more problems with the algorithm.

First, step 6.1.5 of the matches the signature for WebM function states:

If iter is less than length - 4, abort these steps.

However, iter can never be more than 38, and length can be as much as 1445. This suggests that "less than" should actually read "greater than" (or possibly "greater than or equal to" given the text of the matching a padded sequence function).

Next, there is the parse a vint function's use of index before it is defined, as @rtarnaud points out. I assume this should actually be iter, as referenced in the function signature definition. The parse a vint function as a whole is unclear, though, as to whether it operates on the same sequence as the matches the signature for WebM function, or operates on a substring starting at iter, given that index is initially set at 0 rather than the value of iter (not to mention the check that number size is less than length in step 4). The signature definition would suggest it's the same sequence, but the steps appear to assume a substring.
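To make the ambiguity concrete, here is a minimal sketch of one possible reading, assuming the function operates on the same full sequence and starts reading at the offset passed in (the role I assume iter is meant to play). It follows the general EBML variable-length integer encoding rather than the spec's exact steps, and the names `parse_vint` and `start` are hypothetical.

```python
from typing import Tuple


def parse_vint(sequence: bytes, start: int) -> Tuple[int, int]:
    """Return (value, size_in_bytes) for the vint beginning at sequence[start].

    Assumption: `sequence` is the whole resource header and `start` plays
    the role of `iter`, so no substring is taken.
    """
    if start >= len(sequence):
        raise ValueError("offset is past the end of the sequence")

    first = sequence[start]
    mask = 0x80
    size = 1
    # The position of the first set bit in the first byte gives the size:
    # 1xxxxxxx is 1 byte, 01xxxxxx is 2 bytes, and so on up to 8 bytes.
    while size <= 8 and not (first & mask):
        mask >>= 1
        size += 1
    if size > 8:
        raise ValueError("first byte contains no length marker")
    if start + size > len(sequence):
        raise ValueError("vint runs past the end of the sequence")

    # Clear the marker bit, then append the remaining size - 1 bytes.
    value = first & (mask - 1)
    for byte in sequence[start + 1 : start + size]:
        value = (value << 8) | byte
    return value, size
```

Under this reading, the bounds check compares `start + size` against the length of the full sequence, which is the check a substring-based reading would express as "number size is less than length".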

Finally, the matching a padded sequence function states:

... eventually preceded by bytes with a value of 0x00 ...

I'm not even sure what "eventually" means in terms of parsing. Why "preceded"? The "Big Buck Bunny" sample WebM video doesn't seem to have any null bytes preceding the webm bytes. The definition of the function would seem to be plain wrong, but I haven't been able to make enough sense of EBML/Matroska/WebM to tell what the right definition is. Also, why null bytes? How many?

JKingweb avatar Apr 21 '20 16:04 JKingweb

FWIW, I consider section 5 onward as in need of work in terms of tests, ensuring the algorithms are correct and match implementations, and also to use the newer terminology from https://infra.spec.whatwg.org/. All of it is quite old and doesn't really meet the criteria we place on text today: https://whatwg.org/working-mode#changes. Unfortunately other things keep coming up that seem somewhat more pressing, but if someone was interested in this kind of work I'd be happy to assist.

annevk avatar Apr 21 '20 17:04 annevk

For context: This algorithm was added in #3 by @padenot.

(Not sure why the date on the commit is 3 years before the date on the PR.)

GPHemsley avatar Apr 29 '20 16:04 GPHemsley

I am trying to implement a version of the algorithm, and I am somewhat confused about what its implementation should be and about some of the concepts in it. Do you guys have any suggestions as to what resources I could consult to make this a bit clearer?

velezbeltran avatar Jul 31 '20 13:07 velezbeltran

So WebM is based on Matroska, which is based on EBML. None of these specs seem to be written with our target audience in mind. (Who the target audience is, I'm not sure.) I am finding it very difficult to find the information I am looking for in any of them, especially without having example files to look at.

The following resources seem to be the most useful for clarifying what is supposed to be in one of these files:

https://www.iana.org/assignments/ebml/ebml.xhtml
https://matroska-org.github.io/libebml/specs.html

These documents, along with points raised in #146, do indeed lead me to believe that there is room for improvement in the WebM sniffing algorithm.

For starters, how much parsing of a WebM/Matroska/EBML file do we actually want to do just to identify that a file should be assigned a video/webm MIME type? The current algorithm sits uncomfortably somewhere between "almost all of it" and "none at all". And it doesn't help that the format has the concept of vint, which requires its own special handling just to find out how many bytes to read (or skip!) next. And if we are going to go through all this effort, should we generalize beyond WebM to other Matroska/EBML formats?

(I'll note that prior to the introduction of the current algorithm in #3, we were only recommending sniffing the 4 bytes representing the EBML signature.)

GPHemsley avatar Aug 29 '21 20:08 GPHemsley

Another question: Why are we sniffing WebM at all? The format was only introduced in 2010, which should have been recent enough to avoid legacy Web issues.

GPHemsley avatar Aug 29 '21 20:08 GPHemsley

For reference, the following two sequences of bytes are detected as WebM in both Firefox and Chrome:

1A 45 DF A3 81 42 82 84 77 65 62 6D

1A 45 DF A3 01 FF FF FF FF FF FF FF 42 86 81 01 42 F7 81 01 42 F2 81 04 42 F3 81 08 42 87 81 01 42 85 81 01 42 82 84 77 65 62 6D
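As a rough illustration of how little parsing is needed to accept both samples, here is a minimal sketch. It is not the spec's algorithm and not what either browser does. It assumes it is enough to check the 4-byte EBML signature, skip the header size vint without enforcing it, and walk the elements that follow until a DocType element (ID 42 82) is found. The function names are hypothetical.

```python
EBML_MAGIC = bytes.fromhex("1A45DFA3")
DOCTYPE_ID = bytes.fromhex("4282")


def vint_width(first_byte: int) -> int:
    """Width in bytes of a vint starting with first_byte, or 0 if invalid."""
    for width in range(1, 9):
        if first_byte & (0x80 >> (width - 1)):
            return width
    return 0


def read_size(data: bytes, pos: int):
    """Return (value, width) of the size vint at pos, or None if truncated."""
    if pos >= len(data):
        return None
    width = vint_width(data[pos])
    if width == 0 or pos + width > len(data):
        return None
    value = data[pos] & ((0x80 >> (width - 1)) - 1)  # clear the marker bit
    for byte in data[pos + 1 : pos + width]:
        value = (value << 8) | byte
    return value, width


def looks_like_webm(data: bytes) -> bool:
    if not data.startswith(EBML_MAGIC):
        return False
    header = read_size(data, len(EBML_MAGIC))
    if header is None:
        return False
    # Assumption: the declared header size is skipped, not enforced.
    pos = len(EBML_MAGIC) + header[1]
    while pos < len(data):
        id_width = vint_width(data[pos])  # element IDs keep their marker bit
        if id_width == 0 or pos + id_width > len(data):
            return False
        element_id = data[pos : pos + id_width]
        size = read_size(data, pos + id_width)
        if size is None:
            return False
        length, size_width = size
        body_start = pos + id_width + size_width
        if element_id == DOCTYPE_ID:
            return data[body_start : body_start + length] == b"webm"
        pos = body_start + length  # skip this element's payload
    return False


sample_short = bytes.fromhex("1A 45 DF A3 81 42 82 84 77 65 62 6D")
sample_long = bytes.fromhex(
    "1A 45 DF A3 01 FF FF FF FF FF FF FF 42 86 81 01 42 F7 81 01 "
    "42 F2 81 04 42 F3 81 08 42 87 81 01 42 85 81 01 42 82 84 77 65 62 6D"
)
print(looks_like_webm(sample_short), looks_like_webm(sample_long))
```

Note that the first sample declares a header size of 1 (the 0x81 byte), which is shorter than the DocType element that follows it, so a sniffer that enforced the declared size would presumably reject it. The second sample uses an 8-byte size vint with all value bits set.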

GPHemsley avatar Aug 29 '21 21:08 GPHemsley