open-in-browser
Content sniffing implementation details
Last month I spent two weeks implementing content sniffing that was behaviorally identical to Firefox's implementation. Unfortunately, I lost the laptop before I pushed the changes, so I will document what's necessary in case anyone (maybe me?) is interested in implementing a content sniffer.
The full implementation (code and comments) consisted of about 3-5k lines of JS code (unit tests were written but are not included in this count).
The implementation details are as follows (this is a brain dump from my recollection):
- The new `webRequest.filterResponseData` API can be used to inspect and modify the response body. This filter is activated after the `webRequest.onHeadersReceived` event stage, for http(s) only. There are several bugs; see the list of bugs that I appended to the bug that introduced this new webRequest method: https://bugzilla.mozilla.org/show_bug.cgi?id=1255894#a48785057_447061
- Content sniffing happens in two stages (many more details below):
  - First, the entries in the `NS_CONTENT_SNIFFER_CATEGORY` (aka `"net-content-sniffers"`) category are used to estimate the MIME type.
  - If the type is still unknown, then basically the logic of `nsUnknownDecoder::DetermineContentType` is used (which includes entries from the `NS_DATA_SNIFFER_CATEGORY` (aka `"content-sniffing-services"`) category).
- The extension can force a specific content type after `onHeadersReceived` by using `webRequest.filterResponseData` to change the response body. For some types, prepending magic bytes can be done in a transparent way (e.g. HTML and plain text). For others, the response can be forced to HTML that in turn embeds a full-page iframe that requests the original URL (with a cache buster). The extension can then intercept this request and pipe the original response to this new request. The reason for using an iframe is to ensure that the original response stream is not aborted. If the original response is not important, redirecting would work too.
- Basically, Firefox follows this logic to determine what to do with a given response body:
  - Extract the MIME type from the `Content-Type` header.
    - Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/netwerk/base/nsURLHelper.cpp#978-1030
    - If the MIME is not set or an empty string, treat it as `application/x-unknown-content-type` and continue at the next bullet point.
  - If the MIME is supported by Firefox, display inline and don't sniff (follow the logic at `nsDocumentOpenInfo::DispatchContent` as I mentioned at https://github.com/Rob--W/open-in-browser/issues/1#issuecomment-331710653).
    - Exception: for the `text/plain`, `application/octet-stream` and `application/x-unknown-content-type` MIME types, Firefox MAY activate content sniffing, and open a download dialog even if the content would otherwise be displayed inline (`text/plain`), or display the content inline even though the content usually triggers a download dialog (`application/octet-stream`).
  - If the MIME is not recognized by Firefox, open a download dialog.
  - If the MIME is `application/octet-stream` or `application/x-unknown-content-type`, perform media sniffing:
    - Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/toolkit/components/mediasniffer/nsMediaSniffer.cpp#141-210
    - Note: If a document was sniffed as media, Firefox will immediately switch to a media document, and the `webRequest.filterResponseData` method can NOT be used to modify the response stream. To replace the document, you must run a content script in this new media document.
  - If the MIME is `text/html`, `application/octet-stream` or contains "xml", then the feed sniffer is activated.
    - Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/browser/components/feeds/nsFeedSniffer.cpp#206-336
    - Note: I did not implement this because the conditions are rare and the type was already displayed inline (I only need to implement content sniffing if the type is potentially going to display a download dialog, since Open in Browser is only relevant for that situation).
  - If the `Content-Type` is a case-sensitive match for `text/plain`, `text/plain; charset=ISO-8859-1`, `text/plain; charset=iso-8859-1` or `text/plain; charset=UTF-8`, AND the `Content-Encoding` response header is NOT set, then the sniffer will either force a download dialog or display inline:
    - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#895-943
    - Basically, if the response starts with a Unicode BOM, or the first 512 bytes (or fewer if the response ends early) consist only of text characters: treat as text. Otherwise `application/octet-stream` = download dialog.
      - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#666-714
  - If the MIME is `application/x-unknown-content-type` (or empty, as mentioned before), sniff magic bytes.
    - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#434-530
    - Basically, the MIME is found in the following order:
      - Look at magic bytes.
      - Call the sniffers in the `NS_DATA_SNIFFER_CATEGORY` (aka `"content-sniffing-services"`) category:
        - Media sniffer - https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/toolkit/components/mediasniffer/nsMediaSniffer.cpp#141-210 (complicated - magic bytes and structure parsing)
        - Image sniffer - https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/image/imgLoader.cpp#2646-2701 (simple - magic bytes only)
      - Try HTML sniffing.
      - Try sniffing from the URL.
      - Fall back to the same method as `text/plain` sniffing (which would result in `text/plain` or `application/octet-stream`).
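
The `text/plain` branch and the unknown-type branch of the logic above can be sketched roughly as follows. This is a simplified illustration, not the lost implementation and not Firefox's code: all function names are hypothetical, the header parser ignores edge cases that the linked `nsURLHelper.cpp` handles, the set of "text characters" and the BOM list are assumptions, the signature tables hold only a few example entries, and media, feed, HTML and URL sniffing are left out.

```javascript
// Firefox's placeholder MIME for "type not known" (see the list above).
const UNKNOWN_TYPE = "application/x-unknown-content-type";

// Header values eligible for text/plain sniffing (case-sensitive match).
const SNIFFABLE_TEXT_PLAIN = new Set([
  "text/plain",
  "text/plain; charset=ISO-8859-1",
  "text/plain; charset=iso-8859-1",
  "text/plain; charset=UTF-8",
]);

// Simplified parser: the part before the first ";", trimmed and lowercased.
function extractMimeType(contentTypeHeader) {
  const mime = (contentTypeHeader || "").split(";")[0].trim().toLowerCase();
  return mime || UNKNOWN_TYPE;
}

function startsWithBytes(bytes, prefix) {
  return prefix.length <= bytes.length && prefix.every((b, i) => bytes[i] === b);
}

// ASSUMPTION: bytes above 31, plus 9-13 (tab, LF, VT, FF, CR) and 27 (ESC),
// count as text characters.
function isTextByte(b) {
  return b > 31 || (b >= 9 && b <= 13) || b === 27;
}

// ASSUMPTION: the recognized BOMs are UTF-8, UTF-16BE and UTF-16LE.
const BOMS = [[0xef, 0xbb, 0xbf], [0xfe, 0xff], [0xff, 0xfe]];

// text/plain if a BOM is present or the first 512 bytes are all text bytes.
function sniffTextOrBinary(bytes) {
  const head = bytes.subarray(0, 512);
  if (BOMS.some((bom) => startsWithBytes(head, bom))) return "text/plain";
  return head.every(isTextByte) ? "text/plain" : "application/octet-stream";
}

// ASSUMPTION: tiny illustrative signature tables, far from complete.
const MAGIC = [
  { type: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
];
const IMAGE_MAGIC = [
  { type: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { type: "image/gif", bytes: [0x47, 0x49, 0x46, 0x38, 0x39, 0x61] }, // "GIF89a"
  { type: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
];

function matchSignature(table, bytes) {
  const hit = table.find((entry) => startsWithBytes(bytes, entry.bytes));
  return hit ? hit.type : null;
}

// Order for unknown types: magic bytes, data sniffers, then the text fallback.
function sniffUnknownType(bytes) {
  return (
    matchSignature(MAGIC, bytes) ||
    matchSignature(IMAGE_MAGIC, bytes) ||
    sniffTextOrBinary(bytes)
  );
}

// Returns the MIME type to act on for the two branches covered here.
// `bytes` must be a Uint8Array holding the start of the response body.
function sniffResponse(contentTypeHeader, contentEncodingHeader, bytes) {
  const mime = extractMimeType(contentTypeHeader);
  if (mime === UNKNOWN_TYPE) return sniffUnknownType(bytes);
  if (SNIFFABLE_TEXT_PLAIN.has(contentTypeHeader) && !contentEncodingHeader) {
    return sniffTextOrBinary(bytes);
  }
  return mime; // no sniffing in this sketch
}
```

In this sketch, `sniffResponse("text/plain", undefined, Uint8Array.from([0, 1, 2]))` yields `application/octet-stream`, while the same bytes with a `Content-Encoding` of `gzip` are left as `text/plain` because sniffing is skipped.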
Other notes relevant for the implementation:
- Content sniffing relies on up to 512 bytes of data, but the media sniffer may try to use more if available.
- At least for text and HTML, Firefox will only display the response after 512 bytes of data have been written (or 1024, I don't remember).
- For images and media, Firefox will switch to a special image/media document upon detecting the type (typically via magic bytes; for media sniffer more than magic bytes).
- There is a draft for a specification at https://mimesniff.spec.whatwg.org/. This specification is close to Firefox's content sniffing. It does not mention media sniffing for `application/octet-stream`, nor the special `application/x-unknown-content-type` (this MIME is an artefact of Firefox's implementation; internally it represents the default value for a MIME type in an HTTP channel).
- Character encoding should be respected/supported. For `text/plain` the UTF-8 and UTF-16 BOM can be used. For `text/html`, the content can be transcoded via the TextDecoder/TextEncoder APIs (except for UTF-16, which should not be used for HTML anyway).
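
Since sniffing relies on up to 512 bytes, an extension has to hold back the start of the response until enough data has arrived. The following is a minimal sketch of that buffering, assuming `filter` behaves like the `StreamFilter` returned by `webRequest.filterResponseData`; the helper name `bufferForSniffing` and the `decide` callback are hypothetical, and the sketch only observes the data and forwards it unmodified.

```javascript
const SNIFF_LIMIT = 512; // bytes, per the notes above

// `filter` is assumed to behave like a webRequest StreamFilter.
// `decide` is a hypothetical callback that receives the bytes to sniff.
function bufferForSniffing(filter, decide) {
  const chunks = [];
  let buffered = 0;
  let decided = false;

  // Joins the buffered chunks, reports the head, then forwards everything.
  function flush() {
    decided = true;
    const all = new Uint8Array(buffered);
    let offset = 0;
    for (const chunk of chunks) {
      all.set(chunk, offset);
      offset += chunk.length;
    }
    decide(all.subarray(0, SNIFF_LIMIT));
    filter.write(all);
  }

  filter.ondata = (event) => {
    chunks.push(new Uint8Array(event.data));
    buffered += event.data.byteLength;
    if (buffered >= SNIFF_LIMIT) {
      flush();
      filter.disconnect(); // let the rest of the response pass through
    }
  };

  filter.onstop = () => {
    if (decided) return;
    flush(); // the response ended before SNIFF_LIMIT bytes arrived
    filter.close();
  };
}
```

In an extension, `filter` would come from `browser.webRequest.filterResponseData(details.requestId)` inside a blocking webRequest listener. A real implementation would replace the plain `filter.write(all)` with whatever rewriting the sniffed type calls for.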
Bugs in the webRequest.filterResponseData API that I haven't reported upstream (yet?):
- If the `Content-Type` is `application/x-unknown-content-type` and the response is content-encoded, then the filtered response must also be encoded using the same type (e.g. gzipped) (for other types, e.g. `text/html`, the encoding is transparent, i.e. the value of the `Content-Encoding` header does not matter). The easiest way around this is to remove the `Accept-Encoding` request header or the `Content-Encoding` response header (or set it to "identity"). The more difficult way to get around this is to implement gzipping (and possibly other (obscure) encoding schemes such as deflate/brotli).
- If a `StreamFilter` is closed, Firefox will always commit a navigation to a new document, even if no data was written to that `StreamFilter`, and even if the tab/frame has navigated to a different page. The only work-around that I could think of is to keep the `StreamFilter` open forever (yuck).
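
The first workaround for the content-encoding bug (removing the `Accept-Encoding` request header) could look roughly like this. It is a sketch under assumptions: the helper name `stripAcceptEncoding` is hypothetical, the URL and type filters are placeholders, and the listener is only registered when a `browser.webRequest` object exists, i.e. inside an extension with the `webRequest` and `webRequestBlocking` permissions plus matching host permissions.

```javascript
// Returns a copy of a webRequest `requestHeaders` array without Accept-Encoding.
function stripAcceptEncoding(requestHeaders) {
  return requestHeaders.filter(
    (header) => header.name.toLowerCase() !== "accept-encoding"
  );
}

// Wiring: only runs inside an extension, where `browser.webRequest` exists.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onBeforeSendHeaders.addListener(
    (details) => ({
      requestHeaders: stripAcceptEncoding(details.requestHeaders),
    }),
    // Placeholder filter: all top-level and frame navigations.
    { urls: ["<all_urls>"], types: ["main_frame", "sub_frame"] },
    ["blocking", "requestHeaders"]
  );
}
```

Without an `Accept-Encoding` header, a well-behaved server should send the body unencoded, so the filtered response does not need to be re-compressed. The trade-off is more bytes on the wire for every affected request.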
Can you look here? https://bugzilla.mozilla.org/show_bug.cgi?id=1287264
@def00111 I looked (and I filed a new feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479). Why did you want me to look at that bug?
> Why did you want me to look at that bug?
I just want to have you look at this bug :)
Maybe, we can also expose nsIChannel.contentDispositionFilename [1]?
[1] https://dxr.mozilla.org/mozilla-central/rev/2386800ec051598ff4dd42da1118abcf05299fc1/netwerk/base/nsIChannel.idl#327
I also have another idea. Can we add the download [1] attribute value to webRequest.onBeforeRequest details [2]? To get the filename from download attribute? Like with Content-Disposition header in webRequest.onHeadersReceived [3]?
Look here please: https://github.com/def00111/always-preview/blob/master/content.js
[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a#attr-download
[2] https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/onBeforeRequest#details
[3] https://developer.mozilla.org/en-US/Add-ons/WebExtensions/API/webRequest/onHeadersReceived
> Maybe, we can also expose nsIChannel.contentDispositionFilename [1]?

This extension is a very specialized use case. While having such a property would make my life as an extension developer easier, I don't think that convenience outweighs the maintenance cost of exposing the info through the webRequest extension API, especially since it can be fully implemented in JavaScript with minimal performance impact: https://github.com/Rob--W/open-in-browser/blob/05b80a3ce151737cfc7735eb1a714dfa84f3e3a5/extension/content-disposition.js
> Can we add the download [1] attribute value to webRequest.onBeforeRequest details [2]?
This, on the other hand, could be a good reason to support the API enhancement. But...:
`<a download>` does not work for cross-origin resources, only same-origin resources. Furthermore, `<a download>` is more commonly used for JS-generated content (blob:/data:-URLs), which is not intercepted by my extension. So the value of an accessor for the value of `<a download>` is limited.
In the case of `<a download>` to a same-origin resource without a Content-Disposition response header (which I presume is rare), users can just open the link in a new tab (to view it inline or to trigger an Open in Browser dialog). In the worst case (e.g. if the link is not visible), they can use the extension menu in the Tools menu to force the dialog to appear anyway.
I appreciate your comments, but I'd like to keep the comments here on-topic. If you have more to say (unrelated to content sniffing), please open a new issue or continue via e-mail.
> `<a download>` is more commonly used for JS-generated content (blob:/data:-URLs), which is not intercepted by my extension. So the value of an accessor for the value of `<a download>` is limited.
This page: https://atpscan.global.hornetsecurity.com/safe_download.php?uri=aHR0cHM6Ly93d3cuc3dwLWJlcmxpbi5vcmcvZmlsZWFkbWluL2NvbnRlbnRzL3Byb2R1Y3RzL2FrdHVlbGwvMjAxNUEwM193Z24ucGRm&cd=MjAxNWEwM193Z24ucGRm&type=dat
Can I use content-disposition.js [1] in my add-on?
[1] https://github.com/Rob--W/open-in-browser/blob/05b80a3ce151737cfc7735eb1a714dfa84f3e3a5/extension/content-disposition.js
Is this the same as what Firefox does?
> Can i use content-disposition.js [1] in my add-on?
Yes. When you add a commit in your repo, do link back to the original source in the commit description. Then in the future it will be easier for others to check whether the implementation is still up-to-date.
> Is this the same what firefox does?
Yes, except for a few cases of malformed response headers (I don't think that you will ever find these in the wild). See the commit description and unit tests from https://github.com/Rob--W/open-in-browser/commit/6f3bbb8bbfc1e3e943200fffdb68d35075e82ddd