
Make http response example more robust

wingman-jr-addon opened this issue 5 years ago • 9 comments

I've been using WebRequest filtering for quite some time now in one of my plugins, and while it's easy to get started with toy examples, I've found that there is significant complexity that must be addressed when trying to do any sort of text manipulation in a more robust manner. (Some of these deficiencies are discussed in #364 and partially addressed in the MDN docs.) I thought extending this extension would be the right place to start, as I would have wished to know these details when I started my plugin.

This extension now demonstrates three important things:

  • The accumulation of data through multiple calls to .ondata
  • The decoding of binary data to text in a streaming fashion. (Not necessarily performant, but should be correct.)
  • Text decoding that tries to respect the page's reported encoding via Content-Type. (A rough sketch combining all three points follows this list.)

As an observation, browser support for two things would clean this up tremendously:

  • Exposing the detected character encoding on the webRequest details (at least insofar as it is detectable from the headers) would avoid manual, poor detection logic like the code I have here (roughly sketched after this list).
  • Adding support for different encodings for TextEncoder so that the original character encoding can be preserved rather than always converting to UTF-8.

At any rate, let me know what you think! I hope this will provide a somewhat stronger starting point for others using webRequest.

wingman-jr-addon avatar Nov 05 '20 05:11 wingman-jr-addon

@caitmuenster Here is the PR that I was curious about; you can see the question I posted on Discourse as well if you're curious. Thanks! (wingman-jr-addon and jessetrana are both me here, zephyr on Discourse)

wingman-jr-addon avatar Nov 07 '20 02:11 wingman-jr-addon

Thanks @wingman-jr-addon! Adding this to @Rob--W's queue for review. :)

caitmuenster avatar Nov 10 '20 17:11 caitmuenster

Thanks @Rob--W for taking the time to look at this and provide helpful feedback, and thanks @caitmuenster for getting this going. I haven't gotten much feedback from the community on this aspect of using WebRequest, yet I think it is vital to any addon trying to filter text. I'll take a look at your comments; you may have better ideas on how to solve some of the problems I'm facing, but it may take a round or two of feedback before I land in the right spot. Thanks!

wingman-jr-addon avatar Nov 11 '20 03:11 wingman-jr-addon

@Rob--W I've taken another pass at the code and have closed out the open tasks I set up on the review. Ultimately, I still dislike the code, but I think it's an improvement and thanks for your help. Let me know if it passes muster.

I'd also like to surface, at the top level, the conversation we had going in the sub-thread about e.g. exposing the detected MIME type. We were saying:

Given both the complexity of determining the charset as well as the need for it to be accurate in order to filter text properly, do you think the browser-predicted charset would ever be a candidate to expose on the details object?

What a good idea. I have thought of it before and posted a feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479. Unfortunately, the feature request was denied. If you post a new bug that shows use cases and explains why the alternatives aren't a good fit, the feature request can be re-evaluated.

Also, the link to your original GitHub issue in there has quite valuable implementation details, so I'd like to link to it as well: https://github.com/Rob--W/open-in-browser/issues/5

I'd like to discuss this issue more. Assuming there is enough use of webRequest filtering in general, I think there is a strong case for a bug, because it is still nigh on impossible to do basic text replacement. I now know just enough of the complexities to understand why this is such a pain for the API; the core question is: at what point is the final MIME type known? Given the complexity of the process, and the fact that the browser generally owns and controls it wholly, any scenario where one or more addons interact with it will be tricky. It's not quite the halting problem, but I'm pretty sure something in MIME type detection could be Turing complete.

As I see it, there are many possible depths of solutions that could work to varying degrees:

  1. The status quo: push all detection and re-encoding onto the addon. This seems bad because the logic is complicated and poorly duplicates what the browser is already doing. Coverage: 0%
  2. Provide a solution that does detection solely based on headers; as far as the addon is concerned, the MIME type would be fixed after that. Also, extend TextEncoder so that it can re-encode to the detected encoding, avoiding the need to mutate Content-Type. This would allow for basically one-line implementations (a hypothetical sketch follows this list), but would be incorrect when the type is declared in a <meta> tag or is sniffed. Coverage: maybe 98%?
  3. Delay processing by the addon until the content type has been sniffed (up to the first 512 bytes). Unfortunately, I think the addon will likely still want to change e.g. the Content-Type header based on this, which sounds like a potential implementation headache. Otherwise, something like an onContentTypeSniffed hook (or perhaps an option on onHeadersReceived) could be tremendously valuable. Yes, it would also cause problems of a sort if multiple addons chained it. So be it. Coverage: maybe 99.5%?
  4. Expose the current MIME type as the browser continues detection, along with the ability to instruct the browser to change the MIME type along the way. Complexity? High. Difficulty to implement? Also probably high. Coverage: maybe 99.9%?

So I'd probably lean towards something more in the vein of option 2 or option 3; I think option 3 has the potential to be more problematic from an implementation standpoint.

After pondering the problem for another 3 years, what thoughts have you had on the matter?

wingman-jr-addon avatar Nov 15 '20 23:11 wingman-jr-addon

Also, I did a little research tonight. I searched GitHub for "TextEncoder" and "filterResponseData" and surveyed many of the results. Not exactly the most scientific - and I likely have errors in my hasty analysis - but here are my rough findings:

  • 83 source files were examined. I tried not to include too many duplicates and avoided e.g. forks of Firefox unit tests.
  • 3 of the sources seemed to recognize and fix both decoding and encoding based on charset etc. (uBlock, Scriptlet Doctor, AdGuard)
  • 2 other sources had at least partial mitigations for both decoding and encoding. (LibreJS, Header Editor)
  • 2 other sources recognized the issue but did not attempt to fix it.
  • 18 or so appear to be direct copies/modifications of the http-response example, or inherit the .disconnect logic in the first .ondata that is currently present in this example.
  • Many sources did not do HTTP status filtering.
  • Many sources decoded with the stream option but did not flush in e.g. .onstop (the anti-pattern is illustrated below).

Those are rather abysmal results, but not surprising. AnalyzeWebRequestUsage.zip

This indicates to me that there is a demonstrated need for awareness as well as mechanisms to make some level of correctness easier - character encoding seems like something that should be a minor footnote for the average addon developer.

wingman-jr-addon avatar Nov 16 '20 05:11 wingman-jr-addon

@Rob--W I know this will test the memory a little but, based on your comment, are you happy with the example? If you are, I will test it to make sure it works as expected and then merge the changes.

rebloor avatar Aug 11 '23 18:08 rebloor

@rebloor @Rob--W I saw that activity popped up on this and thought I'd chime in a bit as I've gotten more experience with the issue as my addon has matured. Unfortunately it hasn't translated to the work here (yet?) as I thought this effort was dead. If there's interest I can take another crack at freshening this up.

In short, I now think that probably the best way to handle the issue is to provide a special linked pair of TextDecoder/TextEncoder that hide the detection altogether and allow the developer to just feed in the stream, get a string, and reencode the string. You still need the raw bytes in many cases but additionally for text handling a built-in like this would help greatly.

The big reason for this approach is primarily that I've now seen many of the different methods that can in theory show up actually get reported as bugs, including more difficult ones like the <meta> approach, which requires sniffing the stream. Ensuring the encoder and decoder stay in sync as the detected type changes is surely work better left to the browser.

(One further note/update: not having TextEncoder support encoding into all the encodings TextDecoder can support is a frustrating limitation as I can't simply sniff the charset, decode, and then encode cleanly - this is part of what would be nice with the TextEncoder pair being linked.)

I'd also be curious to hear how @Rob--W 's thoughts have evolved now over the past 6 years since his original feature request.

wingman-jr-addon avatar Aug 14 '23 06:08 wingman-jr-addon

@rebloor @Rob--W I saw that activity popped up on this and thought I'd chime in a bit as I've gotten more experience with the issue as my addon has matured. Unfortunately it hasn't translated to the work here (yet?) as I thought this effort was dead. If there's interest I can take another crack at freshening this up.

Oops, this fell through the cracks. I'll perform a review once this is freshened up.

(The rest of the text below is part of the ongoing discussion about a feature request for Firefox; you can stop reading if you are only interested in the PR)

In short, I now think that probably the best way to handle the issue is to provide a special linked pair of TextDecoder/TextEncoder that hide the detection altogether and allow the developer to just feed in the stream, get a string, and reencode the string. You still need the raw bytes in many cases but additionally for text handling a built-in like this would help greatly.

This sounds very convenient from the extension developer's point of view, but a huge burden on the extension framework. Due to content sniffing, the target encoding may sometimes depend on the data itself (e.g. binary vs. something else). And sometimes there is no lossless way to re-encode from one encoding to another.

The big reason for this approach is primarily that I've now seen many of the different methods that can in theory show up actually get reported as bugs, including more difficult ones like the <meta> approach, which requires sniffing the stream. Ensuring the encoder and decoder stay in sync as the detected type changes is surely work better left to the browser.

I suppose that you are primarily interested in replacing specific parts of the content without changing the encoding. When the byte sequence is unambiguous (the keyword and replacement), you could also operate on the bytes directly, without using TextEncoder/TextDecoder.

(One further note/update: not having TextEncoder support encoding into all the encodings TextDecoder can support is a frustrating limitation as I can't simply sniff the charset, decode, and then encode cleanly - this is part of what would be nice with the TextEncoder pair being linked.)

Non-UTF-8 encodings have little utility in web browsers, and having encoders for them would bloat the implementation, resulting in a larger application binary as well as a larger attack surface. So while I again understand the convenience to developers, I also recognize the cost of having this functionality natively. Note: if you really want to, nothing is stopping you from shipping the reverse encoders yourself, but that's probably not optimal.

I'd also be curious to hear how @Rob--W 's thoughts have evolved now over the past 6 years since his original feature request.

On https://bugzilla.mozilla.org/show_bug.cgi?id=1425479? When I filed that issue, I was an external contributor with experience as an extension developer and in browser engineering in Firefox and Chromium (as a volunteer). Now I have been working for a few years as a browser engineer at Mozilla, and one thing that stands out is that there are significantly more ideas than time available to implement them all. So even if an idea seems useful on its own, it may not make the cut due to prioritization, especially if the feature is niche but requires considerable engineering effort and maintenance.

Rob--W avatar Sep 30 '23 22:09 Rob--W

@wingman-jr-addon please ping Rob when you've had the opportunity to address his feedback and make any other changes.

rebloor avatar Oct 01 '23 20:10 rebloor