
extra_headers and some CDP logic do not apply to frames

Open TheTechRobo opened this issue 4 months ago • 3 comments

Hi!

I've been experimenting and found that extra_headers do not seem to apply to cross-origin frames. To test, I set extra_headers to {"warcprox-meta": """{"warc-prefix": "special-warc"}"""}, brozzled https://thetechrobo.ca/brozzler-iframe-test.html (which has a YouTube embed), and checked which requests ended up in which WARC. Intuitively, special-warc should contain every request related to the page being brozzled. But grepping for WARC-Target-URI (| uniq) tells a different story:

special-warc-20250827035114778-00000-6i5ps4mg.warc
WARC-Target-URI: https://thetechrobo.ca/brozzler-iframe-test.html
WARC-Target-URI: https://thetechrobo.ca/does/not/exist.png
WARC-Target-URI: https://www.youtube.com/embed/aPg2V5RVh7U
WARC-Target-URI: https://thetechrobo.ca/favicon.ico
WARCPROX-20250827035113470-00000-wrmp3cvu.warc
WARC-Target-URI: http://clients2.google.com/time/1/current?cup2key=9:rgqGXb-a_ZszmhF-iGROG6F-JO_DSPJoG_P-_VgbnpM&cup2hreq=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
WARC-Target-URI: https://safebrowsingohttpgateway.googleapis.com/v1/ohttp/hpkekeyconfig?key=AIzaSyBqJZh-7pA44blAaAkH6490hUFOwX0KCYM
WARC-Target-URI: https://accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&laf=b64bin&json=standard
WARC-Target-URI: https://www.youtube.com/s/player/6742b2b9/www-player.css
WARC-Target-URI: https://www.youtube.com/s/player/6742b2b9/player_ias.vflset/en_GB/embed.js
WARC-Target-URI: https://fonts.gstatic.com/s/roboto/v18/KFOlCnqEu92Fr1MmEU9fBBc4.woff2
WARC-Target-URI: https://fonts.gstatic.com/s/roboto/v18/KFOmCnqEu92Fr1Mu4mxK.woff2
WARC-Target-URI: https://www.youtube.com/s/player/6742b2b9/www-embed-player.vflset/www-embed-player.js
WARC-Target-URI: https://www.youtube.com/s/player/6742b2b9/player_ias.vflset/en_GB/base.js
WARC-Target-URI: https://www.youtube.com/s/player/6742b2b9/player_ias.vflset/en_GB/remote.js
WARC-Target-URI: https://www.google.com/js/th/z1P_mE5apSVCd16CrsEwj7UAJuHEPotZNGO7bYrdVCQ.js
WARC-Target-URI: https://i.ytimg.com/vi/aPg2V5RVh7U/default.jpg?v=682ce3a3
WARC-Target-URI: https://yt3.ggpht.com/d2sGw3qXN-qcwvaTBtCDWHXSj_LTcFzwEQpHtma55tFPMlL0x6mLkfIwbQRqxFy5y3idvPFKbpw=s68-c-k-c0x00ffffff-no-rj
WARC-Target-URI: https://jnn-pa.googleapis.com/$rpc/google.internal.waa.v1.Waa/Create
WARC-Target-URI: https://www.gstatic.com/cv/js/sender/v1/cast_sender.js
WARC-Target-URI: https://www.youtube.com/generate_204?T2xb7Q
WARC-Target-URI: https://www.gstatic.com/eureka/clank/139/cast_sender.js
WARC-Target-URI: https://jnn-pa.googleapis.com/$rpc/google.internal.waa.v1.Waa/GenerateIT
WARC-Target-URI: https://play.google.com/log?hasfast=true&authuser=0&format=json
WARC-Target-URI: https://android.clients.google.com/checkin
WARC-Target-URI: https://android.clients.google.com/c2dm/register3
WARC-Refers-To-Target-URI: https://android.clients.google.com/c2dm/register3
WARC-Target-URI: https://android.clients.google.com/c2dm/register3
WARC-Target-URI: https://www.youtube.com/youtubei/v1/log_event?alt=json

Some of this is expected, like the ListAccounts call that the browser makes on its own. But all of the frame's requisites ended up in the default WARC as well. This is likely the same root cause as the caveat with frames in #394, since headers are set per websocket connection (and each frame has its own websocket).

This could be fixed by watching for new frames and reconfiguring them as they pop up. But that would likely need some refactoring of the websocket thread (to allow for multiple connections?), and we'd probably miss some requests during the time it takes to find and connect to the new websocket.

I haven't tested it, but this likely also affects the logic inside the websocket thread, such as detecting proxy errors, on_request/on_response, console output, etc. Anything originating from a frame won't show up there.
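
For anyone who wants to reproduce the listing without grep, here is a rough Python equivalent. It is only a sketch: the function names are made up for illustration, it scans raw lines rather than parsing WARC records properly, it drops all duplicates (not just adjacent ones like uniq), and unlike a plain grep it does not match WARC-Refers-To-Target-URI lines.

```python
import gzip


def open_warc(path):
    """Open a WARC file for binary reading, handling gzip transparently."""
    with open(path, "rb") as f:
        magic = f.read(2)
    # gzip files (including multi-member .warc.gz) start with 0x1f 0x8b
    return gzip.open(path, "rb") if magic == b"\x1f\x8b" else open(path, "rb")


def target_uris(lines):
    """Return the unique WARC-Target-URI values, in first-seen order.

    `lines` is any iterable of bytes lines, e.g. an open binary file.
    """
    prefix = b"warc-target-uri:"
    seen = {}  # dicts preserve insertion order
    for line in lines:
        if line.lower().startswith(prefix):
            uri = line[len(prefix):].strip().decode("utf-8", "replace")
            seen.setdefault(uri, None)
    return list(seen)


if __name__ == "__main__":
    import sys

    # Usage: python list_uris.py special-warc-*.warc WARCPROX-*.warc
    for path in sys.argv[1:]:
        print(path)
        with open_warc(path) as f:
            for uri in target_uris(f):
                print(uri)
```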

TheTechRobo • Aug 27 '25 04:08

This appears to be fixable by switching from the page-specific websocket to the browser-wide one (listed at /json/version rather than /json), attaching to the page we want with Target.attachToTarget, and enabling auto-attach for child targets (such as iframes) with Target.setAutoAttach. Auto-attach accepts a waitForDebuggerOnStart parameter, which pauses the iframe until we're ready to handle it. I don't know whether pausing would make browser fingerprinting easier, so alternatively we might skip it and just hope that we can send the new configuration before any requests are made.

I suspect this would also fix #140, because service workers are one kind of target.
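
To make the idea concrete, here is a minimal sketch of the message sequence I have in mind. It only builds the CDP messages as dicts and does not talk to a browser, so it is untested against real Chrome. CdpPlanner and its method names are hypothetical and not part of brozzler; the target and session IDs are placeholders. I'm assuming flat session mode (flatten: true), where each message carries a top-level sessionId, and that Target.setAutoAttach has to be repeated on every newly attached session so that nested frames are picked up too. Some target types may reject the Network.* calls, which a real implementation would need to tolerate.

```python
import itertools
import json


class CdpPlanner:
    """Builds CDP messages for the browser-level websocket (hypothetical helper).

    Nothing is sent here; the caller would serialize each dict to JSON and
    write it to the websocket found via /json/version.
    """

    def __init__(self, extra_headers):
        self.extra_headers = extra_headers
        self._ids = itertools.count(1)

    def _msg(self, method, params=None, session_id=None):
        msg = {"id": next(self._ids), "method": method, "params": params or {}}
        if session_id is not None:
            msg["sessionId"] = session_id  # flat mode: route to this session
        return msg

    def attach_to_page(self, target_id):
        """Attach to the page target; the response carries result.sessionId."""
        return self._msg(
            "Target.attachToTarget", {"targetId": target_id, "flatten": True}
        )

    def configure_session(self, session_id, waiting_for_debugger=False):
        """Messages that apply our configuration to one attached session."""
        msgs = [
            self._msg("Network.enable", session_id=session_id),
            self._msg(
                "Network.setExtraHTTPHeaders",
                {"headers": self.extra_headers},
                session_id,
            ),
            self._msg(
                "Target.setAutoAttach",
                {"autoAttach": True, "waitForDebuggerOnStart": True, "flatten": True},
                session_id,
            ),
        ]
        if waiting_for_debugger:
            # Resume the paused target only after it has been configured.
            msgs.append(
                self._msg("Runtime.runIfWaitingForDebugger", session_id=session_id)
            )
        return msgs

    def on_event(self, event):
        """React to Target.attachedToTarget by configuring the new session."""
        if event.get("method") != "Target.attachedToTarget":
            return []
        params = event["params"]
        return self.configure_session(
            params["sessionId"], params.get("waitingForDebugger", False)
        )


planner = CdpPlanner({"warcprox-meta": json.dumps({"warc-prefix": "special-warc"})})
attach = planner.attach_to_page("PAGE_TARGET_ID")  # placeholder target id
page_msgs = planner.configure_session("page-session")  # placeholder session id
iframe_event = {
    "method": "Target.attachedToTarget",
    "sessionId": "page-session",
    "params": {
        "sessionId": "iframe-session",
        "targetInfo": {"type": "iframe"},
        "waitingForDebugger": True,
    },
}
iframe_msgs = planner.on_event(iframe_event)
```

The point of keeping this as a pure message builder is that the existing websocket thread could stay single-connection: it would only need to feed incoming events to something like on_event and send whatever comes back.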

Would this change be welcome?

TheTechRobo • Aug 28 '25 18:08

We could test this change on our staging cluster, if you can push it up, and we all agree it looks promising.

galgeek • Aug 28 '25 18:08

Note: we apparently already fixed #140 with #142.

galgeek • Aug 28 '25 18:08