
Use global browser WebSocket instead of the page-specific one

Open TheTechRobo opened this issue 3 months ago • 2 comments

This PR moves from the target-specific websocket to the global websocket, attaching to the page using Target.attachToTarget.

Enabling auto-attach lets us control cross-origin frames: we can see their network logs, inject the stealth code into them, and log their console messages. It also means extra_headers will work correctly, which is vital when using Warcprox-Meta.
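To illustrate the flow, here is a minimal sketch of the messages that go over the browser-level websocket. It is not the code in this PR: the helper names (cdp_message, attach_to_page, on_attached) are made up for the example, and only the CDP method names and parameters are real. It assumes "flat" session mode, where each command for an attached target carries a top-level sessionId.

```python
import itertools
import json

# Message ids only need to be unique per websocket connection.
_ids = itertools.count(1)


def cdp_message(method, params=None, session_id=None):
    """Build one CDP command as a JSON string (hypothetical helper)."""
    msg = {"id": next(_ids), "method": method, "params": params or {}}
    if session_id is not None:
        # In flat mode, the sessionId routes the command to one attached target.
        msg["sessionId"] = session_id
    return json.dumps(msg)


AUTO_ATTACH = {"autoAttach": True, "waitForDebuggerOnStart": True, "flatten": True}


def attach_to_page(target_id):
    """Sent on the browser-level connection, so it carries no sessionId."""
    return cdp_message(
        "Target.attachToTarget", {"targetId": target_id, "flatten": True}
    )


def on_attached(event, extra_headers):
    """Commands to send when a Target.attachedToTarget event arrives."""
    sid = event["params"]["sessionId"]
    return [
        cdp_message("Network.enable", session_id=sid),
        cdp_message(
            "Network.setExtraHTTPHeaders", {"headers": extra_headers}, sid
        ),
        # Ask the new target to report its own children (nested frames) too.
        cdp_message("Target.setAutoAttach", AUTO_ATTACH, sid),
        # The target is paused because of waitForDebuggerOnStart; resume it.
        cdp_message("Runtime.runIfWaitingForDebugger", session_id=sid),
    ]
```

Because every attached target, including a cross-origin frame, gets its own session, the extra headers and network logging are set up once per session rather than only on the top-level page.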

I have created a quick-and-dirty script to show the difference. It assumes warcprox is running on port 8000, so that the custom headers can be tested. (There's a minor typo where it says your CPU has X cores, when it should say threads. I didn't want to redo all the tests to fix that.) https://transfer.archivete.am/inline/7D0pv/brozzler_pr_test.py

You can run it to test same-origin frames with brozzler_pr_test.py https://thetechrobo.ca/brozzler_session_test/same-origin.html, and cross-origin with brozzler_pr_test.py https://thetechrobo.ca/brozzler_session_test/cross-origin.html.

For convenience, here are the results I get:

same-origin, master branch

image
Outer:
Got log message

Inner:
Got log message

Network requests:
693FB8B9DEC4B47E22D3C9BB5717E11B https://thetechrobo.ca/brozzler_session_test/same-origin.html 
333229.2 https://thetechrobo.ca/brozzler_session_test/script.js 
04013371C4B6379B3F61C3FDB284907F https://thetechrobo.ca/brozzler_session_test/frame.html 
333229.5 https://thetechrobo.ca/brozzler_session_test/script.js 
333229.7 https://thetechrobo.ca/brozzler_session_test/a.file 
333229.8 https://thetechrobo.ca/favicon.ico

and all of those requests can be found in the correct WARC file:

warcs/the-prefix-20251008020531642-00000-6vaowz3t.warc
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/same-origin.html
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/script.js
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/frame.html
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/a.file
WARC-Target-URI: https://thetechrobo.ca/favicon.ico

warcs/WARCPROX-20251008020529700-00000-6tk1anvb.warc
WARC-Target-URI: http://clients2.google.com/time/1/current?cup2key=9:78Ecu_Au6Bozk0IWiZHvoW8jIlj3ZbY-YVnjPmV8W4Y&cup2hreq=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
WARC-Target-URI: https://safebrowsingohttpgateway.googleapis.com/v1/ohttp/hpkekeyconfig?key=AIzaSyBqJZh-7pA44blAaAkH6490hUFOwX0KCYM
WARC-Target-URI: https://www.google.com/async/folae?async=_fmt:pb
WARC-Target-URI: https://accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&laf=b64bin&json=standard
WARC-Target-URI: https://android.clients.google.com/checkin
WARC-Target-URI: https://android.clients.google.com/c2dm/register3
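A listing like the one above can be reproduced with a few lines of Python. This is only a sketch, roughly equivalent to grepping for the header, and the function names are made up for the example. It scans every line, so a payload that happens to contain a line starting with WARC-Target-URI: would also match.

```python
import gzip
import io


def warc_target_uris(stream):
    """Yield WARC-Target-URI values from a binary stream of WARC records."""
    for line in stream:
        if line.lower().startswith(b"warc-target-uri:"):
            # Keep everything after the first colon; the URI has colons too.
            yield line.split(b":", 1)[1].strip().decode("utf-8", "replace")


def list_warc(path):
    """Return the target URIs in a .warc or .warc.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return list(warc_target_uris(f))
```

Running list_warc on the prefixed WARC and on the default WARCPROX-* WARC shows which file each request ended up in.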

cross-origin, master branch

image

Stealth code doesn't work, nor does our custom user agent.

Outer:
Got log message

Inner:
Didn't get log message

Network requests:
925EBC7C95C891B5810248CC6D27FD70 https://thetechrobo.ca/brozzler_session_test/cross-origin.html 
334929.2 https://thetechrobo.ca/brozzler_session_test/script.js 
43286E4AC78370643428C277B27428C0 https://very-good-quality-co.de/brozzler_session_test/frame.html (NO RESPONSE)
334929.4 https://thetechrobo.ca/favicon.ico 

We never see the requests originating from the frame. We also are never even told that the frame finished loading, which is why _should_track_request exists.

Finally, requests originating from the frame are put in the wrong WARC, as Warcprox-Meta isn't sent:

warcs/the-prefix-20251008015640720-00000-fv68t1gw.warc
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/cross-origin.html
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/script.js
WARC-Target-URI: https://very-good-quality-co.de/brozzler_session_test/frame.html
WARC-Target-URI: https://thetechrobo.ca/favicon.ico

warcs/WARCPROX-20251008015639072-00000-yb2xu0z6.warc
WARC-Target-URI: http://clients2.google.com/time/1/current?cup2key=9:F22cMXtPOOZ5wf3kI6v2hJe_R8UyJI71pUz-_o-pTB8&cup2hreq=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
WARC-Target-URI: https://safebrowsingohttpgateway.googleapis.com/v1/ohttp/hpkekeyconfig?key=AIzaSyBqJZh-7pA44blAaAkH6490hUFOwX0KCYM
WARC-Target-URI: https://www.google.com/async/folae?async=_fmt:pb
WARC-Target-URI: https://accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&laf=b64bin&json=standard
WARC-Target-URI: https://very-good-quality-co.de/brozzler_session_test/script.js
WARC-Target-URI: https://very-good-quality-co.de/brozzler_session_test/a.file
WARC-Target-URI: https://android.clients.google.com/checkin
WARC-Target-URI: https://android.clients.google.com/c2dm/register3

same-origin, this branch

As expected with same-origin, everything still works:

image
Outer:
Got log message

Inner:
Got log message

Network requests:
ABA02B930CA6726E6884CA2DFFC164F9 https://thetechrobo.ca/brozzler_session_test/same-origin.html 
337499.2 https://thetechrobo.ca/brozzler_session_test/script.js 
31E708C3FEA7206623BE266EFB27954E https://thetechrobo.ca/brozzler_session_test/frame.html 
337499.5 https://thetechrobo.ca/brozzler_session_test/script.js 
337499.7 https://thetechrobo.ca/brozzler_session_test/a.file 
337499.8 https://thetechrobo.ca/favicon.ico

warcs/the-prefix-20251008020739039-00000-rk5pbe6x.warc
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/same-origin.html
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/script.js
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/frame.html
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/a.file
WARC-Target-URI: https://thetechrobo.ca/favicon.ico

warcs/WARCPROX-20251008020736744-00000-4v67bzwx.warc
WARC-Target-URI: http://clients2.google.com/time/1/current?cup2key=9:H71F5eVlx36Ivqa_i9iFAC3tcC7rLIH0j95LbWN-5N8&cup2hreq=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
WARC-Target-URI: https://safebrowsingohttpgateway.googleapis.com/v1/ohttp/hpkekeyconfig?key=AIzaSyBqJZh-7pA44blAaAkH6490hUFOwX0KCYM
WARC-Target-URI: https://www.google.com/async/folae?async=_fmt:pb
WARC-Target-URI: https://accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&laf=b64bin&json=standard
WARC-Target-URI: https://android.clients.google.com/checkin
WARC-Target-URI: https://android.clients.google.com/c2dm/register3

cross-origin, this branch

And cross-origin frames do too!

image
Outer:
Got log message

Inner:
Got log message

Network requests:
D51F94F92AA894A5DF39AE80F1ADF60A https://thetechrobo.ca/brozzler_session_test/cross-origin.html 
337999.2 https://thetechrobo.ca/brozzler_session_test/script.js 
E6CF6612D0E4794334F06831929FE9CF https://very-good-quality-co.de/brozzler_session_test/frame.html 
338123.2 https://very-good-quality-co.de/brozzler_session_test/script.js 
338123.4 https://very-good-quality-co.de/brozzler_session_test/a.file 
337999.4 https://thetechrobo.ca/favicon.ico

warcs/the-prefix-20251008020932892-00000-eimh02ng.warc
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/cross-origin.html
WARC-Target-URI: https://thetechrobo.ca/brozzler_session_test/script.js
WARC-Target-URI: https://very-good-quality-co.de/brozzler_session_test/frame.html
WARC-Target-URI: https://very-good-quality-co.de/brozzler_session_test/script.js
WARC-Target-URI: https://thetechrobo.ca/favicon.ico
WARC-Target-URI: https://very-good-quality-co.de/brozzler_session_test/a.file

warcs/WARCPROX-20251008020930598-00000-hsvwlx8b.warc
WARC-Target-URI: http://clients2.google.com/time/1/current?cup2key=9:Eg30dEYPnaFEWAMmHdxbG4t3skPvMb4Sqds_jUShGrg&cup2hreq=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
WARC-Target-URI: https://safebrowsingohttpgateway.googleapis.com/v1/ohttp/hpkekeyconfig?key=AIzaSyBqJZh-7pA44blAaAkH6490hUFOwX0KCYM
WARC-Target-URI: https://www.google.com/async/folae?async=_fmt:pb
WARC-Target-URI: https://accounts.google.com/ListAccounts?gpsia=1&source=ChromiumBrowser&laf=b64bin&json=standard
WARC-Target-URI: https://android.clients.google.com/checkin
WARC-Target-URI: https://android.clients.google.com/c2dm/register3

Notes:

  • A lot of stuff has been refactored. I tried to keep things backward compatible, but if I missed anything please let me know.
  • Another lock was necessary, this one in the Counter class, because Python still somehow doesn't have an atomic increment.
    • I feel bad doing this, but it is only rarely locked by the websocket thread, and uncontended locks are apparently basically free.
  • Because we now properly receive a response for the frame's request, we can get rid of the _should_track_request function.
  • The Target.attachedToTarget event we rely on is technically labelled experimental. But it's been labelled that way since 2020.
  • configure_browser currently waits for a response from the browser when setting up stealth. That wait has been removed, since blocking on a response from the websocket thread would cause a deadlock. Is the wait actually necessary?
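
For reference, the locking mentioned in the Counter note follows the usual pattern below. This is a minimal sketch, not the class from this PR; the method and attribute names are assumptions made for the example.

```python
import threading


class Counter:
    """Thread-safe incrementing counter (illustrative, not brozzler's class)."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def next(self):
        # The read-modify-write must happen under the lock; a bare
        # `self._value += 1` is not guaranteed to be atomic across threads.
        with self._lock:
            self._value += 1
            return self._value

    @property
    def value(self):
        with self._lock:
            return self._value
```

With the lock in place, two threads can never be handed the same value, which is what matters when the values are used as message ids.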

Please let me know if you have any questions or comments!

Fixes #406

Oct 08 '25 02:10 TheTechRobo

Hi! Is there anything I can do to make this PR ready for merge? I don't currently have a lot of time to work on this, but I should be able to work in any needed changes at some point.

Thank you!

Nov 29 '25 00:11 TheTechRobo

Sorry about the delay reviewing this. I'll review on Monday!

Nov 29 '25 01:11 mistydemeo