selenium-wire
Getting blocked quickly by websites.
My proxies seem to be getting blocked by websites after switching to selenium-wire. I am using Linux. Previously, I used selenium with pyvirtualdisplay and added proxies through a Chrome extension. I then switched to headless selenium-wire, changed the User-Agent with a request interceptor, and added a proxy through the selenium-wire options.
I was blocked very quickly. I can confirm my proxy is blocked because I can no longer make a curl call to the site through it either. Is there a way for sites to detect selenium-wire through its own SSL certificate? Or could this be a screen size check in JavaScript, since I changed to headless?
Thanks for raising this. I'm not aware of a way for sites to detect the self-signed certificate, but I guess it's possible a mechanism may exist. But yes it's possible that some sort of digital fingerprinting is happening which checks the screen size - as you mention.
Just want to double-check a couple of other things: with changing the user-agent using a request interceptor, are you deleting the existing user-agent header first before replacing it? Otherwise you'll get two user-agent headers being sent, which may trigger the site block. Also, if you run with regular Chrome (as in non-headless mode), do you get blocked?
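For reference, the delete-then-replace pattern described above can be sketched as follows. This is a minimal illustration, not the poster's actual code: `FakeRequest` is a hypothetical stand-in built on the standard library's `HTTPMessage` so the snippet runs without a browser, and the User-Agent string is a made-up value.

```python
from http.client import HTTPMessage

# Illustrative value only; substitute the User-Agent you actually want to send.
NEW_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def interceptor(request):
    # Delete first: assigning to an existing header name adds a second
    # header with the same name instead of overwriting the first one.
    del request.headers["User-Agent"]
    request.headers["User-Agent"] = NEW_USER_AGENT


class FakeRequest:
    """Hypothetical stand-in for the intercepted request, for demonstration."""

    def __init__(self):
        self.headers = HTTPMessage()
        self.headers["User-Agent"] = "HeadlessChrome"


def duplicate_count_without_delete():
    # Shows what happens when the delete step is skipped.
    request = FakeRequest()
    request.headers["User-Agent"] = NEW_USER_AGENT
    return len(request.headers.get_all("User-Agent"))


# With a real Selenium Wire driver the function would be registered with:
#     driver.request_interceptor = interceptor
```

Skipping the `del` leaves both the original and the replacement header on the request, which is the situation the question above is trying to rule out.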
Yes, I did delete the previous user agent. I have run with these proxies for a long time with selenium, and I was eventually blocked with one set of proxies. But as I stated, it took no time at all for them to block these. It's hard to say whether the issue is the switch to headless or the difference in how selenium-wire handles requests compared with selenium, since I made both changes at the same time. It is hard to test because the site has flagged the proxy IPs completely and I can no longer hit their site from them in any way. I would need access to multiple proxies that I can just throw away in order to test different scenarios.
@ksmeeks0001 websites can see that you use selenium-wire. I'm getting blocked by Cloudflare protection when using it, while with the extension it works fine. I've tried with and without headless mode and neither helped, so the only way is to use the extension. But with the extension I have no idea how to intercept XHR requests, whereas without proxies selenium-wire can intercept everything. In my case, it intercepts only Google reCAPTCHA requests and nothing else. I've tried time.sleep and input(), which didn't help.
I'm not 100% sure exactly what in Selenium Wire is causing websites to trigger anti-bot measures, however there's a new bot-detection feature in version 4.1.1 which might be worth a try if you're still having issues. It's experimental at this stage, but I'll look at refining it based on feedback.
Some further info here. It seems that websites can in some cases detect that you are using Selenium Wire, even if you're using a browser implemented with measures to evade bot detection.
When you use Selenium Wire with capture switched on (the default), what actually happens is that Selenium Wire fools the browser into thinking that it is the target website, and then performs its own SSL handshake with the real website to retrieve the content. It does this so that it can sit in the middle and decrypt HTTPS requests and responses as they pass through. But it seems that some websites are able to see from the handshake that the client is not a browser, which triggers anti-bot measures such as throwing up captchas.
One way around this is to disable request capture in Selenium Wire using the disable_capture option, as this will also disable HTTPS decryption, allowing requests to pass straight through. This is useful if you only care about non-capture related functions such as proxy connectivity, but no use if you actually want to capture requests.
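As a rough sketch of that configuration, the options below turn capture off while keeping an upstream proxy. The proxy URL is a placeholder, `build_options` and `make_driver` are hypothetical helper names, and the exact effect of `disable_capture` may differ between Selenium Wire versions.

```python
def build_options(proxy_url=None, disable_capture=True):
    """Build a seleniumwire_options dict (helper name is illustrative)."""
    options = {"disable_capture": disable_capture}
    if proxy_url:
        options["proxy"] = {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    return options


def make_driver(options):
    # Imported lazily so build_options can be used without a browser installed.
    from seleniumwire import webdriver

    return webdriver.Chrome(seleniumwire_options=options)


if __name__ == "__main__":
    # Placeholder proxy address; replace with a real upstream proxy.
    opts = build_options("http://user:password@proxy.example.com:8080")
    driver = make_driver(opts)
    driver.get("https://example.com")
    driver.quit()
```

With capture disabled, `driver.requests` is not expected to contain anything useful, so this only suits runs that need the proxy and not the traffic.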
This is a fairly significant problem that may be touching the realms of SSL fingerprinting. I don't have a proper solution as yet, but I'll update if and when I find one. Additional info in #242
Hi, thanks for your reply. Probably yes. I can capture HTTPS traffic via a standard Chrome extension, like this: https://stackoverflow.com/questions/55582136/how-to-set-proxy-with-authentication-in-selenium-chromedriver-python and I'm using authenticated proxies. I've not tested proxies without a login/password. Also, if you want to investigate the problem, you can try to open this: http://shop.axs.com/?c=axs&e=49904939&t_locale=en-US which will display the Cloudflare protection. I've used the selenium-wire options for proxies. It works fine via my home IP, where I can capture requests, but it does not work via authenticated proxies. So maybe the issue is with authenticated proxies?
Not sure if this is related, but maybe the ability to add a self-signed certificate could help? Certain proxies offer one, so would it help if the client connects through the proxy and uses their certificate? I've used this before with requests and a Luminati proxy.
Yes good shout. Selenium Wire disables verification of upstream self-signed certificates by default. I'll have a look at reproducing with an upstream proxy that uses a self-signed certificate, but I'll also add that certificate to the local certificate store and see whether that makes any difference.
@wkeeling hello again. I read the issue about mitmproxy and fingerprinting. Does this mean that at the moment there is no way to bypass Cloudflare?
@rnyPlanet thanks for linking to that issue. Yes that looks to be the cause of this problem. I'll keep an eye on the development of that issue and see how it progresses.
@ultrafunkamsterdam https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/154#issuecomment-835769698 maybe you know about proxies and can help improve how your library and this one work together. That would be a killer combination.
@wkeeling Even with {"disable_capture": True}, I am unable to visit "https://nowsecure.nl/" and I still see the "Not secure" message. I am guessing this should disappear if capture is truly disabled.
@jlplenio the behaviour of the disable_capture option has changed since my comment was added. It no longer fully disables HTTPS decryption because otherwise upstream proxy functionality won't work.
With the "Not secure" message, have you installed Selenium Wire's root certificate in your browser? That message will disappear when the certificate is installed.
If you're on Linux, you can install the certificate on the command line with:
mkdir -p $HOME/.pki/nssdb
certutil -d sql:$HOME/.pki/nssdb -A -t TC -n "Selenium Wire" -i /path/to/ca.crt
Change /path/to/ca.crt to the path of the certificate once you've downloaded it.
Thank you, @wkeeling, for the fast and comprehensive response. The certificate worked. The proxy functionality is what I use selenium-wire for, so I will skip undetected-chromedriver for now.
Is there a way to restrict seleniumwire's behaviour so that it doesn't trigger the detection, but still be able to read the response data from a GET/POST request? Similar to how you would be able to read it in the Inspector of the browser.
@arisolt right now there is no way to do it unfortunately. Selenium Wire presents a different TLS fingerprint than a browser due to the way it uses HTTPS interception behind the scenes.
@wkeeling still no solution for this? Using proxies with Selenium Wire triggers Cloudflare.
For what it's worth, Squid in bump-in-the-wire mode (with a client that has the appropriate CA cert loaded) lets the proxy see the content, and doesn't trigger CloudFlare (with regular UC). You then need to convince the proxy to keep the data somehow (maybe with ICAP? or just different cache config) in a way that you can retrieve it from your testing script.
No progress yet? Selenium Wire and undetected-chromedriver are the most powerful combination, and they need to work together! 🍻
I am trying to capture the browsing traffic of several websites (on 'github.io') using selenium. The returned page is the same whether I use selenium or browse normally. However, the captured traffic (packet lengths) differs considerably between the two. Also, the packet lengths captured with selenium are relatively fixed. Does anyone have the same problem?
09.2022: seleniumwire.uc is still not working. Sadly, headless mode gets detected by Cloudflare as well.
@wkeeling I have the same issue when I access the GCP and AWS portals.
I tried with undetected-chromedriver but still have the same issue.
Hi, I think this is the tool used to detect the TLS fingerprint: https://github.com/cloudflare/mitmengine. It just compares the TLS fingerprint expected from a browser with a specific user agent to the one it receives. I wonder whether this could be solved if we could somehow find out which user agent Selenium Wire's TLS fingerprint matches most closely.
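To make the idea of comparing an expected fingerprint against the received one concrete, here is a small sketch. It does not reproduce mitmengine's own matching logic; it substitutes the publicly documented JA3 format (an MD5 over the ClientHello's version, cipher suites, extensions, curves and point formats) as an assumption. All numeric values and the `expected` table are invented for illustration.

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return a JA3-style MD5 digest of ClientHello parameters."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in g) for g in groups]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


def matches_user_agent(user_agent, observed_hash, expected):
    """Check an observed hash against the hashes expected for a User-Agent.

    `expected` maps a User-Agent substring to a set of acceptable hashes.
    Returns None when the User-Agent is not in the table.
    """
    for marker, hashes in expected.items():
        if marker in user_agent:
            return observed_hash in hashes
    return None


# Made-up ClientHello parameters: same values, different cipher order.
browser_hello = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23], [0])
proxy_hello = ja3_hash(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23], [0])

expected = {"Chrome/": {browser_hello}}
claimed_ua = "Mozilla/5.0 (X11; Linux x86_64) Chrome/100.0 Safari/537.36"
```

Here `matches_user_agent(claimed_ua, proxy_hello, expected)` is false even though the User-Agent claims Chrome, which is the kind of mismatch an intercepting proxy would produce.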
Hi there, my selenium wire is not detected under the normal mode, but when changed to the headless mode it is detected every time by datadome, any solution for this? Thanks.
Try it with undetected-chromedriver; Selenium Wire has built-in support for it.
Hi, is there any update on this issue? I am still getting blocked by Cloudflare when using Selenium Wire with proxies. Is there any solution for this? Thank you.
Hi, this won't work with Selenium Wire because its proxy changes the signature of the device. The best way to capture traffic is to use the Chrome DevTools Protocol with undetected-chromedriver. Here are the docs: https://chromedevtools.github.io/devtools-protocol/
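A minimal sketch of that approach, assuming a Chrome driver started with the `goog:loggingPrefs` capability set to `{"performance": "ALL"}` so that DevTools network events appear in the performance log. The function names are illustrative, and whether a particular undetected-chromedriver version exposes the log the same way as stock Selenium is an assumption to verify.

```python
import json


def response_events(log_entries):
    """Yield (request_id, url, status) for each Network.responseReceived event."""
    for entry in log_entries:
        message = json.loads(entry["message"])["message"]
        if message.get("method") == "Network.responseReceived":
            params = message["params"]
            response = params["response"]
            yield params["requestId"], response["url"], response["status"]


def capture_bodies(driver, url):
    """Load a page and collect response bodies through DevTools commands."""
    driver.get(url)
    bodies = {}
    events = response_events(driver.get_log("performance"))
    for request_id, response_url, _status in events:
        try:
            result = driver.execute_cdp_cmd(
                "Network.getResponseBody", {"requestId": request_id}
            )
        except Exception:
            continue  # the body may no longer be available for this request
        bodies[response_url] = result["body"]
    return bodies
```

Because the browser itself makes every connection, no intercepting proxy sits between it and the site, so the TLS handshake stays the browser's own.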