
Not working when deployed (Google Cloud). TimeoutError: waiting for selector `.cf-browser-verification` to be hidden failed: timeout 30000ms exceeded

mlarcher opened this issue 3 years ago · 4 comments

This is what I get running on GCP using `offersByScrolling`: ``TimeoutError: waiting for selector `.cf-browser-verification` to be hidden failed: timeout 30000ms exceeded``. It seems to work sometimes and fail with this error other times. Any idea what's happening there?

mlarcher avatar Feb 14 '22 22:02 mlarcher

Waiting for `.cf-browser-verification` to be hidden means that you are on the Cloudflare page (cf = Cloudflare) and are not redirected to the actual OpenSea page within 30 seconds. Most likely OpenSea detects that you run the scraper from a Google Cloud IP, and the Cloudflare loop kicks in: the page refreshes in an endless loop, asking you to wait while it resolves, which it never does.
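
For reference, this is roughly what happens under the hood (a sketch, not the exact code from this repo; the selector and timeout are taken from the error message, and `page` is a puppeteer Page):

```js
async function waitForCloudflare(page) {
  try {
    // Resolves once the Cloudflare interstitial disappears, i.e. once
    // you have been redirected to the real opensea.io page.
    await page.waitForSelector(".cf-browser-verification", {
      hidden: true,
      timeout: 30000,
    });
  } catch (err) {
    // On a flagged datacenter IP the interstitial never resolves and this
    // timeout fires. A screenshot helps confirm what the page actually shows.
    await page.screenshot({ path: "cloudflare-stuck.png" });
    throw err;
  }
}
```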

I have no way around that currently; deploying scrapers on cloud infrastructure is difficult.

If you (or someone else) have ideas, please share; it's a very common problem.

One solution that might work, but is costly, is using a service like Bright Data (a proxy with an unblocker API); see the sketch below.
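
For anyone who wants to try the proxy route, the puppeteer side would look roughly like this (host, port, and credentials are placeholders, not real endpoints; check your provider's docs for actual values):

```js
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    // Route all browser traffic through the proxy (placeholder address).
    args: ["--proxy-server=http://your-proxy-host:24000"],
  });
  const page = await browser.newPage();
  // Most paid proxies require per-page authentication (placeholder creds).
  await page.authenticate({
    username: "your-proxy-username",
    password: "your-proxy-password",
  });
  await page.goto("https://opensea.io");
  await browser.close();
})();
```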

dcts avatar Feb 15 '22 04:02 dcts

UPDATE: When running on GCP we now see the ``TimeoutError: waiting for selector `.cf-browser-verification` to be hidden failed: timeout 30000ms exceeded`` error less frequently, but when we don't get the error we end up with an empty offers list and stats, i.e.:

offers: []
stats: {}

I hope this will be fixed by v7's new approach 🤞
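
For now we work around the empty results with a retry wrapper along these lines (a sketch; it assumes `offersByScrolling` resolves to an object with `offers` and `stats` as shown above, so adjust to the actual API):

```js
const OpenseaScraper = require("opensea-scraper");

async function offersWithRetry(slug, resultSize, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await OpenseaScraper.offersByScrolling(slug, resultSize);
    // Treat an empty offers list as a soft failure and try again.
    if (result.offers && result.offers.length > 0) return result;
    console.warn(`attempt ${attempt}: empty offers, retrying...`);
  }
  throw new Error(`still empty after ${maxAttempts} attempts`);
}
```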

mlarcher avatar Mar 23 '22 22:03 mlarcher

REPORT FROM @mlarcher:

I dug a bit into the code and set up a test case... It seems that on GCP I'm stuck on a page that says:

Checking your browser before accessing opensea.io.
This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds…
DDoS protection by [Cloudflare](https://www.cloudflare.com/5xx-error-landing/)

:(

From what I gathered:

  • I'm facing the "Cloudflare Browser Integrity Check" (see https://support.cloudflare.com/hc/en-us/articles/200170086-Understanding-the-Cloudflare-Browser-Integrity-Check)
  • There are some hints on how to avoid being detected: https://stackoverflow.com/a/56529616/263440 (see also the stealth sketch after this list)
  • But in https://stackoverflow.com/questions/62751377/bypass-cloudflare-with-puppeteer we read that "They will throw up a captcha if the ip is suspicious. Probably any datacenter ip would get one."
  • Some workarounds might exist, but they involve manual intervention (see https://stackoverflow.com/questions/62751377/bypass-cloudflare-with-puppeteer#comment117761360_62751377).
  • Comments following that one highlight that this is an arms race, with a very comfortable budget being spent to prevent bypassing this kind of restriction, so it's somewhat hopeless to imagine fully automating scraping at production scale...
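
For the detection hints mentioned in the second bullet, a minimal sketch using puppeteer-extra and its stealth plugin (this reduces automation fingerprints but will not help if the IP itself is flagged as a datacenter IP):

```js
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

// Patch common headless-detection vectors (navigator.webdriver, etc.).
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://opensea.io", { waitUntil: "networkidle2" });
  console.log(await page.title());
  await browser.close();
})();
```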

All in all this doesn't look too good, but it's not directly related to the current library. Let me know if you have expertise on the matter and know some other way to tackle the problem, though :)

dcts avatar Apr 12 '22 12:04 dcts

Bypassing Cloudflare is definitely not my expertise. I have tried to solve this problem for some time now, and it is definitely possible, but as you mentioned it's an arms race. I tried these packages:

  • cloudflare-scraper (JS): did not work for me. It seems it's no longer maintained.
  • cloudscraper (Python): I managed to set up a Google Cloud Run environment with Python and successfully get past Cloudflare, roughly 3 months ago. To make it work with OpenseaScraper you could either fetch the HTML through Python only and then extract the top 32 offers with the code provided in this repo (see the sketch below), or rewrite everything in pyppeteer, but that is just an idea and I am not even sure it would work.
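
The Python-fetch-then-Node-parse handoff could look roughly like this (`fetch_with_cloudscraper.py` is a hypothetical helper script, not part of this repo; it would essentially just do `print(cloudscraper.create_scraper().get(sys.argv[1]).text)`):

```js
const { execFileSync } = require("child_process");

// Shell out to a small Python script that uses cloudscraper to get past
// the Cloudflare check and prints the raw HTML to stdout.
function fetchHtmlViaCloudscraper(url) {
  return execFileSync("python3", ["fetch_with_cloudscraper.py", url], {
    encoding: "utf8",
    maxBuffer: 50 * 1024 * 1024, // OpenSea pages can be large
  });
}

const html = fetchHtmlViaCloudscraper("https://opensea.io/collection/cool-cats-nft");
// ...hand `html` to the offer-extraction code from this repo here.
```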

dcts avatar Apr 12 '22 12:04 dcts