
Cloudflare-protected site responds with 503 Service Temporarily Unavailable

Open rmfkdehd opened this issue 2 years ago • 5 comments

I installed grab-site on Ubuntu 20.04 using Nix.

The command I used is 'grab-site https://www.forexfactory.com/forums --concurrency=1'.

Example.com and other sites crawled successfully, but the crawl of 'https://www.forexfactory.com/' failed. I also tried sub-pages of the site.

Below is the log.

Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/igsets
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/ignores
Connected to ws://127.0.0.1:29000
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/max_content_length
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/delay
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/concurrency
/nix/store/12ip3ixhj0zbxy54pqqai0hssjrhgddg-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
  warnings.warn(_('HTTP session did not complete.'))
200 OK https://www.forexfactory.com/robots.txt
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
Finished grab 022ebe7544a3c4163f989e95c54d3d54 https://www.forexfactory.com/forums with exit code 8
Output is in directory:
/home/user/www.forexfactory.com-forums-2021-10-30-022ebe75

rmfkdehd avatar Oct 30 '21 15:10 rmfkdehd

You sure that the site is up?

Also, are you sure that you aren't banned?

TheTechRobo avatar Oct 30 '21 16:10 TheTechRobo

I can still open the site in regular Chrome with no problem at all.

rmfkdehd avatar Oct 31 '21 00:10 rmfkdehd

Weird. Maybe the site requires JS and bans you if you don't have it?

otherwise idk

TheTechRobo avatar Oct 31 '21 02:10 TheTechRobo

@TheTechRobo please don't speculate like this in the issues, try to reproduce the issue yourself if you're interested in it.

Anyway, I see

DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>

in the resulting WARC when trying to crawl this forum.
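As a rough illustration of how that marker can be spotted, here is a minimal sketch of a heuristic check. It is not part of grab-site or ludios_wpull; the function name `looks_like_cloudflare_block` is hypothetical, and the only assumptions are the 503 status from the log above and the "DDoS protection by ... Cloudflare" text found in the WARC.

```python
def looks_like_cloudflare_block(status: int, body: str) -> bool:
    """Heuristic: True if a response resembles the Cloudflare page seen in this issue.

    Hypothetical helper for illustration; it only checks the 503 status and
    the marker text quoted from the WARC, so it can miss other block pages.
    """
    if status != 503:
        return False
    lowered = body.lower()
    return "ddos protection by" in lowered and "cloudflare" in lowered


# The marker quoted from the WARC in this issue.
sample = (
    'DDoS protection by <a rel="noopener noreferrer" '
    'href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>'
)

print(looks_like_cloudflare_block(503, sample))
print(looks_like_cloudflare_block(200, "<html>ok</html>"))
```

A check like this only helps tell a Cloudflare block apart from a genuine outage; it does nothing to get past the block.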

Cloudflare is known to block bots that send the wrong TLS fingerprint. It is probably picking up on grab-site's TLS fingerprint, which does not match that of the browser it claims to be (Firefox). We might be able to fix that in ludios_wpull.
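To make the mismatch more concrete: TLS fingerprints are generally derived from fields of the ClientHello, such as the ordered list of cipher suites the client offers, and those come from the TLS library rather than from the User-Agent header. The sketch below only lists what Python's standard `ssl` module offers by default on the local machine. It is an illustration, not how ludios_wpull configures TLS and not how Cloudflare computes its fingerprint; `offered_cipher_names` is a hypothetical helper name.

```python
import ssl
from typing import List


def offered_cipher_names() -> List[str]:
    """Return the cipher suite names enabled in Python's default TLS context.

    Hypothetical helper for illustration. The list depends on the local
    Python and OpenSSL build, so it will differ between machines.
    """
    context = ssl.create_default_context()
    return [cipher["name"] for cipher in context.get_ciphers()]


names = offered_cipher_names()
print(len(names), "cipher suites offered by the default context")
for name in names[:5]:
    print(name)
```

Changing the User-Agent string does not change this list, which is why a client can claim to be Firefox while still presenting a handshake that looks nothing like Firefox's.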

ivan avatar Oct 31 '21 02:10 ivan

> @TheTechRobo please don't speculate like this in the issues, try to reproduce the issue yourself if you're interested in it.

@ivan Gotcha. :+1:

TheTechRobo avatar Oct 31 '21 03:10 TheTechRobo