grab-site
Cloudflare-protected site responds with 503 Service Temporarily Unavailable
I installed grab-site on Ubuntu 20.04 using Nix.
The command I use is 'grab-site https://www.forexfactory.com/forums --concurrency=1'.
Crawls of example.com and other sites completed, but the crawl of 'https://www.forexfactory.com/' failed. I've also tried sub-addresses of the site.
Below is the log.
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/igsets
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/ignores
Connected to ws://127.0.0.1:29000
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/max_content_length
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/delay
Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/concurrency
/nix/store/12ip3ixhj0zbxy54pqqai0hssjrhgddg-python3.7-ludios_wpull-3.0.7/lib/python3.7/site-packages/wpull/protocol/http/client.py:185: UserWarning: HTTP session did not complete.
warnings.warn(_('HTTP session did not complete.'))
200 OK https://www.forexfactory.com/robots.txt
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/forums
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml
503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap-index.xml
Finished grab 022ebe7544a3c4163f989e95c54d3d54 https://www.forexfactory.com/forums with exit code 8
Output is in directory:
/home/user/www.forexfactory.com-forums-2021-10-30-022ebe75
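For anyone comparing logs like this one, a quick tally of the status lines shows how little got through. The sketch below is only an illustration: the `tally_statuses` helper and its regex are hypothetical, not part of grab-site, and they assume each response line has the shape "<status> <reason> <url>" as in the log above. The final exit code 8 appears to follow the wget convention for "server issued an error response", which fits a run made up almost entirely of 503s.

```python
import re
from collections import Counter

# Hypothetical pattern for response lines such as
# "503 Service Temporarily Unavailable https://www.forexfactory.com/forums".
STATUS_LINE = re.compile(r"^(\d{3}) .* (https?://\S+)$")


def tally_statuses(log_text):
    """Count HTTP status codes found in grab-site style log output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Imported /home/user/www.forexfactory.com-forums-2021-10-30-022ebe75/delay\n"
        "200 OK https://www.forexfactory.com/robots.txt\n"
        "503 Service Temporarily Unavailable https://www.forexfactory.com/sitemap.xml\n"
        "503 Service Temporarily Unavailable https://www.forexfactory.com/forums\n"
    )
    for status, count in sorted(tally_statuses(sample).items()):
        print(status, count)
```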
Are you sure that the site is up?
Also, are you sure that you aren't banned?
I can still open the site in regular Chrome with no problem at all.
Weird. Maybe the site requires JS and bans you if you don't have it?
Otherwise I don't know.
@TheTechRobo please don't speculate like this in the issues, try to reproduce the issue yourself if you're interested in it.
Anyway, I see
DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
in the resulting WARC when trying to crawl this forum.
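A rough way to check a response for the same marker, as a sketch only: the `looks_like_cloudflare_challenge` helper below is hypothetical, and it assumes the challenge page arrives with status 503 and contains the "DDoS protection by" text quoted above. Cloudflare's pages may differ on other sites, so treat it as a heuristic.

```python
def looks_like_cloudflare_challenge(status, body):
    """Heuristic: 503 response whose body carries the Cloudflare DDoS banner.

    `body` may be bytes (as read from a WARC record) or str.
    """
    if isinstance(body, bytes):
        text = body.decode("utf-8", errors="replace")
    else:
        text = body
    return (
        status == 503
        and "DDoS protection by" in text
        and "cloudflare" in text.lower()
    )


if __name__ == "__main__":
    banner = (
        'DDoS protection by <a rel="noopener noreferrer" '
        'href="https://www.cloudflare.com/5xx-error-landing/" '
        'target="_blank">Cloudflare</a>'
    )
    print(looks_like_cloudflare_challenge(503, banner))
```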
Cloudflare is known to block bots whose TLS fingerprint looks wrong. It is probably picking up on grab-site's TLS fingerprint, which does not match the browser it claims to be (Firefox). We might be able to fix that in ludios_wpull.
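To see one ingredient of such a fingerprint, here is a minimal sketch. It assumes the crawler makes its TLS connections through Python's standard `ssl` module, which I have not confirmed for ludios_wpull. Fingerprints like JA3 are built from the ClientHello (TLS version, cipher suites, extensions, curves), so the list of cipher suites a default Python context offers is one place where it can differ from what a real Firefox sends. The `default_cipher_names` helper is hypothetical.

```python
import ssl


def default_cipher_names(ctx=None):
    """Return the cipher suite names enabled on an SSLContext.

    With no argument, a default client context is inspected. The cipher
    list is only one input to a TLS fingerprint, not the whole thing.
    """
    if ctx is None:
        ctx = ssl.create_default_context()
    return [cipher["name"] for cipher in ctx.get_ciphers()]


if __name__ == "__main__":
    for name in default_cipher_names():
        print(name)
```

The output depends on the Python and OpenSSL versions installed, so it has to be compared against a capture from the actual browser rather than against a fixed list.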
@ivan Gotcha. :+1: