Avoid being blocked by Cloudflare

Open benoit74 opened this issue 10 months ago • 9 comments

mwoffliner version : 1.14.0

Task: https://farm.openzim.org/pipeline/1e755f21-4805-4cf8-8fa1-63fd5a5dc9d5/debug
Recipe: https://farm.openzim.org/recipes/cyclowiki.org_rus_all
Request: https://github.com/openzim/zim-requests/issues/9

Log:

[error] [2025-01-13T13:46:08.126Z] Failed to run mwoffliner after [1s]: {
	"stack": "Error: mwUrl [https://cyclowiki.org] is not valid.\n    at file:///tmp/mwoffliner/lib/sanitize-argument.js:134:15\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async sanitize_mwUrl (file:///tmp/mwoffliner/lib/sanitize-argument.js:133:5)\n    at async sanitize_all (file:///tmp/mwoffliner/lib/sanitize-argument.js:55:5)",
	"message": "mwUrl [https://cyclowiki.org] is not valid."
}
[error] [2025-01-13T13:46:08.127Z] 

**********

mwUrl [https://cyclowiki.org] is not valid.

**********

Explanation: the first check of mwUrl seems to be failing. This could be because Cloudflare is protecting this website. To be investigated.
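One way to investigate could be to check whether the failing response carries Cloudflare's markers. This is only a sketch, not mwoffliner code: `looksLikeCloudflareBlock` is a hypothetical helper, and it assumes that Cloudflare sets a `cf-mitigated: challenge` header on challenge pages and a `server: cloudflare` header on the responses it proxies.

```typescript
// Hypothetical helper, not part of mwoffliner.
// Assumption: Cloudflare marks challenge pages with `cf-mitigated: challenge`
// and the responses it proxies with `server: cloudflare`.
function looksLikeCloudflareBlock(status: number, headers: Headers): boolean {
  // Headers.get() matches names case-insensitively and returns null when absent.
  if (headers.get("cf-mitigated")?.toLowerCase() === "challenge") {
    return true;
  }
  const servedByCloudflare = headers.get("server")?.toLowerCase() === "cloudflare";
  // A 403 or 503 served through Cloudflare is only a weaker hint:
  // it can also be an origin error that Cloudflare merely relays.
  return servedByCloudflare && (status === 403 || status === 503);
}
```

With something like this, the mwUrl check could report "possibly blocked by Cloudflare" rather than the generic "is not valid" message.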

benoit74 avatar Jan 13 '25 13:01 benoit74

Yes, we have had this exact problem with Cloudflare before.

audiodude avatar Jan 14 '25 03:01 audiodude

See #2039

audiodude avatar Jan 14 '25 03:01 audiodude

Yeah, I saw this other issue, where we just skipped the test to solve it. Now that we have a repro, I suspect we might be able to do something by passing a proper User-Agent. At least, this is what we managed to do in other scrapers. Not bulletproof, but a "bad" User-Agent triggers Cloudflare protections much more easily. By "bad", I mean something which does not look at all like a browser.
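To make the idea concrete, here is a minimal sketch of sending a browser-like User-Agent with the validity check. It is not the code from mwoffliner or from the other scrapers: `buildRequestHeaders`, `checkMwUrl` and the User-Agent string are all hypothetical, and it assumes a runtime with a global `fetch` (Node 18+).

```typescript
// Hypothetical example value: any string that resembles a real browser would do.
const BROWSER_LIKE_USER_AGENT =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// Build request headers; the caller may override the User-Agent,
// which is where a command-line option could plug in.
function buildRequestHeaders(userAgent: string = BROWSER_LIKE_USER_AGENT): Record<string, string> {
  return {
    "User-Agent": userAgent,
    Accept: "text/html,application/json;q=0.9,*/*;q=0.8",
  };
}

// Hypothetical stand-in for the mwUrl check: true when the site answers with a 2xx status.
async function checkMwUrl(mwUrl: string, userAgent?: string): Promise<boolean> {
  const response = await fetch(mwUrl, { headers: buildRequestHeaders(userAgent) });
  return response.ok;
}
```

Keeping the User-Agent as a parameter means it can be changed per recipe if a given site still blocks the default.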

benoit74 avatar Jan 14 '25 06:01 benoit74

> Not bulletproof, but a "bad" User-Agent triggers Cloudflare protections much more easily. By "bad", I mean something which does not look at all like a browser.

Worth a try indeed. If it works, we should create an option for that.

kelson42 avatar Jan 14 '25 08:01 kelson42

Fixed by https://github.com/openzim/mwoffliner/pull/2149

benoit74 avatar Feb 19 '25 17:02 benoit74

Issue is back with Appropedia (Cyclowiki is still fine)

benoit74 avatar Feb 20 '25 10:02 benoit74

Hi! We recently had to activate Cloudflare's standard bot protection, because we were receiving so many bot requests that they actually brought the site down.

The standard bot protection blocks all bots except verified non-AI bots. I just softened the protection to block only AI bots and specific problematic bots, so we should be good to go now.

Later down the road you may want to investigate what counts as a "Verified bot" to Cloudflare, to avoid future blocks by their standard bot protection. With the rise of AI, more and more sites may find the need to set up said protection, like we did.

Cheers!

Sophivorus avatar Feb 20 '25 13:02 Sophivorus

Thanks a lot for removing the standard bot protection, and an even bigger thank you for all this insight into Cloudflare. I always imagined we might need to dig into this, but having first-hand information confirming that it might be possible to be whitelisted by Cloudflare is very valuable.

benoit74 avatar Feb 20 '25 13:02 benoit74

Glad to be of service! I just re-enabled protection, but only against AI bots. It shouldn't affect you unless Cloudflare falsely detects you as an AI bot. If they do, just let me know and I'll disable that protection again. Cheers!

Sophivorus avatar Feb 20 '25 14:02 Sophivorus