mwoffliner
Avoid being blocked by Cloudflare
mwoffliner version: 1.14.0
Task: https://farm.openzim.org/pipeline/1e755f21-4805-4cf8-8fa1-63fd5a5dc9d5/debug
Recipe: https://farm.openzim.org/recipes/cyclowiki.org_rus_all
Request: https://github.com/openzim/zim-requests/issues/9
Log:
[error] [2025-01-13T13:46:08.126Z] Failed to run mwoffliner after [1s]: {
"stack": "Error: mwUrl [https://cyclowiki.org] is not valid.\n at file:///tmp/mwoffliner/lib/sanitize-argument.js:134:15\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async sanitize_mwUrl (file:///tmp/mwoffliner/lib/sanitize-argument.js:133:5)\n at async sanitize_all (file:///tmp/mwoffliner/lib/sanitize-argument.js:55:5)",
"message": "mwUrl [https://cyclowiki.org] is not valid."
}
[error] [2025-01-13T13:46:08.127Z]
**********
mwUrl [https://cyclowiki.org] is not valid.
**********
Explanation: the first check of mwUrl seems to be failing. This could be because Cloudflare is protecting this website. To be investigated.
Yes, we have had this exact problem before with Cloudflare.
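To help with the investigation, here is a minimal sketch of how a failed check could be classified as a probable Cloudflare block rather than a genuinely invalid URL. The helper name `looksLikeCloudflareBlock` is hypothetical and not part of mwoffliner. It relies on the `cf-mitigated: challenge` response header that Cloudflare sets on challenge pages; treating a 403 or 503 served by `server: cloudflare` as a block is only a heuristic assumption.

```typescript
// Hypothetical helper, not part of mwoffliner: it only inspects a status
// code and a plain map of response headers.
type HeaderMap = Record<string, string | undefined>;

function looksLikeCloudflareBlock(status: number, headers: HeaderMap): boolean {
  // HTTP header names are case-insensitive, so normalise them first.
  const lower: HeaderMap = {};
  for (const name in headers) {
    lower[name.toLowerCase()] = headers[name];
  }

  // Cloudflare marks challenge pages with "cf-mitigated: challenge".
  const challenged = (lower["cf-mitigated"] ?? "").toLowerCase() === "challenge";

  // Heuristic (assumption): a 403/503 answered by Cloudflare's edge is
  // likely a block, even without the explicit challenge marker.
  const servedByCloudflare =
    (lower["server"] ?? "").toLowerCase().indexOf("cloudflare") !== -1;
  const blockedStatus = status === 403 || status === 503;

  return challenged || (servedByCloudflare && blockedStatus);
}
```

With something like this, the error message could distinguish "mwUrl is not valid" from "mwUrl is probably protected by Cloudflare".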
See #2039
Yeah, I saw this other issue, where we just skipped the test to solve it. Now that we have a repro, I suspect we might be able to do something by passing a proper User-Agent. At least this is what we managed to do in other scrapers. Not bullet-proof, but a "bad" User-Agent triggers Cloudflare protections much more easily. By "bad", I mean something which does not look at all like a browser.
Worth a try indeed. If it works, we should create an option for that.
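As a rough illustration of the idea, the sketch below sends the URL check with a browser-like User-Agent that can be overridden by the caller. All names here (`BROWSER_LIKE_UA`, `looksLikeBrowser`, `buildHeaders`, `checkMwUrl`) and the User-Agent string itself are assumptions made for the example; this is not how mwoffliner or the linked fix implements it. The HTTP call is injected as a parameter so the sketch does not depend on any particular client.

```typescript
// Hypothetical default; any realistic browser User-Agent string would do.
const BROWSER_LIKE_UA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

// Crude check: browser User-Agents start with "Mozilla/5.0 (<platform>)".
function looksLikeBrowser(userAgent: string): boolean {
  return /^Mozilla\/5\.0 \(.+\)/.test(userAgent);
}

// Build request headers, letting the caller override the User-Agent.
function buildHeaders(userAgent: string = BROWSER_LIKE_UA): Record<string, string> {
  return {
    "User-Agent": userAgent,
    Accept: "text/html,application/json;q=0.9,*/*;q=0.8",
  };
}

// Minimal shape of the HTTP function this sketch needs.
type Fetcher = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean }>;

// Returns true when the URL answers with a successful status.
async function checkMwUrl(
  mwUrl: string,
  fetcher: Fetcher,
  userAgent?: string,
): Promise<boolean> {
  try {
    const response = await fetcher(mwUrl, { headers: buildHeaders(userAgent) });
    return response.ok;
  } catch {
    // Network errors are treated as a failed check.
    return false;
  }
}
```

On Node.js 18 or later, the built-in `fetch` can be passed as the `fetcher` argument. Keeping the User-Agent as a parameter is what would make it easy to expose as an option.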
Fixed by https://github.com/openzim/mwoffliner/pull/2149
The issue is back with Appropedia (Cyclowiki is still fine).
Hi! We recently had to activate Cloudflare's standard bot protection, because we were receiving so many bot requests that they actually brought the site down.
The standard bot protection blocks all bots except verified non-AI bots. I just softened the protection to block only AI bots and specific problematic bots, so we should be good to go now.
Later down the road you may want to investigate what counts as a "Verified bot" to Cloudflare, to avoid future blocks by their standard bot protection. With the rise of AI, more and more sites may find the need to set up said protection, like we did.
Cheers!
Thanks a lot for removing the standard bot protection, and an even bigger thank you for all this insight into Cloudflare. I always imagined we might need to dig into this, but having first-hand information confirming that it might be possible to be whitelisted by Cloudflare is very valuable.
Glad to be of service! I just re-enabled protection, but only against AI bots. It shouldn't affect you unless Cloudflare falsely detects you as an AI bot. If it does, just let me know and I'll disable that protection again. Cheers!