Received HTTP code 403 when trying to fetch a site using Cloudflare
I am trying to add the Roblox service and its documents with the following declaration:
{
"name": "Roblox",
"documents": {
"Privacy Policy": {
"fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
"select": [".article-body"],
"remove": [".wysiwyg-text-align-right img"]
},
"Terms of Service": {
"fetch": "https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use",
"select": [".article"],
"remove": [".article-relatives", ".article-footer"]
},
"Community Guidelines": {
"fetch": "https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules",
"select": [".article"],
"remove": [".article-footer", ".article-relatives"]
}
}
}
I get these Node error messages:
Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-'
Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use'
Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules'
I get the same error when trying to add the Coinbase documents with the following declaration:
{
"name": "Coinbase",
"documents": {
"Privacy Policy": {
"fetch": "https://www.coinbase.com/legal/privacy",
"select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
"remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
},
"Trackers Policy": {
"fetch": "https://www.coinbase.com/legal/cookie",
"select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
"remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
},
"Terms of Service": {
"fetch": "https://www.coinbase.com/legal/user_agreement/ireland_europe",
"select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
"remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
}
}
}
Content inacessible: Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.coinbase.com/legal/user_agreement/ireland_europe'
This is mainly because those sites use a service such as Cloudflare to screen their traffic.
Our scraping attempts are identified as bot traffic and are therefore blocked with a 403.
I tried all of the following, with no success:
- [X] change user agent
- [X] use proxies from free-list-proxies
- [X] adding referrer
- [X] adding referrer policy
- [X] use cloudflare-bypasser from github
So I suggest for now that you use "executeClientScripts"
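For illustration, here is a minimal sketch of what that could look like for one of the Roblox documents above. It assumes "executeClientScripts" is set per document as a boolean, next to "fetch" and "select". Whether it gets past the 403 for a given service would still need to be checked.

```json
{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"],
      "executeClientScripts": true
    }
  }
}
```

As far as I understand, with this option the page is loaded in a headless browser instead of through a plain HTTP request, so it is slower and probably best enabled only for the documents that need it.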
In the meantime, I've sent a support ticket to Cloudflare through my personal premium account. Let's see what they say:
Hi, My name is Martin Ratinaud, CTO at the French Embassy for Digital Affairs.
We are running the OpenSource project "Open Terms Archive" which aims at tracking ToS for every
service in the world, in all languages and all countries.
As such, we are implementing a crawler that tracks changes on ToS regularly.
We know we are currently blocked by your services and would like our bot to be trusted
by Cloudflare as a good bot (whitelisted) so that we are not blocked anymore
Thanks a lot
Check our websites here:
https://www.opentermsarchive.org/en
https://disinfo.quaidorsay.fr/en
And here is Cloudflare's response:
Hi there,
Thanks for contacting Cloudflare support. My name is Yuri and I will be looking into this ticket for you.
To add a bot to Cloudflare's allowlist, please submit this online application.
For more information, please see: Frequently asked questions about Cloudflare bot products
Please let us know if you have any further questions or issues.
Yuri | Cloudflare Support
Search the Cloudflare Community for advice and insight.
Online application: https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pROSc-uP6x65Xub9svD27mb8JChA_-XA/viewform
FAQ: https://support.cloudflare.com/hc/en-us/articles/360035387431-Frequently-asked-questions-about-Cloudflare-bot-products?source=search
@trujilloelsa @clementbiron @MattiSG I believe we should apply. What about you?
Yes ✔️
Validation request just submitted
Waiting for their answer
As we have not had any answer in 40 days, I created a new topic on the Cloudflare Community:
https://community.cloudflare.com/t/cloudflare-bot-verification-submitted-but-no-answer/320260
I'm not sure this is Cloudflare protection, but when running npm start Galeries Lafayette
I get
2022-02-22 16:19:18 warn Galeries Lafayette — Privacy Policy The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/service-confidence'
2022-02-22 16:19:18 warn Galeries Lafayette — Terms of Service The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/conditions-generals'
with the following declaration
{
"name": "Galeries Lafayette",
"documents": {
"Privacy Policy": {
"fetch": "https://www.galerieslafayette.com/service/service-confidence",
"select": [".mainContent"]
},
"Terms of Service": {
"fetch": "https://www.galerieslafayette.com/service/conditions-generals",
"select": [".mainContent"]
}
}
}
Same for
{
"name": "GO Sport",
"documents": {
"Commercial Terms": {
"fetch": "https://www.go-sport.com/cgv/",
"select": ["#content"]
},
"Privacy Policy": {
"fetch": "https://www.go-sport.com/charte-protection-donnees-clients/",
"select": ["#content"]
}
}
}
Same for this declaration https://github.com/OpenTermsArchive/declarations-france/commit/a0e6b465a74d2f60d5a48f014d5219801841c576
I'm not sure it's about Cloudflare protection, but the following declarations return a 403 error:
- Air Transat https://github.com/OpenTermsArchive/declarations-france/commit/421970fdf51e7378cd45d99a0cfcc54d1fe65bec
- Qatar Airways https://github.com/OpenTermsArchive/declarations-france/commit/17900b78a64aea9998ae799d4af2535ddcfd2d98
- Vélib https://github.com/OpenTermsArchive/france-declarations/commit/5aa5674190fdcb6b2d30c66561057cc7c8e252f4
We are not actively working on #166 at the moment. We will reopen it when we prioritise this work again. In the meantime, feel free to add any additional relevant information specific to Cloudflare to this issue.