Received HTTP code 403 when trying to fetch a site using Cloudflare

Open clementbiron opened this issue 3 years ago • 10 comments

I am trying to add the Roblox service and its documents with the following declaration:

{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"]
    },
    "Terms of Service": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use",
      "select": [".article"],
      "remove": [".article-relatives", ".article-footer"]
    },
    "Community Guidelines": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules",
      "select": [".article"],
      "remove": [".article-footer", ".article-relatives"]
    }
  }
}

I get these Node error messages:

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-'

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/115004647846-Roblox-Terms-of-Use'

Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://en.help.roblox.com/hc/en-us/articles/203313410-Roblox-Community-Rules'

clementbiron avatar Aug 16 '21 13:08 clementbiron

I get the same error when trying to add Coinbase documents with the following declaration:

{
  "name": "Coinbase",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://www.coinbase.com/legal/privacy",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    },
    "Trackers Policy": {
      "fetch": "https://www.coinbase.com/legal/cookie",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    },
    "Terms of Service": {
      "fetch": "https://www.coinbase.com/legal/user_agreement/ireland_europe",
      "select": [".ComposePageLayout__ContentWrapper-sc-109zw5h-2"],
      "remove": [".SidebarNav__NavigationLinksList-sc-1c3jy97-1"]
    }
  }
}

Content inacessible: Error: The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.coinbase.com/legal/user_agreement/ireland_europe'

clementbiron avatar Aug 19 '21 13:08 clementbiron

This is mainly because those sites use a service like Cloudflare to screen their traffic.

Our scraping requests are classified as coming from a bot and are therefore rejected with a 403.

I tried all of the following, with no success:

  • [X] change user agent
  • [X] use proxies from free-list-proxies
  • [X] adding referrer
  • [X] adding referrer policy
  • [X] use cloudflare-bypasser from GitHub

So for now, I suggest that you use "executeClientScripts".
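
As a minimal sketch of that suggestion, the Roblox Privacy Policy entry from the first declaration could look like the following. It assumes that "executeClientScripts" is a boolean set on each document, which makes the engine load the page in a headless browser instead of issuing a plain HTTP request. Nothing in this thread confirms that this gets past the 403 on every affected site.

```json
{
  "name": "Roblox",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://en.help.roblox.com/hc/en-us/articles/115004630823-Roblox-Privacy-and-Cookie-Policy-",
      "select": [".article-body"],
      "remove": [".wysiwyg-text-align-right img"],
      "executeClientScripts": true
    }
  }
}
```

The other documents in the declaration would need the same key added individually, since under this assumption the option applies per document rather than per service.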

In the meantime, I've sent a support ticket to Cloudflare through my personal premium account. Let's see what they say:

Hi, My name is Martin Ratinaud, CTO at the French Embassy for Digital Affairs.  

We are running the OpenSource project "Open Terms Archive" which aims at tracking ToS for every 
service in the world, in all languages and all countries.  
As such, we are implementing a crawler that tracks changes on ToS regularly.  
We know we are currently blocked by your services and would like our bot to be trusted 
by Cloudflare as a good bot (whitelisted) so that we are not blocked anymore 

Thanks a lot

Check our websites here: 
https://www.opentermsarchive.org/en 
https://disinfo.quaidorsay.fr/en

martinratinaud avatar Aug 26 '21 06:08 martinratinaud

And here is Cloudflare's response:

Hi there,

Thanks for contacting Cloudflare support. My name is Yuri and I will be looking into this ticket for you.

To add a bot to Cloudflare's allowlist, please submit this online application.

For more information, please see: Frequently asked questions about Cloudflare bot products

Please let us know if you have any further questions or issues.

Yuri | Cloudflare Support
Search the Cloudflare Community for advice and insight.

Online application: https://docs.google.com/forms/d/e/1FAIpQLSdqYNuULEypMnp4i5pROSc-uP6x65Xub9svD27mb8JChA_-XA/viewform
FAQ: https://support.cloudflare.com/hc/en-us/articles/360035387431-Frequently-asked-questions-about-Cloudflare-bot-products?source=search

@trujilloelsa @clementbiron @MattiSG I believe we should apply. What about you?

martinratinaud avatar Aug 30 '21 03:08 martinratinaud

Yes ✔️

clementbiron avatar Aug 30 '21 06:08 clementbiron

Bot verification request just submitted.

Waiting for their answer

martinratinaud avatar Sep 16 '21 07:09 martinratinaud

As we have not received any answer in 40 days, I created a new topic on the Cloudflare Community:

https://community.cloudflare.com/t/cloudflare-bot-verification-submitted-but-no-answer/320260

martinratinaud avatar Oct 27 '21 05:10 martinratinaud

I'm not sure this is Cloudflare protection, but when running npm start Galeries Lafayette, I get:

2022-02-22 16:19:18 warn  Galeries Lafayette — Privacy Policy                     The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/service-confidence'
2022-02-22 16:19:18 warn  Galeries Lafayette — Terms of Service                   The document cannot be accessed or its content can not be selected: Received HTTP code 403 when trying to fetch 'https://www.galerieslafayette.com/service/conditions-generals'

with the following declaration:

{
  "name": "Galeries Lafayette",
  "documents": {
    "Privacy Policy": {
      "fetch": "https://www.galerieslafayette.com/service/service-confidence",
      "select": [".mainContent"]
    },
    "Terms of Service": {
      "fetch": "https://www.galerieslafayette.com/service/conditions-generals",
      "select": [".mainContent"]
    }
  }
}

clementbiron avatar Feb 22 '22 15:02 clementbiron

Same for the following declaration:

{
  "name": "GO Sport",
  "documents": {
    "Commercial Terms": {
      "fetch": "https://www.go-sport.com/cgv/",
      "select": ["#content"]
    },
    "Privacy Policy": {
      "fetch": "https://www.go-sport.com/charte-protection-donnees-clients/",
      "select": ["#content"]
    }
  }
}

clementbiron avatar Feb 22 '22 15:02 clementbiron

Same for this declaration: https://github.com/OpenTermsArchive/declarations-france/commit/a0e6b465a74d2f60d5a48f014d5219801841c576

clementbiron avatar Feb 23 '22 07:02 clementbiron

I'm not sure it's about Cloudflare protection, but the following declarations return a 403 error:

  • Air Transat https://github.com/OpenTermsArchive/declarations-france/commit/421970fdf51e7378cd45d99a0cfcc54d1fe65bec
  • Qatar Airways https://github.com/OpenTermsArchive/declarations-france/commit/17900b78a64aea9998ae799d4af2535ddcfd2d98
  • Vélib https://github.com/OpenTermsArchive/france-declarations/commit/5aa5674190fdcb6b2d30c66561057cc7c8e252f4

clementbiron avatar Mar 22 '22 15:03 clementbiron

We do not actively work on #166 at the moment. We will reopen it when we prioritise this work again. In the meantime, feel free to add any additional relevant information specific to Cloudflare to this issue.

MattiSG avatar Apr 24 '23 09:04 MattiSG