
Crawlers should have an option to respect robots.txt

Open jakubbalada opened this issue 7 years ago • 5 comments

Currently you have to write your own function to parse and respect a target website's robots.txt file. A common helper for that in the SDK (probably in utils.js) would be great.

jakubbalada avatar Nov 13 '18 09:11 jakubbalada

I propose using the Robots Parser library, exposed through these common functions in utils.js:

  • getRobotsTxt(url)
  • isAllowedRobotsTxt(url, ua)
  • isDisallowedRobotsTxt(url, ua)
  • getCrawlDelayRobotsTxt(ua)
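
A minimal, dependency-free sketch of what these helpers might look like (the names mirror the proposal above; a real implementation would more likely wrap the robots-parser npm package, which additionally handles wildcards, `$` anchors, and malformed records — this sketch does simple prefix matching only):

```javascript
// Fetch a site's robots.txt and parse it (Node 18+ global fetch assumed).
async function getRobotsTxt(url) {
  const robotsUrl = new URL('/robots.txt', url).href;
  const res = await fetch(robotsUrl);
  return parseRobotsTxt(res.ok ? await res.text() : '');
}

// Parse robots.txt contents into { [userAgent]: { rules, crawlDelay } }.
function parseRobotsTxt(contents) {
  const groups = {};
  let agents = [];
  let seenRule = false;
  for (const rawLine of contents.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      if (seenRule) { agents = []; seenRule = false; } // new record group
      const ua = value.toLowerCase();
      agents.push(ua);
      groups[ua] = groups[ua] || { rules: [], crawlDelay: null };
    } else if (field === 'allow' || field === 'disallow') {
      seenRule = true;
      if (value === '') continue; // an empty Disallow allows everything
      for (const ua of agents) {
        groups[ua].rules.push({ allow: field === 'allow', path: value });
      }
    } else if (field === 'crawl-delay') {
      seenRule = true;
      for (const ua of agents) groups[ua].crawlDelay = Number(value);
    }
  }
  return groups;
}

// Longest matching prefix rule wins; no matching rule means allowed.
function isAllowedRobotsTxt(robots, url, ua = '*') {
  const path = new URL(url).pathname;
  const group = robots[ua.toLowerCase()] || robots['*'];
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (path.startsWith(rule.path) && (!best || rule.path.length > best.path.length)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}

function isDisallowedRobotsTxt(robots, url, ua = '*') {
  return !isAllowedRobotsTxt(robots, url, ua);
}

function getCrawlDelayRobotsTxt(robots, ua = '*') {
  const group = robots[ua.toLowerCase()] || robots['*'];
  return group ? group.crawlDelay : null;
}
```

With helpers like these, a crawler could call `isDisallowedRobotsTxt` on each URL before enqueueing it and honor `getCrawlDelayRobotsTxt` between requests to the same host.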

LeMoussel avatar Feb 21 '19 07:02 LeMoussel

OK, so what is the status of this? It's not clear where this has been addressed (or why it hasn't been yet).

mgifford avatar Oct 26 '21 14:10 mgifford

It's not implemented. Not many users have requested it, and it's easy enough for those who need it to implement themselves. We might add it in the future, but there's no timeline.

mnmkng avatar Oct 27 '21 06:10 mnmkng

Is this documented somewhere? I'm not interested in handcrafting rules per URL; I'd rather read the robots.txt file and extract what to skip, to be respectful of the site owner's direction.

mgifford avatar Oct 27 '21 12:10 mgifford

See the comment above by LeMoussel.

mnmkng avatar Oct 27 '21 16:10 mnmkng