Crawlers should have an option to respect robots.txt
Currently you have to write your own function to parse and respect a target website's robots.txt file. A common function for that in the SDK (probably in utils.js) would be great.
I propose using the Robots Parser library, with these common functions in utils.js (a sketch follows the list):
- getRobotsTxt(url)
- isAllowedRobotsTxt(url, ua)
- isDisallowedRobotsTxt(url, ua)
- getCrawlDelayRobotsTxt(ua)
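For concreteness, here is a minimal sketch of how these helpers might be built on top of the robots-parser npm package. This is not Crawlee's API; the per-origin cache is a hypothetical detail, the crawl-delay helper takes a `url` in addition to `ua` so the parser knows which site's rules to consult, and the global `fetch` assumes Node 18+.

```js
// utils.js — hypothetical helpers built on the robots-parser npm package.
// Install with `npm install robots-parser`; requires Node 18+ for global fetch.
const robotsParser = require('robots-parser');

// Cache one parser per robots.txt URL so each site is fetched only once.
const cache = new Map();

async function getRobotsTxt(url) {
    const robotsUrl = new URL('/robots.txt', url).href;
    if (!cache.has(robotsUrl)) {
        const response = await fetch(robotsUrl);
        // A missing robots.txt means everything is allowed; parse an empty file.
        const body = response.ok ? await response.text() : '';
        cache.set(robotsUrl, robotsParser(robotsUrl, body));
    }
    return cache.get(robotsUrl);
}

async function isAllowedRobotsTxt(url, ua) {
    const robots = await getRobotsTxt(url);
    // robots-parser returns undefined when the URL is out of the file's scope;
    // treat that case as allowed.
    return robots.isAllowed(url, ua) !== false;
}

async function isDisallowedRobotsTxt(url, ua) {
    return !(await isAllowedRobotsTxt(url, ua));
}

async function getCrawlDelayRobotsTxt(url, ua) {
    const robots = await getRobotsTxt(url);
    return robots.getCrawlDelay(ua); // seconds, or undefined if not specified
}

module.exports = {
    getRobotsTxt,
    isAllowedRobotsTxt,
    isDisallowedRobotsTxt,
    getCrawlDelayRobotsTxt,
};
```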
Ok, so what is the status of this? It's not clear whether this has been addressed (or, if not, why).
It's not implemented. Not many users have requested it, and it's easy enough for the users who need it to implement themselves. We might add it in the future, but there's no timeline.
Is this documented somewhere? I'm not interested in handcrafting rules per URL, but rather in reading the robots.txt file and extracting the information about what to skip, to be respectful of the site owner's direction.
See the comment above by LeMoussel.
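For illustration, a sketch of how one might use such helpers to skip disallowed URLs before crawling them. The helper names come from the proposal above, not from Crawlee's public API, and the user agent string is a placeholder.

```js
// Hypothetical usage: filter candidate URLs through robots.txt before
// handing them to a crawler. Assumes the utils.js sketch above.
const { isAllowedRobotsTxt } = require('./utils');

const USER_AGENT = 'my-crawler/1.0'; // placeholder user agent

async function filterAllowed(urls) {
    const allowed = [];
    for (const url of urls) {
        if (await isAllowedRobotsTxt(url, USER_AGENT)) {
            allowed.push(url);
        } else {
            console.log(`Skipping ${url} (disallowed by robots.txt)`);
        }
    }
    return allowed;
}
```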