Crawlers should have an option to respect robots.txt
Currently you have to write your own function to parse and respect a target website's robots.txt file. A common function for that in the SDK (probably in utils.js) would be great.
I propose using the Robots Parser library, with these common functions in utils.js (a sketch follows the list):
- getRobotsTxt(url)
- isAllowedRobotsTxt(url, ua)
- isDisallowedRobotsTxt(url, ua)
- getCrawlDelayRobotsTxt(ua)
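For concreteness, here is a minimal sketch of how these helpers might be built on top of the robots-parser npm package. This is not Crawlee's API; the per-origin cache is a hypothetical detail, the crawl-delay helper takes a `url` in addition to `ua` so the parser knows which site's rules to consult, and the global `fetch` assumes Node 18+.

```js
// utils.js — hypothetical helpers built on the robots-parser npm package.
// Install with `npm install robots-parser`; requires Node 18+ for global fetch.
const robotsParser = require('robots-parser');

// Cache one parser per robots.txt URL so each site is fetched only once.
const cache = new Map();

async function getRobotsTxt(url) {
    const robotsUrl = new URL('/robots.txt', url).href;
    if (!cache.has(robotsUrl)) {
        const response = await fetch(robotsUrl);
        // A missing robots.txt means everything is allowed; parse an empty file.
        const body = response.ok ? await response.text() : '';
        cache.set(robotsUrl, robotsParser(robotsUrl, body));
    }
    return cache.get(robotsUrl);
}

async function isAllowedRobotsTxt(url, ua) {
    const robots = await getRobotsTxt(url);
    // robots-parser returns undefined when the URL is out of the file's scope;
    // treat that case as allowed.
    return robots.isAllowed(url, ua) !== false;
}

async function isDisallowedRobotsTxt(url, ua) {
    return !(await isAllowedRobotsTxt(url, ua));
}

async function getCrawlDelayRobotsTxt(url, ua) {
    const robots = await getRobotsTxt(url);
    return robots.getCrawlDelay(ua); // seconds, or undefined if not specified
}

module.exports = {
    getRobotsTxt,
    isAllowedRobotsTxt,
    isDisallowedRobotsTxt,
    getCrawlDelayRobotsTxt,
};
```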
Ok, so what is the status of this? It's not clear whether this has been addressed (or, if not, why).
It's not implemented. Not many users have requested it, and it's easy enough for the users who need it to implement themselves. We might add it in the future, but there's no timeline.
Is this documented somewhere? I'm not interested in handcrafting rules per URL, but rather in reading the robots.txt file and extracting the information about what to skip, to be respectful of the site owner's direction.
See the comment above by LeMoussel.
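For illustration, a sketch of how one might use such helpers to skip disallowed URLs before crawling them. The helper names come from the proposal above, not from Crawlee's public API, and the user agent string is a placeholder.

```js
// Hypothetical usage: filter candidate URLs through robots.txt before
// handing them to a crawler. Assumes the utils.js sketch above.
const { isAllowedRobotsTxt } = require('./utils');

const USER_AGENT = 'my-crawler/1.0'; // placeholder user agent

async function filterAllowed(urls) {
    const allowed = [];
    for (const url of urls) {
        if (await isAllowedRobotsTxt(url, USER_AGENT)) {
            allowed.push(url);
        } else {
            console.log(`Skipping ${url} (disallowed by robots.txt)`);
        }
    }
    return allowed;
}
```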