headless-chrome-crawler
Crawler should honor the Crawl-Delay if obeyRobotsTxt:true
What is the current behavior?
The Crawl-Delay is ignored.
What is the expected behavior?
The Crawl-Delay should be honored. It can be retrieved using getCrawlDelay() on the robots parser.
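To illustrate the kind of lookup meant here, below is a minimal sketch of reading the Crawl-delay value for a given user agent from raw robots.txt text. It is a simplified stand-in written for this issue, not the robots parser's actual code: the function signature and the matching rules (a group naming the agent wins over the `*` group, first value wins) are assumptions.

```javascript
// Simplified, hypothetical stand-in for a robots parser's getCrawlDelay().
// Returns the Crawl-delay number for the user agent, or undefined if none applies.
function getCrawlDelay(robotsTxt, userAgent) {
  const ua = userAgent.toLowerCase();
  let agents = [];      // user agents named by the group currently being read
  let inRules = false;  // true once the current group has a non-User-agent line
  let specific;         // delay from a group that names our agent
  let wildcard;         // delay from the "*" group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // A User-agent line after rules starts a new group.
      if (inRules) {
        agents = [];
        inRules = false;
      }
      if (value) agents.push(value.toLowerCase());
      continue;
    }

    inRules = true;
    if (field !== 'crawl-delay' || value === '') continue;
    const delay = Number(value);
    if (!Number.isFinite(delay) || delay < 0) continue;

    if (agents.some((a) => a !== '*' && ua.includes(a))) {
      if (specific === undefined) specific = delay;
    } else if (agents.includes('*') && wildcard === undefined) {
      wildcard = delay;
    }
  }
  return specific !== undefined ? specific : wildcard;
}
```

The returned number carries no unit of its own; the crawler would still have to decide how to interpret it.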
What is the motivation / use case for changing the behavior?
A bot is bound to respect all the directives of the robots.txt.
@panthony Crawl-Delay is not part of the standard, so there is no way to tell whether the number is in seconds, minutes, hours or days.
Providing robots.txt is probably the direct solution to your use case: https://github.com/yujiosaka/headless-chrome-crawler/issues/192
@yujiosaka You are right, this is not part of the standard.
But it looks like everyone agrees that it is expected to be a number of seconds, and even if the crawler does not obey it out of the box, we should have some way to enforce it.
It would be sad to be banned from accessing a site because we did not obey their rules :)
I do not quite see how providing a robots.txt could be a solution.
Or did you mean that I could configure the delay of the crawler according to the robots.txt I provide?
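If that is what you meant, a rough sketch could look like the following. It assumes the Crawl-Delay value is in seconds (the common convention, not a standard), that the crawler's delay option is in milliseconds, and that a delay needs maxConcurrency set to 1. The helper name and the 60-second cap are made up for illustration.

```javascript
// Hypothetical helper: turn a Crawl-Delay value into crawler options.
// Assumption: the value is in seconds and the crawler's delay is in milliseconds.
function delayOptionsFromCrawlDelay(crawlDelaySeconds, maxDelayMs = 60000) {
  if (!Number.isFinite(crawlDelaySeconds) || crawlDelaySeconds <= 0) {
    return {}; // no usable directive: leave the crawler defaults alone
  }
  // Cap the delay so an extreme value cannot stall the crawl indefinitely.
  const delay = Math.min(Math.round(crawlDelaySeconds * 1000), maxDelayMs);
  // Assumption: a fixed delay is only meaningful with one request at a time.
  return { maxConcurrency: 1, delay };
}

// Possible usage (not verified against the crawler):
// HCCrawler.launch({ obeyRobotsTxt: true, ...delayOptionsFromCrawlDelay(10) });
```

This still leaves it to me to read the value from the robots.txt up front, rather than having the crawler pick it up per host.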