headless-chrome-crawler Crawler should honor the Crawl-Delay if obeyRobotsTxt:true

Open panthony opened this issue 6 years ago • 2 comments

What is the current behavior?

The Crawl-Delay is ignored.

What is the expected behavior?

The Crawl-Delay should be honored. It can be retrieved using getCrawlDelay() on the robots parser.
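
To illustrate what reading that value involves, here is a minimal, dependency-free sketch that extracts Crawl-delay from a robots.txt body. `findCrawlDelay` is a hypothetical helper written for this example; it is not the crawler's API and not the robots parser's implementation, and its user-agent matching (a simple substring check) is a simplifying assumption.

```javascript
// Hypothetical helper for illustration only; not part of headless-chrome-crawler
// or of the robots parser it uses.
// Returns the Crawl-delay value as written in robots.txt for the given user
// agent, or undefined when no matching group declares one.
function findCrawlDelay(robotsTxt, userAgent) {
  const ua = userAgent.toLowerCase();
  let groupAgents = [];      // user-agent tokens of the group being read
  let readingAgents = false; // true while consecutive User-agent lines continue
  let specific;              // first delay found in a group naming this agent
  let wildcard;              // first delay found in a "*" group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (!readingAgents) groupAgents = []; // a new group starts here
      groupAgents.push(value.toLowerCase());
      readingAgents = true;
      continue;
    }
    readingAgents = false;

    if (field === "crawl-delay") {
      const delay = Number(value);
      // Ignore empty, non-numeric, or negative values.
      if (value === "" || !Number.isFinite(delay) || delay < 0) continue;
      // Assumption: an agent matches when its token is a substring of our UA.
      const named = groupAgents.some((a) => a && a !== "*" && ua.includes(a));
      if (named) {
        if (specific === undefined) specific = delay;
      } else if (groupAgents.includes("*")) {
        if (wildcard === undefined) wildcard = delay;
      }
    }
  }
  // A group that names the agent takes precedence over the "*" group.
  return specific !== undefined ? specific : wildcard;
}
```

The function returns the raw number only; what unit it represents is a separate decision for the caller.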

What is the motivation / use case for changing the behavior?

A bot is bound to respect all the directives of the robots.txt.

Apr 03 '18 14:04 panthony

@panthony Crawl-Delay is not part of the standard, so there is no way we can tell whether the number is in seconds, minutes, hours or days. Providing your own robots.txt is probably the most direct solution to your use case: https://github.com/yujiosaka/headless-chrome-crawler/issues/192

Apr 03 '18 15:04 yujiosaka

@yujiosaka You are right, this is not part of the standard.

But it looks like everyone agrees that it is expected to be a number of seconds, and if the crawler does not obey it out of the box, we should have some way to enforce it.

It would be sad to be banned from accessing a site because we did not obey their rules :)

I do not quite see how providing a robots.txt could be a solution?

Or did you mean that I could configure the crawler's delay according to the robots.txt I provide?
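
A rough sketch of what enforcing the delay on the caller's side could look like, assuming the value is in seconds (the common convention, not something a standard guarantees). `crawlDelayToMs` and `throttle` are hypothetical names invented for this example, and the 60-second cap is an arbitrary safeguard, not a recommendation from the crawler.

```javascript
// Hypothetical helpers for illustration only; not part of headless-chrome-crawler.

// Assumption: Crawl-delay is expressed in seconds. The result is capped so an
// absurd value in robots.txt cannot stall the crawl indefinitely.
function crawlDelayToMs(crawlDelay, maxMs = 60000) {
  if (!Number.isFinite(crawlDelay) || crawlDelay <= 0) return 0;
  return Math.min(Math.round(crawlDelay * 1000), maxMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps an async task so that calls run one at a time and consecutive calls
// start at least delayMs apart.
function throttle(task, delayMs) {
  let nextStart = 0;             // earliest time the next call may start
  let queue = Promise.resolve(); // chains calls so they never overlap
  return (...args) => {
    const run = queue.then(async () => {
      const wait = nextStart - Date.now();
      if (wait > 0) await sleep(wait);
      nextStart = Date.now() + delayMs;
      return task(...args);
    });
    queue = run.catch(() => {}); // a failed task must not block later ones
    return run;
  };
}
```

With something like this, a Crawl-delay of 2 would become `throttle(fetchPage, crawlDelayToMs(2))`, where `fetchPage` stands in for whatever function issues the request. Alternatively, the millisecond value could be passed to the crawler's own delay setting, if it exposes one that takes milliseconds.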

Apr 04 '18 07:04 panthony
