
Suggestion: robots.txt shouldn't be reparsed every time

Open · panthony opened this issue on Apr 04, 2018 · 0 comments

What is the current behavior?

The robots.txt file is re-parsed for every request, but these files can be big.

Today, Google only reads the first 500 KB and ignores the rest.
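For illustration, a minimal sketch of applying the same cap before parsing; the 500 * 1024 figure and the byte-boundary truncation are assumptions, not anything the crawler currently does:

```typescript
// Cap the robots.txt body before parsing, mirroring the 500 KB limit
// attributed to Google above. Truncating at a byte boundary may split a
// multi-byte character, which is acceptable for robots.txt directives.
const MAX_ROBOTS_BYTES = 500 * 1024;

function truncateRobotsTxt(body: string): string {
  const bytes = Buffer.from(body, "utf8");
  if (bytes.length <= MAX_ROBOTS_BYTES) return body;
  // Ignore everything past the first 500 KB.
  return bytes.subarray(0, MAX_ROBOTS_BYTES).toString("utf8");
}
```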

What is the expected behavior?

Maybe the crawler could cache up to N parsed robots.txt instances. That should give a strong cache-hit rate without letting the cache grow forever.
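A minimal sketch of what this could look like, assuming one parsed robots.txt per origin; the class name, the LRU eviction policy, N = 100, and the stand-in parser are all illustrative, not the crawler's actual API:

```typescript
// Bounded LRU cache: Map iterates in insertion order, so re-inserting a key
// on access moves it to the "most recently used" end.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private readonly maxEntries: number) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      // Refresh recency by re-inserting the entry.
      this.map.delete(key);
      this.map.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first key in iteration order).
      const oldest = this.map.keys().next().value as string;
      this.map.delete(oldest);
    }
  }
}

// Usage sketch: parse once per origin, then reuse the cached result.
const robotsCache = new LruCache<string[]>(100); // N = 100 is an arbitrary example

function getDisallowRules(origin: string, robotsTxt: string): string[] {
  let rules = robotsCache.get(origin);
  if (rules === undefined) {
    // Stand-in for the real parser: collect Disallow paths.
    rules = robotsTxt
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.toLowerCase().startsWith("disallow:"))
      .map((line) => line.slice("disallow:".length).trim());
    robotsCache.set(origin, rules);
  }
  return rules;
}
```

With a bound of N entries, the cache keeps the robots.txt files of the origins being crawled most actively while evicting ones that are no longer visited.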

What is the motivation / use case for changing the behavior?

Although I couldn't find that robots.txt again, I have already seen files that were easily > 1 MB.

Overall performance could take a serious hit if such a file were re-parsed for every single request.
