Robots.txt-Parser-Class
Byte limit
Feature request: Limit the maximum number of bytes to parse.
> A maximum file size may be enforced per crawler. Content beyond the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).

Source: Google
> When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything.

Source: Yandex
- [ ] Default limit of X bytes, e.g. 524,288 bytes (512 KB / 0.5 MB)
- [ ] User-defined limit override
- [ ] Make sure the limit is reasonable; throw an exception if it is dangerously low, e.g. 24,576 bytes (24 KB)
- [ ] Should be possible to disable the limit entirely (no limit); a sketch of these points follows this list
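A minimal sketch of how the checklist could be implemented, under the assumption of a standalone pre-processing helper. The names `ByteLimitException`, `RobotsTxtByteLimit` and `truncate()` are hypothetical and not part of this library's current API:

```php
<?php

// Hypothetical sketch only: ByteLimitException, RobotsTxtByteLimit and
// truncate() are not part of this library's API; they just illustrate
// the checklist above.

class ByteLimitException extends \RuntimeException
{
}

class RobotsTxtByteLimit
{
    // Assumed defaults: 512 KB cap, 24 KB sanity floor.
    const DEFAULT_LIMIT = 524288;
    const MINIMUM_LIMIT = 24576;

    /** @var int|null Null means the limit is disabled. */
    private $limit;

    public function __construct($limit = self::DEFAULT_LIMIT)
    {
        if ($limit !== null && $limit < self::MINIMUM_LIMIT) {
            throw new ByteLimitException(sprintf(
                'Byte limit of %d bytes is dangerously low (minimum is %d bytes)',
                $limit,
                self::MINIMUM_LIMIT
            ));
        }

        $this->limit = $limit;
    }

    /**
     * Drop everything past the configured limit, mirroring crawlers
     * that simply ignore content beyond their size cap.
     */
    public function truncate($content)
    {
        if ($this->limit === null) {
            return $content; // no limit configured
        }

        // substr() counts bytes, which matches the "bytes to parse" wording.
        return substr($content, 0, $this->limit);
    }
}
```

A user-supplied limit would go through the same constructor check, and passing `null` would disable it. The truncated string could then be handed to the existing parser unchanged, keeping the byte limit a pre-processing step rather than a change to the parsing logic itself.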
At the moment, it is possible to serve large (fake or valid) robots.txt files with the aim of trapping the robots.txt crawler, slowing down the server, or even causing it to hang or crash.
Depending on the setup, it is also possible to trap the crawler in an infinite retry loop if the external code using this library does not handle repeated fatal errors correctly; a defensive sketch follows.
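That retry problem ultimately has to be handled by the calling code rather than by this library. A minimal sketch of bounded retries, where the function name and the `$fetch`/`$parse` callables are made up for illustration:

```php
<?php

// Hypothetical defensive wrapper for the *calling* code: cap the number
// of retries so a repeatedly failing robots.txt fetch/parse cannot turn
// into an infinite retry loop.

function fetchAndParseRobotsTxt(callable $fetch, callable $parse, $maxAttempts = 3)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $parse($fetch());
        } catch (\Exception $e) {
            if ($attempt === $maxAttempts) {
                // Give up instead of retrying forever; the caller can
                // treat this host as "no robots.txt available" or reschedule.
                throw $e;
            }

            sleep($attempt); // simple linear back-off between attempts
        }
    }
}
```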
Related to #62