
Byte limit

Open · JanPetterMG opened this issue 8 years ago · 1 comment

Feature request: Limit the maximum number of bytes to parse.

A maximum file size may be enforced per crawler. Content which is after the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes (KB).

Source: Google

When forming the robots.txt file, you should keep in mind that the robot places a reasonable limit on its size. If the file size exceeds 32 KB, the robot assumes it allows everything.

Source: Yandex
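
Worth noting: the two sources above prescribe different behaviour for oversized files. Google ignores everything past the limit, while Yandex falls back to treating the whole file as allow-all. Which behaviour this library should mimic is a design choice; below is a minimal sketch of the Yandex-style fallback (normalizeOversizedRobotsTxt() is a hypothetical name, not existing API):

```php
<?php

// Yandex-style fallback (hypothetical helper, not part of this library):
// if the file exceeds the limit, treat it as if it allowed everything,
// instead of truncating it the way Google does.
function normalizeOversizedRobotsTxt($content, $byteLimit = 32768)
{
    // strlen() counts bytes, which is what the 32 KB limit refers to;
    // an empty robots.txt disallows nothing.
    return strlen($content) > $byteLimit ? '' : $content;
}
```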

  • [ ] Default limit of X bytes, e.g. 524,288 bytes (512 KB / 0.5 MB) - see the sketch below
  • [ ] User-defined limit override
  • [ ] Make sure the limit is reasonable; throw an exception if it's dangerously low, e.g. 24,576 bytes (24 KB)
  • [ ] Should be able to disable - no limit
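
Purely to illustrate the checklist above, a rough sketch of how the limit could be wired up. The constants, the applyByteLimit() helper, and the exception type are assumptions for illustration, not existing library API; the usage comment assumes the parser is constructed from a raw content string.

```php
<?php

// Sketch only - none of these names exist in the library yet.
const DEFAULT_BYTE_LIMIT = 524288; // proposed default: 512 KB
const MIN_BYTE_LIMIT     = 24576;  // anything below this is considered dangerously low

/**
 * Truncate robots.txt content to the byte limit before parsing.
 *
 * @param string   $content   Raw robots.txt content
 * @param int|null $byteLimit Custom limit in bytes, or null to disable the limit entirely
 * @return string
 * @throws \InvalidArgumentException if the limit is dangerously low
 */
function applyByteLimit($content, $byteLimit = DEFAULT_BYTE_LIMIT)
{
    if ($byteLimit === null) {
        return $content; // limit disabled - parse everything
    }
    if ($byteLimit < MIN_BYTE_LIMIT) {
        throw new \InvalidArgumentException(
            sprintf('Byte limit of %d bytes is dangerously low (minimum %d)', $byteLimit, MIN_BYTE_LIMIT)
        );
    }
    // Ignore everything past the limit, mirroring Google's behaviour.
    // strlen()/substr() operate on bytes, which is exactly what we want here.
    return substr($content, 0, $byteLimit);
}

// Usage (assuming the parser takes the raw robots.txt content as a string):
// $parser = new RobotsTxtParser(applyByteLimit($content));
```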

JanPetterMG · Aug 08 '16 17:08

At the moment, it's possible to generate large (fake or valid) robots.txt files with the aim of trapping the robots.txt crawler, slowing down the server, or even causing it to hang or crash.

It's also possible (depending on the setup) to trap the crawler in an infinite retry loop if the external code using this library doesn't handle repeated fatal errors correctly...
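
One way to defuse both issues is to never download more than the limit in the first place. Here is a defensive-fetch sketch using plain PHP streams (fetchRobotsTxt() is a hypothetical helper, not library code):

```php
<?php

// Hypothetical defensive fetch (not part of the library): stream the remote
// robots.txt and stop reading once the byte limit is reached, so a huge
// (fake or valid) file can't exhaust memory or stall the crawler.
// Requires allow_url_fopen to be enabled.
function fetchRobotsTxt($url, $byteLimit = 524288, $timeoutSeconds = 10)
{
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    $handle = @fopen($url, 'rb', false, $context);
    if ($handle === false) {
        return null; // let the caller decide what to do - no endless retries here
    }

    $content = '';
    while (!feof($handle) && strlen($content) < $byteLimit) {
        $chunk = fread($handle, 8192);
        if ($chunk === false || $chunk === '') {
            break;
        }
        $content .= $chunk;
    }
    fclose($handle);

    // Anything past the limit is simply never downloaded.
    return substr($content, 0, $byteLimit);
}
```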

Related to #62

JanPetterMG · Aug 08 '16 17:08