Robots.txt-Parser-Class

Parsing performance issue

Open JanPetterMG opened this issue 8 years ago • 6 comments

When parsing large robots.txt files, the process can take several minutes to finish, even with 100% CPU power dedicated.

This is a problem in general, and is not related to any specific robots.txt file.

With multiple parallel processes, it may even lock your CPU at 100% for hours if the script doesn't get terminated. This is what happened to me a few days ago... I'm not talking about the RPi, but about dedicated servers using Intel Xeon CPUs.

Example URL: http://visitstavern.no/robots.txt (just one of many): 4170 lines, 236,999 characters, HTTP 200, an HTML document without any valid rule.

The problem is that the script loops through each character, so if, let's say, one of the lines is 300 characters long, it will loop through all 300 before it continues to the next line, even if there is no directive on that line...

Solution:

  1. Split the file into separate lines using \r\n|\r|\n
  2. Strip whitespace with array_map('trim', $array)
  3. Remove comments (#)
  4. Go through each item (line) in the array: if it starts with a known directive, parse it, otherwise simply skip it
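
For illustration only, here is a rough sketch of those four steps. It is written in Python rather than the library's PHP, and it is not the library's actual code. The directive list and the function name `extract_directives` are assumptions made up for this example, and comments are stripped before trimming (steps 2 and 3 swapped) so that whitespace left in front of a `#` is removed as well.

```python
import re

# Hypothetical directive list for this sketch; the library's real set may differ.
KNOWN_DIRECTIVES = frozenset(
    {"user-agent", "allow", "disallow", "sitemap", "crawl-delay", "host"}
)


def extract_directives(content: str) -> list[tuple[str, str]]:
    """Return (directive, value) pairs, skipping every line that is not a known directive."""
    rules = []
    # Step 1: split on any newline convention (\r\n, \r or \n).
    for line in re.split(r"\r\n|\r|\n", content):
        # Step 3: drop everything from the first '#' onward.
        # Simplification: this also cuts values that legitimately contain '#'.
        line = line.split("#", 1)[0]
        # Step 2: trim surrounding whitespace.
        line = line.strip()
        # Step 4: only look at the text before the first ':'.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no "name: value" shape, e.g. an HTML line
        name = name.strip().lower()
        if name in KNOWN_DIRECTIVES:
            rules.append((name, value.strip()))
    return rules
```

Each line is handled with a few built-in string operations instead of a per-character loop in script code, so a long HTML line without a directive is rejected after a single split and lookup.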

The result? Parsing these large files in a second or two, instead of spending minutes or even hours wasting CPU power...

JanPetterMG, Mar 16 '16 18:03

To get rid of the performance issues completely, both this issue and #75 need to be fixed.

I also discovered this interesting robots.txt file: http://www.goldmansachs.com/robots.txt At the time of writing, it's 415 KB (425 397 bytes)... Perfect for benchmark tests :tada:

JanPetterMG, Aug 08 '16 17:08

@JanPetterMG could you please check this issue again with the new version? It should perform a lot better now.

t1gor, Oct 29 '21 10:10

I've transitioned to a competing library, which is relatively deeply integrated into a web crawler running in a production environment. I would love to benchmark and compare the two to see if there are any real-world differences (or bugs), but finding the time to do so is the main issue.

I'll keep you updated if and when I get the time to do so...

JanPetterMG, Oct 29 '21 15:10

> I've transitioned to a competing library

Sad to hear that :) But thanks anyway.

t1gor, Dec 07 '21 07:12

Give me a week or two, and I'll implement this in production. Sorry for dragging this out...

JanPetterMG, Jan 06 '22 23:01

Doing some quick testing, and honestly, I still don't know if I can ever trust this library in production... I found a pretty big, fundamental and obvious bug: #93

Literally in the very first execution of this library in many years... Now I don't even want to consider testing this library further until both this and any other known bugs are fixed. Additionally, the test coverage should be raised significantly.

I hope no one is running this library in a production environment, because they shouldn't.

I'm so sorry, I don't want to harass anyone, but this is a very fundamental and obvious bug!

JanPetterMG, Jan 07 '22 02:01