Robots.txt-Parser-Class
Parsing performance issue
When parsing large robots.txt files, the process can take several minutes to finish, even with 100% CPU power dedicated.
This is a problem in general, and is not related to any specific robots.txt file.
With multiple parallel processes, it may even lock your CPU at 100% for hours if the script doesn't get terminated. This is what happened to me a few days ago... I'm not talking about the RPi, but dedicated servers using Intel Xeon CPUs.
Example URL: http://visitstavern.no/robots.txt (just one of many) 4170 lines, 236,999 characters, HTTP 200, HTML document without any valid rule.
The problem is that the script loops through each character, so if, say, one of the lines is 300 characters long, it will loop through all 300 before it continues to the next line, even if there is no directive on that line...
Solution:
- Split the file into separate lines using `\r\n|\r|\n`
- Strip whitespace with `array_map('trim', $array)`
- Remove comments (`#`)
- Parse through each item (line) in the array: if it starts with any known directive, parse it, otherwise simply skip it
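The steps above could be sketched roughly as follows. This is only an illustration, not the library's actual code: the function name `parseRobotsLines` and the list of known directives are assumptions made for the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library's API).
 * Returns [directive, value] pairs for lines starting with a known directive.
 */
function parseRobotsLines(string $content): array
{
    // Assumed directive list, for illustration only.
    $known = ['user-agent', 'allow', 'disallow', 'sitemap', 'crawl-delay', 'host'];

    // 1. Split on any line-ending style.
    $lines = preg_split('/\r\n|\r|\n/', $content) ?: [];

    // 2. Strip surrounding whitespace from every line.
    $lines = array_map('trim', $lines);

    $rules = [];
    foreach ($lines as $line) {
        // 3. Remove comments: drop everything from the first "#".
        $hashPos = strpos($line, '#');
        if ($hashPos !== false) {
            $line = rtrim(substr($line, 0, $hashPos));
        }

        // 4. Skip lines that cannot be a "directive: value" pair.
        $colonPos = strpos($line, ':');
        if ($colonPos === false) {
            continue;
        }
        $directive = strtolower(rtrim(substr($line, 0, $colonPos)));
        if (!in_array($directive, $known, true)) {
            continue; // e.g. HTML markup or an unknown directive
        }

        $rules[] = [$directive, ltrim(substr($line, $colonPos + 1))];
    }

    return $rules;
}
```

Each line is inspected once with a couple of `strpos` calls, so a long HTML line without a known directive is discarded without walking it character by character in PHP code.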
The result? Parsing these large files in a second or two, instead of spending minutes or even hours wasting CPU power...
To get rid of the performance issues completely, both this issue and #75 need to be fixed.
I also discovered this interesting robots.txt file: http://www.goldmansachs.com/robots.txt
At time of writing, it's 415 KB (425 397 bytes)... Perfect for benchmark tests :tada:
@JanPetterMG could you please check this issue again with the new version? It should perform a lot better now.
I've transitioned to a competing library, which is relatively deeply integrated into a web crawler running in a production environment. I would love to benchmark and compare to see if there are any real-world differences (or bugs), but finding the time to do so is the main issue.
I'll keep you updated if and when I get the time to do so...
> I've transitioned to a competing library
Sad to hear that :) But thanks anyway
Give me a week or two, and I'll implement this in production. Sorry for dragging this out...
After some quick testing, I honestly still don't know if I can ever trust this library in production... I found a pretty big, fundamental and obvious bug: #93
Literally in the very first execution of this library in many years... Now I don't even want to consider testing this library further until both this and any other known bugs are fixed. Additionally, the test coverage should be raised significantly.
I hope no one is running this library in a production environment, because they shouldn't.
I'm so sorry, I don't want to harass, but this is a very fundamental and obvious bug!