httpagentparser
Not actually fast
I use this library on my logfiles. Half of the time is spent looking up IP addresses in an on-disk database; the other half is spent in httpagentparser. The time spent parsing the log file itself is marginal.
This change is obviously meant as a joke, but I suggest you do some profiling. Or I might even do some myself.
Extracted bits from a profile of a small sample run of my application.
Tue Aug 5 10:42:46 2014    prof
36372505 function calls (34746356 primitive calls) in 25.747 seconds
Ordered by: internal time
   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
  6482905    5.321    0.000    7.425    0.000  __init__.py:72(checkWords)
    99737    2.871    0.000   13.681    0.000  __init__.py:598(detect)
  6582642    2.843    0.000   10.721    0.000  __init__.py:59(detect)
    36635    0.124    0.000    0.196    0.000  __init__.py:84(getVersion)
    31946    0.072    0.000    0.128    0.000  __init__.py:488(getVersion)
    99737    0.055    0.000    0.055    0.000  __init__.py:218(checkWords)
    99737    0.052    0.000    0.069    0.000  __init__.py:30(__iter__)
...
lots of getVersion
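For anyone who wants to reproduce a profile like this, here is a minimal sketch using the standard library's `cProfile` and `pstats`, sorted by internal time as in the listing above. The helpers `parse_all` and `profile_call` and the stand-in `fake_parse` are made up for illustration; substitute the real parsing call from your own application.

```python
import cProfile
import io
import pstats


def parse_all(user_agents, parse):
    """Run `parse` over every user agent string and collect the results."""
    return [parse(ua) for ua in user_agents]


def profile_call(func, *args, top=10):
    """Profile func(*args); return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is the key that pstats reports as "internal time".
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    # Stand-in parser (an assumption) so the sketch runs without any
    # third-party package; swap in the real parsing call here.
    def fake_parse(ua):
        return {"is_bot": "bot" in ua.lower()}

    sample = ["Mozilla/5.0 (X11; Linux x86_64)", "ExampleBot/1.0"] * 1000
    _, report = profile_call(parse_all, sample, fake_parse)
    print(report)
```

Sorting by `tottime` rather than `cumtime` is what surfaces small functions that are cheap per call but invoked millions of times.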
It seems to be a result of how the library works. It invokes all the detectors one by one to see if they match, so the time per user agent grows linearly as more browsers are added. In other words, I actually made it twice as slow by contributing a ton of bots and mobile browsers.
So the only way to make a real difference is to detect fewer browsers, or to do a major refactor.
One option would be to arrange the detectors in a tree. For example, if a mobile OS is detected, all desktop detectors could be skipped. Or if WebKit is detected, all Gecko and Trident detectors could be skipped.
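A rough sketch of what such a tree could look like, assuming each node is just a name, a marker substring, and a list of children. The tree contents and the `detect` function here are hypothetical and much simpler than the library's real detectors; the point is only that a match at one level prunes the sibling branches.

```python
# Hypothetical detector tree: each node is (name, marker substring, children).
# The markers are simplified illustrations, not httpagentparser's real rules.
DETECTOR_TREE = [
    ("WebKit", "AppleWebKit", [
        ("Chrome", "Chrome/", []),
        ("Safari", "Safari/", []),
    ]),
    ("Gecko", "Gecko/", [
        ("Firefox", "Firefox/", []),
    ]),
    ("Trident", "Trident/", [
        ("Internet Explorer", "MSIE", []),
    ]),
]


def detect(user_agent, nodes=DETECTOR_TREE):
    """Return (path of matched names, number of marker checks performed)."""
    checks = 0
    for name, marker, children in nodes:
        checks += 1
        if marker in user_agent:
            # First match wins: descend into this branch and skip all
            # remaining siblings and their subtrees.
            path, child_checks = detect(user_agent, children)
            return [name] + path, checks + child_checks
    return [], checks


if __name__ == "__main__":
    ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36")
    print(detect(ua))
```

With six detectors in this toy tree, the Chrome string above is resolved after two checks, where a flat scan that runs every detector would perform all six. How much this helps in practice depends on how well real user agents partition into branches.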
Another wild idea would be to flatten everything into a humongous regex/state machine. This requires more thought and design.
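To make the flattening idea concrete, here is a small sketch that joins a few patterns into one alternation with named groups, so each user agent is scanned by a single compiled regex. The patterns and names in `PATTERNS` are invented for illustration and are not the library's rules. Whether this is actually faster is an open question: Python's `re` is a backtracking engine, not a true state machine, so it would need to be measured.

```python
import re

# Hypothetical patterns for illustration; not httpagentparser's real rules.
# Each entry has an inner named group that captures the version string.
PATTERNS = {
    "firefox": r"Firefox/(?P<firefox_version>[\d.]+)",
    "chrome": r"Chrome/(?P<chrome_version>[\d.]+)",
    "safari": r"Version/(?P<safari_version>[\d.]+) Safari/",
    "msie": r"MSIE (?P<msie_version>[\d.]+)",
}

# One big alternation; every alternative is wrapped in its own named group
# so we can tell afterwards which one matched.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)


def detect(user_agent):
    """Return (browser name, version) for the first match, or None."""
    match = COMBINED.search(user_agent)
    if match is None:
        return None
    # Exactly one alternative took part in the match; find out which.
    for name in PATTERNS:
        if match.group(name) is not None:
            return name, match.group(name + "_version")
    return None


if __name__ == "__main__":
    print(detect("Mozilla/5.0 (X11; Linux x86_64; rv:31.0) "
                 "Gecko/20100101 Firefox/31.0"))
```

Note that `search` returns the leftmost match in the string, so with overlapping markers (a Chrome user agent also contains `Safari/`) the result depends on where the tokens appear, which is part of the design work such an approach would need.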
Related: https://github.com/clojure/core.match/wiki/Understanding-the-algorithm
I will look into this once I have a little free time from my current work. I'm unsure how a regex-based solution will perform. Also, if you have any other ideas or proof-of-concept code, please feel free to share.
One quick solution is to move the less popular agents into the existing more.py and make them optional. It would be interesting to see whether that makes httpagentparser faster.
This should really be an issue so more people can notice it.
That would make it faster, at the cost of detecting less. But yeah, if you just want to detect the 5 major browsers on 3 major OSes, that's fine.
I have to admit, it's very slow...