Not actually fast

Open pepijndevos opened this issue 11 years ago • 6 comments

I use this library on my logfiles. Half of the time is spent looking up IP addresses in an on-disk database, the other half is spent in httpagentparser. The time spent parsing the log file itself is marginal.

This change is obviously meant as a joke, but I suggest you do some profiling. Or I might even do some myself.

pepijndevos avatar Aug 05 '14 08:08 pepijndevos

Extracted bits from a profile of a small sample run of my application.

Tue Aug  5 10:42:46 2014    prof

         36372505 function calls (34746356 primitive calls) in 25.747 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

  6482905    5.321    0.000    7.425    0.000 __init__.py:72(checkWords)
    99737    2.871    0.000   13.681    0.000 __init__.py:598(detect)
  6582642    2.843    0.000   10.721    0.000 __init__.py:59(detect)
    36635    0.124    0.000    0.196    0.000 __init__.py:84(getVersion)
    31946    0.072    0.000    0.128    0.000 __init__.py:488(getVersion)
    99737    0.055    0.000    0.055    0.000 __init__.py:218(checkWords)
    99737    0.052    0.000    0.069    0.000 __init__.py:30(__iter__)
...
lots of getVersion
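
For anyone who wants to gather a similar profile, here is a minimal sketch using the standard library's cProfile and pstats. `parse_logs` is a hypothetical stand-in for the real workload, not code from this thread. Sorting by `"tottime"` gives the "Ordered by: internal time" view shown above.

```python
import cProfile
import pstats


def parse_logs():
    # Hypothetical stand-in for the real log-processing loop.
    return sum(len(str(i)) for i in range(10000))


profiler = cProfile.Profile()
profiler.enable()
parse_logs()
profiler.disable()

# Build the statistics straight from the profiler object.
# profiler.dump_stats("prof") would write them to a file instead.
stats = pstats.Stats(profiler)

# "tottime" is time spent inside each function, excluding callees.
stats.sort_stats("tottime").print_stats(10)
```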

pepijndevos avatar Aug 05 '14 08:08 pepijndevos

It seems to be a result of how the library works. It invokes all the detectors one by one to see if they match. This means speed decreases linearly as more browsers are added. So I actually made it twice as slow by contributing a ton of bots and mobile browsers.

So the only way to make a real difference is to detect fewer browsers, or to do a major refactor.

It's imaginable to arrange browsers in a tree. For example, if a mobile OS is detected, all desktop detectors could be ignored. Or if Webkit is detected, all Gecko and Trident detectors could be ignored.
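
A rough sketch of the tree idea, under loud assumptions: the `Node` class, the token strings, and the grouping by rendering engine are all made up for illustration and are not httpagentparser's actual detectors. The point is only that a failed parent check skips its whole subtree.

```python
class Node:
    """Hypothetical detector: matches if `word` occurs in the agent string."""

    def __init__(self, name, word, children=()):
        self.name = name
        self.word = word
        self.children = list(children)

    def detect(self, agent, found, counter):
        counter[0] += 1  # count how many detectors actually ran
        if self.word not in agent:
            return  # parent failed: none of the children are checked
        found.append(self.name)
        for child in self.children:
            child.detect(agent, found, counter)


# Illustrative tree only; real agent strings need more careful rules.
TREE = [
    Node("WebKit", "AppleWebKit", [Node("Chrome", "Chrome/"), Node("Safari", "Safari/")]),
    Node("Gecko", "Gecko/", [Node("Firefox", "Firefox/"), Node("SeaMonkey", "SeaMonkey/")]),
    Node("Trident", "Trident/", [Node("IE", "MSIE")]),
]


def detect_tree(agent):
    """Return the matched names and the number of detectors that ran."""
    found, counter = [], [0]
    for root in TREE:
        root.detect(agent, found, counter)
    return found, counter[0]


agent = "Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"
print(detect_tree(agent))
```

For this Firefox string only 5 of the 8 detectors run, because the WebKit and Trident subtrees are skipped. The saving would grow with the number of detectors hanging under each branch.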

Another wild idea would be to flatten everything into a humongous regex/state machine. This requires more thought and design.
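
A minimal sketch of the single-regex idea, assuming a hypothetical table of literal tokens (the `TOKENS` dict is invented here, not taken from the library). Each token becomes a named group in one big alternation, so the agent string is scanned once instead of once per detector. Python's `re` is a backtracking engine rather than a true state machine, so whether this is actually faster would have to be measured.

```python
import re

# Hypothetical token table: browser name -> literal marker in the agent string.
TOKENS = {
    "Firefox": "Firefox/",
    "Chrome": "Chrome/",
    "MSIE": "MSIE ",
    "Googlebot": "Googlebot/",
}

# Group names must be valid identifiers, so use g0, g1, ... and map them back.
GROUPS = {"g%d" % i: name for i, name in enumerate(TOKENS)}

COMBINED = re.compile("|".join(
    "(?P<%s>%s)" % (group, re.escape(TOKENS[name]))
    for group, name in GROUPS.items()
))


def detect_combined(agent):
    """Return the sorted names of all tokens found in one pass over agent."""
    # m.lastgroup is the name of the alternative that matched.
    return sorted({GROUPS[m.lastgroup] for m in COMBINED.finditer(agent)})


print(detect_combined("Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0"))
```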

Related: https://github.com/clojure/core.match/wiki/Understanding-the-algorithm

pepijndevos avatar Aug 05 '14 09:08 pepijndevos

Will look into this once I have a little more free time from my current work. Unsure how a regex-based solution will perform. Also, if you have any other ideas or POC code, please feel free to share.

shon avatar Aug 12 '14 13:08 shon

One quick solution is moving the less popular agents to the existing more.py and making them optional. It would be interesting to see if that makes hap faster.

This should really be an issue so more people can notice.

shon avatar Oct 24 '14 04:10 shon

That would make it faster, at the cost of detecting less. But yeah, if you just want to detect the 5 major browsers on 3 major OSes, that's fine.

pepijndevos avatar Oct 24 '14 06:10 pepijndevos

I have to admit, it's very slow...

lenisko avatar Feb 21 '18 20:02 lenisko