
Lots of regular expressions in the resources module seem to be poorly optimized

Open besief opened this issue 11 years ago • 4 comments

Hello everyone,

we have been using uadetector in production for a while now and always noticed how it soaked up most of our CPU time (we categorize about 3k user agents a second at peak times). Now I'm taking a look into this and from what I see, it seems like there isn't any particular effort put into the regex patterns that are being used.

Just some random examples:

device_reg id 95 has (P1000|P1010|P3100|P3105|P3110|P3113|P5100|P5110|P5113|P5200|P5210|P6200|P6201|P6210|P6211|P6800|P6810|P7110|P7300|P7310|P7320|P7500|P7510|P7511), where at least the P could be pulled out.

device_reg 115 has (672\.0\.2|672\.0\.8|672\.1\.12|672\.1\.13|672\.1\.14|672\.1\.15), where the shared prefix 672\. could be pulled out first, and then 1\.1 in one sub-expression and 0\. in another.
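
To make the factoring concrete, here is a minimal Java sketch that compares the alternation quoted above with one possible hand-factored form. The class name, the factored pattern, and the sample inputs are my own assumptions for illustration, not anything taken from uas.xml, and whether the factored form is measurably faster would still need benchmarking.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class FactoredAlternation {
    // The alternation from device_reg 115, as quoted in this issue.
    static final Pattern ORIGINAL = Pattern.compile(
        "(672\\.0\\.2|672\\.0\\.8|672\\.1\\.12|672\\.1\\.13|672\\.1\\.14|672\\.1\\.15)");

    // One possible factored form (an assumption): the shared prefix "672."
    // is matched once, then "0." and "1.1" each get their own branch.
    static final Pattern FACTORED = Pattern.compile(
        "(672\\.(?:0\\.[28]|1\\.1[2-5]))");

    // True when both patterns agree on whether the input contains a match.
    static boolean sameResult(String input) {
        return ORIGINAL.matcher(input).find() == FACTORED.matcher(input).find();
    }

    public static void main(String[] args) {
        // Made-up sample inputs, not real user agent strings.
        List<String> samples = List.of(
            "672.0.2", "672.0.8", "672.1.12", "672.1.15",
            "672.0.3", "672.1.16", "672.1.1", "x 672.1.13 y");
        for (String s : samples) {
            System.out.println(s + " -> same result: " + sameResult(s));
        }
    }
}
```

A check like this only shows agreement on the inputs you feed it, so a real rewrite would want a much larger corpus of user agents.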

os 34 has three patterns registered, that could easily be combined into one

            <operating_system_reg>
                <order>209</order>
                <os_id>34</os_id>
                <regstring>/NokiaN97/si</regstring>
            </operating_system_reg>
            <operating_system_reg>
                <order>210</order>
                <os_id>34</os_id>
                <regstring>/Nokia.*XpressMusic/si</regstring>
            </operating_system_reg>
            <operating_system_reg>
                <order>211</order>
                <os_id>34</os_id>
                <regstring>/NokiaE66/si</regstring>
            </operating_system_reg>
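
As a rough sketch of what combining these three entries could look like, the Java snippet below checks a single merged pattern against the three separate ones. It assumes the /si suffix maps to case-insensitive plus dot-matches-newline, that the patterns are applied with search semantics, and that no entry for a different os_id sits between orders 209 and 211. The class name, the merged pattern, and the sample strings are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class CombinedNokia {
    // Assumed Java equivalent of the /si modifiers in uas.xml.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // The three patterns registered for os_id 34, as listed above.
    static final List<Pattern> SEPARATE = List.of(
        Pattern.compile("NokiaN97", FLAGS),
        Pattern.compile("Nokia.*XpressMusic", FLAGS),
        Pattern.compile("NokiaE66", FLAGS));

    // One possible merged form (an assumption): the cheap literal branches
    // come first, the wildcard branch last.
    static final Pattern COMBINED =
        Pattern.compile("Nokia(?:N97|E66|.*XpressMusic)", FLAGS);

    static boolean matchesAnySeparate(String ua) {
        return SEPARATE.stream().anyMatch(p -> p.matcher(ua).find());
    }

    static boolean matchesCombined(String ua) {
        return COMBINED.matcher(ua).find();
    }

    public static void main(String[] args) {
        // Made-up sample strings, not taken from a real traffic log.
        for (String ua : List.of("NokiaN97-1/12.0.024",
                                 "Nokia5800d-1 XpressMusic",
                                 "nokiae66",
                                 "Nokia6300")) {
            System.out.println(ua + " -> separate=" + matchesAnySeparate(ua)
                + ", combined=" + matchesCombined(ua));
        }
    }
}
```

The merged pattern turns three matcher runs into one, but it only preserves behaviour as long as the order of evaluation relative to other OS entries does not change.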

And generally there are lots of .*s (especially in the device category patterns), which will backtrack a lot.

Now I am not sure whether this is intentional, to achieve better readability and clarity as a kind of documentation (although there are some really glaring mistakes like /.*windows 95.*/si), with the expectation that people will mostly supply their own customized uas.xml, or whether there just hasn't been anyone interested in optimizing all those regexes. If it is the latter, I might just take a shot at it.
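
To illustrate why /.*windows 95.*/si stands out, here is a small Java sketch. It assumes the patterns are applied with search semantics (Matcher.find()); under that assumption the leading and trailing .* cannot change whether a match is found and only give the engine extra work. If the library used Matcher.matches() instead, the wildcards would be required. The class name and sample strings are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class RedundantWildcards {
    // Assumed Java equivalent of the /si modifiers in uas.xml.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // The pattern as quoted in this issue.
    static final Pattern WITH_WILDCARDS = Pattern.compile(".*windows 95.*", FLAGS);

    // The same pattern with the surrounding wildcards removed.
    static final Pattern PLAIN = Pattern.compile("windows 95", FLAGS);

    // True when both patterns agree under find() semantics.
    static boolean sameResult(String ua) {
        return WITH_WILDCARDS.matcher(ua).find() == PLAIN.matcher(ua).find();
    }

    public static void main(String[] args) {
        // Made-up sample strings in the style of old browser user agents.
        for (String ua : List.of(
                "Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)",
                "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)",
                "windows 95")) {
            System.out.println(ua + " -> same result: " + sameResult(ua));
        }
    }
}
```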

besief avatar Nov 28 '14 10:11 besief

@besief since the free UAS database is no longer being offered, the question is where this project is going anyhow. As the database had a Creative Commons BY license, it might be interesting to keep it going and to optimize and update it. You might want to check your production system.

HaraldWalker avatar Dec 03 '14 17:12 HaraldWalker

Any contributions are welcome. Pull requests that make existing patterns more accurate are appreciated. The latest free database can be found here (caution: this file is very large and should be opened with a sophisticated editor): https://github.com/before/uadetector/blob/master/modules/uadetector-resources/src/main/resources/net/sf/uadetector/resources/uas.xml

arouel avatar Dec 04 '14 19:12 arouel

@before Do you have tests to validate that regex changes don't result in incorrect results?

HaraldWalker avatar Dec 06 '14 18:12 HaraldWalker

Ha ha, us too. We ran a web vulnerability scanner against our server, and long after it had finished, the CPU was still at 100%+. It turned out to be the issue described above.

fancellu avatar Dec 07 '14 17:12 fancellu