Lots of regular expressions in the resources module seem to be poorly optimized
Hello everyone,
We have been using uadetector in production for a while now and have always noticed how it soaks up most of our CPU time (we categorize about 3k user agents a second at peak times). I am now taking a closer look, and from what I can see, no particular effort seems to have gone into optimizing the regex patterns being used.
Just some random examples:
device_reg id 95 has (P1000|P1010|P3100|P3105|P3110|P3113|P5100|P5110|P5113|P5200|P5210|P6200|P6201|P6210|P6211|P6800|P6810|P7110|P7300|P7310|P7320|P7500|P7510|P7511), where at least the P could be pulled out.
device_reg 115 has (672\.0\.2|672\.0\.8|672\.1\.12|672\.1\.13|672\.1\.14|672\.1\.15), where the 672. could be pulled out first, and then 0. in one sub-expression and 1.1 in another.
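To make the factoring concrete, here is a minimal sketch in Java. It uses only the alternation fragment quoted above (not the full device_reg entry), and the factored form is my own rewrite, not something taken from uas.xml. The class and method names are made up for illustration.

```java
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixFactoring {
    // Alternation fragment as quoted from device_reg 115 in this post.
    static final Pattern ORIGINAL = Pattern.compile(
            "(672\\.0\\.2|672\\.0\\.8|672\\.1\\.12|672\\.1\\.13|672\\.1\\.14|672\\.1\\.15)");

    // Hand-factored rewrite (assumption: meant to be equivalent, not from uas.xml).
    // The shared "672." is matched once, then the "0." and "1.1" branches.
    static final Pattern FACTORED = Pattern.compile(
            "(672\\.(?:0\\.[28]|1\\.1[2-5]))");

    // Returns the text captured by group 1 for the first match, or null if none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // True when both patterns capture the same text (or both fail) on the input.
    static boolean agree(String input) {
        return Objects.equals(firstMatch(ORIGINAL, input), firstMatch(FACTORED, input));
    }

    public static void main(String[] args) {
        String[] samples = {"672.0.2", "672.0.8", "672.1.12", "672.1.15", "672.1.16", "672.2.2"};
        for (String s : samples) {
            System.out.println(s + " agree=" + agree(s));
        }
    }
}
```

The same idea applies to the P-prefixed list in device_reg 95. Whether the factored form is actually faster in java.util.regex would still need to be measured.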
os 34 has three registered patterns that could easily be combined into one:
<operating_system_reg>
<order>209</order>
<os_id>34</os_id>
<regstring>/NokiaN97/si</regstring>
</operating_system_reg>
<operating_system_reg>
<order>210</order>
<os_id>34</os_id>
<regstring>/Nokia.*XpressMusic/si</regstring>
</operating_system_reg>
<operating_system_reg>
<order>211</order>
<os_id>34</os_id>
<regstring>/NokiaE66/si</regstring>
</operating_system_reg>
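A rough sketch of what the combined pattern could look like, written in Java. It assumes the `/si` suffix corresponds to `Pattern.DOTALL` and `Pattern.CASE_INSENSITIVE` and that matching is an unanchored search via `find()`. The merged regex and the sample strings are my own illustrations, not entries from uas.xml.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MergedOsPattern {
    // Assumed Java equivalent of the "/si" modifiers in the regstrings.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // The three patterns registered for os_id 34 (orders 209 to 211).
    static final List<Pattern> ORIGINAL = Arrays.asList(
            Pattern.compile("NokiaN97", FLAGS),
            Pattern.compile("Nokia.*XpressMusic", FLAGS),
            Pattern.compile("NokiaE66", FLAGS));

    // One possible merged form: the shared "Nokia" prefix is matched once.
    static final Pattern MERGED =
            Pattern.compile("Nokia(?:N97|E66|.*XpressMusic)", FLAGS);

    // True if any of the three original patterns is found in the user agent.
    static boolean matchesOriginal(String ua) {
        return ORIGINAL.stream().anyMatch(p -> p.matcher(ua).find());
    }

    // True if the single merged pattern is found in the user agent.
    static boolean matchesMerged(String ua) {
        return MERGED.matcher(ua).find();
    }

    public static void main(String[] args) {
        // Illustrative strings only, not real captured user agents.
        String[] samples = {"NokiaN97-1/20.0.019", "Nokia5800 XpressMusic", "NokiaE66", "NokiaN95"};
        for (String ua : samples) {
            System.out.println(ua + " original=" + matchesOriginal(ua)
                    + " merged=" + matchesMerged(ua));
        }
    }
}
```

Because the three entries have consecutive order values and the same os_id, merging them should not change which operating system wins for a given user agent.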
And in general there are lots of .* (especially in the device category patterns), which will cause a lot of backtracking.
Now I am not sure whether this is intentional, to achieve better readability and clarity as a kind of documentation (although there are some really glaring mistakes like /.*windows 95.*/si), with the expectation that people mostly supply their own customized uas.xml, or whether nobody has been interested in optimizing all those regexes yet. If it is the latter, I might just take a shot at it.
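As a small illustration of why /.*windows 95.*/si looks like a mistake: if the pattern is applied as an unanchored search (an assumption here, expressed as `find()` in Java), the leading and trailing .* cannot change the result and only add work. The class name and sample strings below are made up for the sketch.

```java
import java.util.regex.Pattern;

public class RedundantWildcards {
    // Assumed Java equivalent of the "/si" modifiers.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Pattern as quoted in the post.
    static final Pattern WITH_WILDCARDS = Pattern.compile(".*windows 95.*", FLAGS);

    // Same literal without the surrounding wildcards.
    static final Pattern PLAIN = Pattern.compile("windows 95", FLAGS);

    // True when both patterns give the same yes/no answer under find().
    static boolean agree(String ua) {
        return WITH_WILDCARDS.matcher(ua).find() == PLAIN.matcher(ua).find();
    }

    public static void main(String[] args) {
        // Illustrative strings only.
        String[] samples = {
            "Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)",
            ""
        };
        for (String ua : samples) {
            System.out.println("agree=" + agree(ua) + " plain=" + PLAIN.matcher(ua).find());
        }
    }
}
```

With `find()`, the plain literal gives the same answer without first consuming the whole input and backing off.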
@besief Since the free UAS database is no longer being offered, the question is where this is going anyway. As it had a Creative Commons BY license, it might be interesting to keep it going and to optimize and update the database. You might want to check your production system.
Any contributions are welcome. Pull requests that make existing patterns more accurate are appreciated. The latest free database can be found here (caution: this file is very large and should be opened with a sophisticated editor): https://github.com/before/uadetector/blob/master/modules/uadetector-resources/src/main/resources/net/sf/uadetector/resources/uas.xml
@before Do you have tests to validate that regex changes don't result in incorrect results?
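For what it is worth, one way such a check could look is a differential test: run the old and the rewritten pattern over a corpus of user agent strings and list every string where they disagree. This is only a sketch, not part of uadetector's test suite. The class name, method name, and sample corpus are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PatternRegressionCheck {
    // Returns every corpus entry on which the two patterns give different find() results.
    static List<String> disagreements(Pattern before, Pattern after, List<String> corpus) {
        List<String> result = new ArrayList<>();
        for (String ua : corpus) {
            if (before.matcher(ua).find() != after.matcher(ua).find()) {
                result.add(ua);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative corpus; a real check would use a large set of recorded user agents.
        List<String> corpus = Arrays.asList("NokiaN97-1", "NokiaE66", "NokiaN95", "Mozilla/5.0");

        Pattern before = Pattern.compile("NokiaN97|NokiaE66", Pattern.CASE_INSENSITIVE);
        // Deliberately over-broad rewrite, to show that the check catches it.
        Pattern tooBroad = Pattern.compile("Nokia[NE]\\d\\d", Pattern.CASE_INSENSITIVE);

        System.out.println(disagreements(before, tooBroad, corpus));
    }
}
```

An empty result only shows agreement on the corpus that was tested, so the corpus should be as close to production traffic as possible.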
Ha ha, us too. We ran a web vulnerability scanner against our server, and long after it had finished, CPU usage was still at 100%+. It turned out to be the issue above.