browscap-php
Parsing difference compared to PHP's get_browser
I was looking at a comparison between some browscap parsers and noticed a useragent that doesn't parse the same way in browscap-php 3.0 and PHP's native get_browser function (php 7.0.15) using the same data file (full file, version 6021).
Here's the UA:
Mozilla/3.0 (compatible; MSIE 3.0; Windows NT 5.0)
(This is probably a contrived UA. It comes from the test suite from this project: https://github.com/faisalman/ua-parser-js That said, I think this is still interesting enough to look at)
Here is the regex/pattern that browscap-php 3.0 decides to use (from browscap.org):
browser_name_regex /^mozilla\/3\.0.*\(.*compatible;.*$/
browser_name_pattern mozilla/3.0*(*compatible;*
(This results in the UA being detected as Netscape on an unknown OS)
Here's the regex/pattern that get_browser decides to use (on my machine):
browser_name_regex ~^mozilla/.* \(compatible; msie 3\.0.*; .*windows nt 5\.0.*$~
browser_name_pattern Mozilla/* (compatible; MSIE 3.0*; *Windows NT 5.0*
(This results in the UA being detected as IE 3.0 on Windows 2000)
Looking at a comparison (http://useragent.mkf.solutions/?userAgent=Mozilla%2F3.0+%28compatible%3B+MSIE+3.0%3B+Windows+NT+5.0%29) it seems that most other parsers agree with get_browser. Why does browscap-php pass on the regex that get_browser decided to use?
I should mention that browscap-php 2 DOES match get_browser's results. Here's a little table showing my results for this agent across several browscap based parsers:
| Parser | Version | Accuracy Score |
|---|---|---|
| php-get-browser | 7.0.15-6021 | 8/8 100% |
| browscap-js-1 | 1.8.6020 | 4/8 50% |
| browscap-php-2 | 2.1.1-6021 | 8/8 100% |
| browscap-php-3-full | 3.0.0-6021 | 4/8 50% |
| crossjoin-1 | 1.0.5-6021 | 4/8 50% |
| crossjoin-2 | v2.0.5-6021 | 8/8 100% |
| crossjoin-3 | v3.0.5-6021 | 8/8 100% |
The ones that don't score "100%" on this are the ones that are parsing this agent as "Netscape" instead of "IE". All parsers in that list should be using the Full file.
browscap-php 3 with the Standard file parses the same as it does with the Full file (I haven't tested get_browser with a different file type). browscap-php 3 with the Lite file doesn't parse this agent at all (as seen on the comparison link above).
The reason for this difference is the caching which was changed between v2 and v3. The v3 caching was adopted from crossjoin v1 and was ported to browscap-js.
@jaydiablo If you want this issue to be fixed you should open an issue in browscap/browscap.
@mimmi20 surely this is an issue in the library though? I haven't looked at the problem yet, but if the same INI is used, then it may be a bug...?
I'd consider it a bug in the parser, IMO.
I'm looking at the code a little bit (I'm somewhat familiar with the strategy behind Crossjoin 1.x as I helped optimize it a bit back in the day). Some hashes are generated from the useragent, starting with the first 9 characters of the agent, and moving down to just the first character, which results in these hashes for this agent:
Array
(
[0] => c0d8e87b248e62d9df520c47cfe27168
[1] => 31d050fd7a4ea6c972063ef30d18991a
[2] => dbeb1c32b66fd7717de583d999f89ec3
[3] => 13e6ce11d0a70e2a5a3df41bf11d493e
[4] => 3a4a9ff7cf86e273442bad1305f3d1fd
[5] => b70924c16a59b9cc2de329464b64118e
[6] => 89364cb625249b3d478bace02699e05d
[7] => 27c9d5187cd283f8d160ec1ed2b5ac89
[8] => 6f8f57715090da2632453988d9a1501b
[9] => d41d8cd98f00b204e9800998ecf8427e
)
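The variant list above can be approximated with a short sketch. It is written in Python rather than the library's PHP, `prefix_hashes` is a hypothetical name, and it assumes (not confirmed in this thread) that each hash is the MD5 of a lowercased prefix of the useragent. One detail can be checked independently: the last entry in the output above, d41d8cd98f00b204e9800998ecf8427e, is the MD5 of the empty string, so the list appears to end with an empty-prefix catch-all group after the single-character prefix.

```python
import hashlib


def prefix_hashes(user_agent, max_len=9):
    """Return MD5 hashes of progressively shorter prefixes, most specific first.

    Assumption: prefixes come from the lowercased useragent and are hashed with
    MD5; the real cache builder may normalise its input differently.
    """
    prefix = user_agent.lower()[:max_len]
    hashes = []
    # Walk from the full prefix down to the empty string (the catch-all group).
    for length in range(len(prefix), -1, -1):
        hashes.append(hashlib.md5(prefix[:length].encode("utf-8")).hexdigest())
    return hashes


hashes = prefix_hashes("Mozilla/3.0 (compatible; MSIE 3.0; Windows NT 5.0)")
print(len(hashes))  # one hash per prefix length from 9 down to 0
print(hashes[-1])   # MD5 of the empty string
```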
Those hashes are then used to fetch a sub-section of patterns from the INI cache, starting with the "most specific" one (the one generated from all 9 characters, which in this case is "mozilla/3") and then moving down the list if necessary to the least specific one (generated from just the character "m" in this case).
If a suitable pattern is found, the other hashes aren't even looked at.
The patterns that are found are sorted by their length, and the longest one that matches the agent wins (patterns shorter than that aren't looked at).
For this agent (and perhaps others) the issue is that the pattern that get_browser chooses is longer than the one that browscap-php 3.0 chooses, so it is technically more "suitable". It doesn't get picked by browscap-php because the pattern lives in the INI cache under the second hash above. Here's some output from a grep I did in crossjoin 1.x's data files:
ack "Mozilla\\\/\.\* \\\\\(compatible; MSIE \[\\\d\]\\\\.\[\\\d\]\.\*; \.\*Windows NT \[\\\d\]\\\\.\[\\\d\]\.\*" largebrowscap.patterns.*
largebrowscap.patterns.31
30:31d050fd7a4ea6c972063ef30d18991a 56 Mozilla\/.* \(compatible; OffByOne; Windows.*\) Webster Pro V[\d]\..* Mozilla\/.* \(.*\) AppleWebKit\/.* \(KHTML.* like Gecko\) BingPreview\/.* Mozilla\/.* \(compatible; MSIE [\d]\.[\d].*; .*Windows NT [\d]\.[\d].*Win[\d][\d]. x[\d][\d].*
35:31d050fd7a4ea6c972063ef30d18991a 51 Mozilla\/.* \(compatible; MSIE [\d]\.[\d].*; .*Windows NT [\d]\.[\d].*WOW[\d][\d].*
38:31d050fd7a4ea6c972063ef30d18991a 46 Mozilla\/.\..*\(.*Windows NT [\d]\.[\d].*Win[\d][\d]. x[\d][\d].*\).*Opera.[\d][\d]\.[\d].* Mozilla\/.\..*\(.*Opera.[\d][\d]\.[\d].*Windows NT [\d]\.[\d].*Win[\d][\d]. x[\d][\d].*\).* Mozilla\/.* \(compatible; MSIE [\d]\.[\d].*; .*Windows NT [\d]\.[\d].*
The pattern we want is that very last one on the last line. Notice that the length is "46".
The pattern that browscap-php chooses lives in the "c0d8e87b248e62d9df520c47cfe27168" hash part:
[28] => c0d8e87b248e62d9df520c47cfe27168 23 mozilla/[\d].[\d].*(.compatible;.
Notice that the length is only "23".
In this case, I don't think that "mozilla/*" should be considered less specific than "mozilla/3", but because a matching agent is found in the "mozilla/3" group, the agents in the other groups aren't considered.
I think this needs to be modified to collect all possible patterns for the hash variants, sort them by length, and then test them against the agent. Unfortunately, this may add a fair amount of processing time depending on how many agents are found.
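To make the difference between the two strategies concrete, here is a minimal Python sketch (not the library's PHP). The two patterns are the ones quoted earlier in this thread; the in-memory `GROUPS` structure, the function names, and the use of `fnmatchcase` as a stand-in for the library's regex matching are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Toy pattern cache keyed by prefix group, most specific group first.
# The grouping is simplified; the real cache stores hashed groups on disk.
GROUPS = [
    ("mozilla/3", ["mozilla/3.0*(*compatible;*"]),
    ("mozilla/", ["mozilla/* (compatible; msie 3.0*; *windows nt 5.0*"]),
]


def first_group_wins(ua):
    """Behaviour as described above: stop at the first group that has a match."""
    for _prefix, patterns in GROUPS:
        for pattern in sorted(patterns, key=len, reverse=True):
            if fnmatchcase(ua, pattern):
                return pattern
    return None


def longest_overall_wins(ua):
    """Proposed behaviour: pool every group's patterns, then sort by length."""
    pooled = [p for _prefix, patterns in GROUPS for p in patterns]
    for pattern in sorted(pooled, key=len, reverse=True):
        if fnmatchcase(ua, pattern):
            return pattern
    return None


ua = "Mozilla/3.0 (compatible; MSIE 3.0; Windows NT 5.0)".lower()
print(first_group_wins(ua))      # the short "mozilla/3.0" pattern
print(longest_overall_wins(ua))  # the longer MSIE 3.0 / Windows NT 5.0 pattern
```

The second function has to test candidates from every group, which is where the extra processing time mentioned above would come from.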
For this particular agent, the "c0d8e87b248e62d9df520c47cfe27168" group contains 34 pattern lines (some lines may contain more than one pattern), of which it would consider 27 (the other 7 are longer than the agent itself). The "31d050fd7a4ea6c972063ef30d18991a" contains 56 pattern lines, "dbeb1c32b66fd7717de583d999f89ec3" group, 21, "13e6ce11d0a70e2a5a3df41bf11d493e" group, 9, "3a4a9ff7cf86e273442bad1305f3d1fd" group, 10, "b70924c16a59b9cc2de329464b64118e" group, 9, etc...
Because of the possibility of having an asterisk within the first 9 characters of a pattern, it seems that looking at all possible hash sections is necessary.
I'll play with it a bit though and see what I find.
@asgrim The difference is in how the cache is created and used. In browscap v2 the rules are sorted by length only. In browscap v3 we sort first by the part up to the first space or regex placeholder, and second by the complete length.
So if I modify the parser to find all available patterns that have the right hash + length, sort those by length, and then compare, it does find the right pattern for this particular user-agent, and all of the browscap (6021) tests pass against it, except for one:
CaptiveNetworkSupport-324 wispr
The tests define and browscap-php 3 parses this as: "CaptiveNetworkAgent" on an ios device. This is the pattern selected:
captivenetworksupport*
My modified browscap-php-3 parser parses it as "WISPr" on an unknown platform, using this pattern:
*captivenetworksupport*wispr*
For the record, PHP's get_browser parses this agent in the same way as my modified browscap-php 3.
IMO, the modified browscap-php 3 and get_browser are correct here.
Unfortunately, the modified parser is ~14 seconds slower (41 seconds compared to 27 seconds) running against the browscap test suite than the un-modified parser (on my computer). There may be more optimizations that could be done here.
I suppose that raises the question though, which is the "correct" behavior? Should get_browser and browscap-php 3 return the same results for the same useragent?
If possible, and if we don't become too slow, the results should be the same.
I was curious how often this might be happening with other user-agents outside of the browscap test suite.
I ran ~47,000 useragents (that live in other parsers' test suites) against browscap-php 3, my modified version of browscap-php 3 and also php 7.0.15's get_browser just to see how many get parsed differently by the 3 different implementations.
- My modified version of browscap-php parses 99 of those differently than browscap-php.
- get_browser parses 215 of them differently than browscap-php.
- get_browser parses 116 of them differently than my modified browscap-php (which seems interesting).
Here's a CSV that has the 99 agents that my modified version parsed differently than the official version of browscap-php. It has the selected pattern for each parser for each useragent in the list. For the most part I think the modified version picks a better pattern for the supplied useragent, but this isn't always the case, particularly for ones that begin with something other than "Mozilla" and the modified version selects a pattern starting with "*".
I noticed that there are several patterns that the modified version selects that begin with an asterisk that perhaps don't really need to, like these ones:
| browscap-php | modified |
|---|---|
| mozilla/5.0 (*linux*) applewebkit/*(khtml* like gecko) *version/4.0*safari/* | *mozilla/5.0 (*linux*android?4.1*) applewebkit/*khtml* like gecko) version/4.0*safari* |
| mozilla/5.0 (*linux*) applewebkit/*(khtml* like gecko) *version/4.0*safari/* | *mozilla/5.0 (*linux*android?2.3*) applewebkit/*khtml* like gecko) version/4.0*safari* |
| mozilla/5.0 (*linux*) applewebkit/*(khtml* like gecko) *version/4.0*safari/* | *mozilla/5.0 (*linux*android?2.2*) applewebkit/*khtml* like gecko) version/4.0*safari* |
| mozilla/5.0 (*linux*) applewebkit/*(khtml* like gecko) *version/4.0*safari/* | *mozilla/5.0 (*linux*android?4.0*) applewebkit/*khtml* like gecko) version/4.0*safari* |
Perhaps they were added to deal with some odd useragents that don't start with "Mozilla" but still contain the rest of the pattern, I haven't looked at the history of the pattern files at all for insight into this.
At the moment my modified version is ~12 seconds slower (average of 5 iterations) in parsing all of the useragents from the browscap test suite. I haven't looked into modifying the storage of the patterns in the INI cache at all; the only changes I've made so far have been to the actual parsing phase. Essentially, as I described earlier, it grabs all potential patterns for the useragent (using the same prefix/variant hashing that is already being done) but then sorts all of them by length before testing any of the patterns against the agent. Right now, browscap-php will sort then test all patterns from each variant hash before moving on to the next set of patterns from the next, less specific, variant.
Here's a CSV that has the useragents that get_browser parsed differently than my modified version of browscap-php:
modified_vs_get_browser.csv.zip
Some of that seems like just an ordering difference (like choosing to use *goog* vs. using *java*). These ones stood out to me though:
| UserAgent | modified browscap-php | get_browser |
|---|---|---|
| bingbot | * | bingbot |
| 360Spider | * | 360spider |
| EasouSpider | * | easouspider |
| itunes | * | itunes |
| jBrowser | * | jbrowser |
| Netscape 4.0 | * | netscape 4.0 |
| Netscape 4.79 | * | netscape 4.79 |
| Netscape 7.2 | * | netscape 7.2 |
| w3m | * | w3m |
I looked at the different patterns that were considered for "itunes" as an example, and this is what was spit out:
.*robots.*
.*iphone.*
.*libwww.*
.*larbin.*
.*naver.*
.*nokia.*
.*squid.*
.*amiga.*
.*nutch.*
.*java.*
.*mic\/.*
.*ipod.*
.*grub.*
.*ipad.*
.*zeus.*
.*
It seems like that "itunes" pattern just doesn't exist at all in the INI cache. I do believe it's in the INI file though:
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; iTunes
[iTunes]
Parent="DefaultProperties"
Comment="iTunes"
Browser="iTunes"
Browser_Type="Application"
Browser_Bits="32"
Browser_Maker="Apple Inc"
Platform="MacOSX"
Platform_Version=10
Ah, actually, now I see where that is stripped:
https://github.com/browscap/browscap-php/blob/master/src/IniParser/IniParser.php#L149
So that's probably expected (but different behavior from get_browser). I assume what browscap-php is doing here is the expected behavior?
Patterns like itunes should not match anything because they are not real user agents. These patterns are there to group properties together to keep the INI file small.
See also issue browscap/browscap#3.
@mimmi20 Thanks for the clarification on that, safe to ignore those ones then.
As for the rest that get_browser parses differently, if I modify the ordering of the patterns slightly I can get the list down to 18.
modified_vs_get_browser_updated.csv.zip
The sorting of the patterns was originally "length" descending, "specificity of the variant hash" ascending, then "position in file" ascending.
Changing the last two to descending instead of ascending gets me to that count of 18. I suspect that if I change the order of the grouped patterns (ones that appear on the same line in the "file") that I may be able to get these to match up correctly, not certain yet.
Question though on ordering. I'm digging through the browscap.ini that both libraries are using and am curious where these different patterns appear, to try and determine which library is parsing as expected.
So for the first agent in that CSV file, my modified browscap-php uses this pattern:
mozilla/?.? (compatible*; msie 4.0*mac_powerpc*
Which appears on line 1,580,486 of the browscap FULL file (6021). It occurs in the "[IE Mac]" group which starts on line 1,580,462.
get_browser uses this pattern:
mozilla/* (compatible; msie 4.01*;*mac_powerpc*
Which appears on line 1,578,555 in browscap.ini. It appears in the "[IE 4.01]" group which starts on line 1,578,528.
The second agent in that list, browscap-php uses this pattern:
mozilla/5.0 (*linux*android?2.3*htc?desire* build/*) applewebkit/* (khtml,*like gecko*) version/4.0*safari* which appears on line 1,339,448.
get_browser uses this pattern: mozilla/5.0 (*linux*android?2.3*desire hd build/*) applewebkit/* (khtml,*like gecko*) version/4.0*safari* which appears on line 1,337,333.
Anyhow, I've modified the CSV to include the browscap.ini line number for each pattern used.
modified_vs_get_browser_with_lines.zip
In every case get_browser chooses a pattern that occurs earlier in the file than the pattern that browscap-php chooses.
What is the expected behavior here? Is the ordering of the file significant?
browscap-php doesn't seem to apply any particular ordering to patterns (when building the INI cache) that have the same length+hash, so I assume that they're being stored in the same order they were read from the INI file, but I haven't confirmed that yet.
I should note that running this modified version of browscap-php (even with the sorting changes) results in just one agent not being parsed the same (that "wispr" one) as the official browscap-php.
@jaydiablo For all these useragents I prefer the patterns that get_browser found, but when I look at the differences between these patterns I think a cleanup is required. For example, all patterns for IE on Mac should be in the group [IE Mac].
Could you open an issue in browscap/browscap to clean up the patterns a little bit, and add the 18 useragents there?
The *netfront/3.1*/*netfront/3.5* ones are somewhat interesting and probably not something that can be fixed with any sort of ordering, since one of the optimizations in the pattern cache is to turn digits into [\d]. However, the result could be ambiguous depending on which Netfront we're trying to capture. I can make browscap-php act like PHP's get_browser by taking away the greediness of the ".*" operator when matching, which makes it select the first Netfront/[\d].[\d] in the string rather than the second (which is what is happening right now).
Just as a reminder, here's one of those useragents:
NetFront/3.5.1(BREW 3.1.5; U; en-us; SAMSUNG; NetFront/3.1.5/WAP) Sprint M350 MMP/2.0 Profile/MIDP-2.1 Configuration/CLDC-1.1
browscap-php in its current state matches the second NetFront in that string, but if we take away the greediness, it matches the first. So that could potentially "fix" those UA's, but may have unintended consequences elsewhere (it doesn't change the pass/failure of the browscap 6021 test suite though).
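The greedy versus non-greedy effect can be reproduced in isolation. This is a Python sketch with a simplified stand-in regex (a capture group is added so the matched version is visible); it is not the regex the library actually builds.

```python
import re

ua = ("NetFront/3.5.1(BREW 3.1.5; U; en-us; SAMSUNG; NetFront/3.1.5/WAP) "
      "Sprint M350 MMP/2.0 Profile/MIDP-2.1 Configuration/CLDC-1.1").lower()

# Simplified stand-in for a digit-compressed "*netfront/<digit>.<digit>*" pattern.
greedy = re.compile(r"^.*netfront/(\d\.\d).*$")
lazy = re.compile(r"^.*?netfront/(\d\.\d).*$")

greedy_version = greedy.match(ua).group(1)
lazy_version = lazy.match(ua).group(1)

print(greedy_version)  # 3.1: the leading .* backtracks only as far as the last "netfront/"
print(lazy_version)    # 3.5: the leading .*? stops at the first "netfront/"
```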
Now, if the *netfront/3.1* pattern occurred earlier in the INI file, get_browser would probably be matching with that pattern rather than the *netfront/3.5* one that it is using currently. I don't think there's a way in browscap-php to deal with both of those situations in the same way, since those digits are removed from our available patterns during the pattern/ini cache building step.
With a slight ordering change on the storage of the pattern cache and this change in the greediness of the .* regex pattern when checking if the pattern returns results, I have the "failing" agents (ones that parse differently from get_browser) down to 6.
@mimmi20 Sure, will do.
I actually hadn't re-run the full list of agents since making some of the sorting tweaks (btw, the sorting that I mention I had to change from ascending to descending, that wasn't actually the case, just an error in how I had them setup initially). The list is at 108 right now (with some of those invalid ones removed). Most of them are of this variety:
Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-5/30.2.004; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/533.4 (KHTML, like Gecko) Safari/525
Which has browscap-php selecting this pattern:
mozilla/5.0 (*symbianos*) applewebkit/* (khtml* like gecko)*safari*
And get_browser selecting this pattern:
mozilla/* (symbianos/*) applewebkit/* (khtml,*like gecko*) safari/*
Which I think is to be expected in this case, since the hash segmentation that browscap-php does would give higher priority to "mozilla/5.0" than it would to "mozilla/*". That said, I think these two patterns may qualify for cleanup, as @mimmi20 suggested, since they're practically identical.
get_browser picks the second pattern because it occurs earlier in the browscap.ini file than the other one does.
Here's the CSV as things stand:
modified_vs_get_browser.csv.zip
I'll extract some of these Useragents and create a ticket for browscap/browscap to see if cleaning them up makes sense.
Aside from those Symbian agents that I already looked at, here's my analysis as to why browscap-php selects a different pattern for the other user agents in that list:
| Agent | browscap-php pattern | get_browser pattern | Analysis |
|---|---|---|---|
| Windows-RSS-Platform/2.0 (MSIE 9.0; Windows NT 6.0) | windows-rss-platform/2.0 (msie ?.0; windows nt *.*) | windows-rss-platform/2.0 (msie *; *windows nt 6.0*) | The one selected by get_browser occurs earlier in the file, but the lengths of these two patterns aren't the same (in browscap-php's eyes) since the length is calculated after the asterisks are removed. The pattern browscap-php selected is one character longer when the asterisks are removed. |
| Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 6.8) SAMSUNG-SGH-i601/WM534 | mozilla/4.0 (compatible; msie 6.0; windows ce; iemobile?6.*)* | mozilla/4.0 (compatible; msie 6.*; windows ce; iemobile?6.8*)* | Same issue as the one above (there are 13 agents in the CSV that use these two patterns, so I'm omitting them). |
| Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; HUAWEI T8600 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) FlyFlow/1.0 Version/4.0 Mobile Safari/533.1 | mozilla/5.0 (*linux*android?2.2* build/*)*applewebkit/*(*khtml,*like gecko*)*version/4.0*safari* | mozilla/5.0 (*linux*android*) applewebkit/* (khtml,*like gecko*) flyflow/* version/*safari/* | Even without removing the asterisks the one that browscap-php selects is longer (and it still is when the asterisks are removed). It does occur later in the file, so it seems that get_browser does no sorting by length and relies solely on the position in the INI file? |
| Mozilla/5.0 (Linux; U; Android 2.3.3; de-de; HTCS510e/1.0 Android/2.2 release/06.23.2010 Browser/WAP 2.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1 | mozilla/5.0 (*linux*android?2.3* build/*)*applewebkit/*(*khtml,*like gecko*)*version/4.0*safari* | mozilla/5.0 (*linux*android?2.2* build/*)*applewebkit/*(*khtml,*like gecko*)*version/4.0*safari* | Lengths are the same, and the one that get_browser selects occurs earlier in the file, so why did browscap-php select the other one? The only difference here is the version (2.2 vs. 2.3). The useragent itself has both of those versions in it, which I think triggers the greedy vs. not-greedy match that I mentioned earlier. If I were to switch my copy of browscap-php back to greedy matching, I assume this agent wouldn't show up (but those Netfront ones would). As usual, the one that get_browser selected occurs earlier in the file. |
| Mozilla/5.0 (Linux; U; Android 2.3.7; de-de; HTC DESIRE HD Build/GRI40; SUNDAWG CM7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1 | mozilla/5.0 (*linux*android?2.3*htc?desire* build/*) applewebkit/* (khtml,*like gecko*) version/4.0*safari* | mozilla/5.0 (*linux*android?2.3*desire hd build/*) applewebkit/* (khtml,*like gecko*) version/4.0*safari* | The one that browscap-php selected is one character longer after asterisks are removed, even though it does occur later in the INI file. |
| Mozilla/5.0 (Linux; Android 3.1; pt-PT; MZ606 Build/UMWB8E) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19 | mozilla/5.0 (*linux*android?3.1*) applewebkit/* (khtml* like gecko) chrome/*safari/* | mozilla/5.0 (*linux*android*) applewebkit/* (khtml* like gecko) chrome/18.*safari/* | Again, the one browscap-php selected is 1 character longer even though it's later in the file. |
| Mozilla/5.0 (Linux; Android 3.2; en-US; SHW-M305W Build/P2FHU4) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19 | mozilla/5.0 (*linux*android?3.2*) applewebkit/* (khtml* like gecko) chrome/*safari/* | mozilla/5.0 (*linux*android*) applewebkit/* (khtml* like gecko) chrome/18.*safari/* | Nearly identical situation to the one above, just a different version of Android. |
I don't think any code fixes in the parser will fix these agents since it seems like more of a fundamental difference in how the two parsers interpret the file (aside from the greedy/not-greedy matching one, which should probably come down to performance, which I'll try to look at). Re-ordering of some of these in the browscap.ini file itself would probably fix them (so that the "longer" patterns appear earlier in the file), but I'm not sure how the order in the INI file is determined currently, so not sure if that violates something else.
After some more testing, and a bit of a look at the php source for the get_browser function I've got browscap-php parsing that list of useragents as close as I think I can make it without incurring too much of a penalty on performance (though there is a penalty).
get_browser appears to go through the INI file in order, and tries to quickly exclude any patterns that it can, but it doesn't stop when it does find a pattern (unless that pattern is an identical match to the agent, not using regex patterns at all (https://github.com/php/php-src/blob/c8aa6f3a9a3d2c114d0c5e0c9fdd0a465dbb54a5/ext/standard/browscap.c#L596)). Instead, it continues through the file and if it finds another match then it compares the length of the two patterns. The pattern that is longer with the two placeholder characters ("?" and "*") removed is preferred (https://github.com/php/php-src/blob/c8aa6f3a9a3d2c114d0c5e0c9fdd0a465dbb54a5/ext/standard/browscap.c#L614), and that continues until the pattern list is exhausted (it may short circuit if any subsequent patterns are shorter, without actually running regular expression matches against them once it has a match, I didn't dig much further).
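The selection rule described above can be sketched as follows. This is a Python approximation of the behaviour as I read it from the description, not a port of browscap.c: `fnmatchcase` stands in for the real regex matching, the exact-match short circuit is omitted, and the function names are hypothetical. The two patterns and the useragent are ones quoted elsewhere in this thread.

```python
from fnmatch import fnmatchcase


def significant_length(pattern):
    """Length of the pattern with the '?' and '*' placeholders removed."""
    return len(pattern.replace("*", "").replace("?", ""))


def pick_like_get_browser(ua, patterns_in_file_order):
    """Scan every pattern in file order; a later match replaces the current one
    only if it is strictly longer once placeholders are removed, so on a tie
    the pattern that appears earlier in the file is kept."""
    best = None
    for pattern in patterns_in_file_order:
        if not fnmatchcase(ua, pattern):
            continue
        if best is None or significant_length(pattern) > significant_length(best):
            best = pattern
    return best


ua = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 6.8) PPC; 240x320; "
      "MDA Vario/1.3 Profile/MIDP-2.0 Configuration/CLDC-1.1").lower()
patterns = [
    "mozilla/4.0 (compatible; msie 6.*; windows ce; iemobile?6.8*)*",  # earlier in file
    "mozilla/4.0 (compatible; msie 6.0; windows ce; iemobile?6.*)*",   # later in file
]
print([significant_length(p) for p in patterns])  # both lengths are equal
print(pick_like_get_browser(ua, patterns))        # the earlier pattern is kept
```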
browscap-php (from crossjoin) seems to simulate this by sorting the patterns first by length and then by the length with those two characters removed (only if they had the same length).
https://github.com/browscap/browscap-php/blob/master/src/IniParser/IniParser.php#L275
This, however, doesn't take into account the patterns' position in the INI file, which we've seen can cause quite a bit of differences when we compare the output of the two parsers.
To combat this, I've tried to come up with a way to retain the position in the INI file, but also be able to group and sort by pattern length (with and without the characters removed).
The method I have in place has the difference list down to 16 strings. 3 of them are greedy/non-greedy matching issues with version numbers that can't really be resolved unless the digit replacement stuff is removed from browscap-php (this is a really nice optimization though, and removing it slows down the parser quite a bit for certain useragent strings).
The other 13 are all matched by the same 2 patterns (one in browscap-php and the other in get_browser).
Here's one of those strings and the two patterns that are selected by the two parsers:
Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 6.8) PPC; 240x320; MDA Vario/1.3 Profile/MIDP-2.0 Configuration/CLDC-1.1
browscap-php's pattern: mozilla/4.0 (compatible; msie 6.0; windows ce; iemobile?6.*)*
get_browser's pattern: mozilla/4.0 (compatible; msie 6.*; windows ce; iemobile?6.8*)*
The pattern that get_browser uses is one character longer than the one that my test browscap-php branch selects, and it appears earlier in the browscap.ini file.
However, with the special characters removed, they're both 58 characters long, so it seems like we should fall back to the position in the INI file being the reason that one gets picked over the other. However, that didn't happen in this case, why?
It turns out that the digit replacement that browscap-php does is the reason for the mismatch. Earlier in the browscap.ini file (before the pattern that get_browser selected) there's this pattern:
mozilla/4.0 (compatible; msie 6.0; windows ce; iemobile?8.*)*
Notice that this pattern is identical to the one that browscap-php selected in every way except for that "8" at the end (the one browscap-php selected has a "6" in that spot). However, because of the way that browscap-php removes and later replaces digits, both of these patterns ultimately become this:
mozilla\/[\d]\.[\d] \(compatible; msie [\d]\.[\d]; windows ce; iemobile.[\d]\..*\).*
Which actually "promotes" the one with a 6 in that spot to an "earlier" spot in the INI file (since duplicates after the digit "compression" aren't stored twice: https://github.com/browscap/browscap-php/blob/master/src/IniParser/IniParser.php#L200) than the one that get_browser selected. Earlier INI file position wins.
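A small Python sketch of why the digit replacement changes the effective file position. It only replaces digits and de-duplicates on the result; the real cache builder also converts wildcards to regex syntax, and `compress_digits` and `build_cache` are hypothetical names.

```python
import re


def compress_digits(pattern):
    """Replace every digit with a digit-class placeholder (simplified)."""
    return re.sub(r"\d", lambda _match: r"[\d]", pattern)


def build_cache(patterns_in_file_order):
    """Keep only the first file position seen for each compressed pattern."""
    cache = {}
    for position, pattern in enumerate(patterns_in_file_order):
        cache.setdefault(compress_digits(pattern), position)
    return cache


patterns = [
    "mozilla/4.0 (compatible; msie 6.0; windows ce; iemobile?8.*)*",   # earliest in file
    "mozilla/4.0 (compatible; msie 6.*; windows ce; iemobile?6.8*)*",  # get_browser's choice
    "mozilla/4.0 (compatible; msie 6.0; windows ce; iemobile?6.*)*",   # browscap-php's choice
]
cache = build_cache(patterns)

# The first and third patterns collapse to the same key, so the third one
# effectively inherits position 0, ahead of get_browser's choice at position 1.
print(compress_digits(patterns[0]) == compress_digits(patterns[2]))
print(cache[compress_digits(patterns[2])], cache[compress_digits(patterns[1])])
```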
I don't think there's an easy way to deal with this case while keeping the digit replacement stuff in place. That said, it seems like this situation is pretty rare, and could possibly be resolved by ordering the INI file differently.
Anyhow, the performance of this branch as it stands is a bit slower than some of the previous tests I've done (which were themselves slower than the current 3.0 release of browscap-php), so I guess a decision would have to be made for "accuracy" vs. performance here. I'm planning to publish the different branches I've tested with to my fork; once I do that I'll reference them in this issue.
This is the branch that I've put together that has the work I've done to try and get browscap-php to match get_browser.
https://github.com/jaydiablo/browscap-php/tree/accuracy-test-dual-length
As mentioned above, there are still 16 agents of that ~47,000 list that parse differently, due to the reasons mentioned in my last comment.
Performance wise, it processes the agents from the browscap/browscap test suite in 52 seconds compared to 27 seconds. I've tried different techniques and optimizations to get this to be faster without affecting the "accuracy" but I think I've reached my limit (if anybody sees anything that looks like a performance win, let me know!).
Here's a profile on blackfire.io: https://blackfire.io/profiles/fe225dfd-c131-4e64-8d29-eeab4ba990aa/graph
Here's a profile of browscap-php 3.0 unmodified for comparison: https://blackfire.io/profiles/b529ebf3-c746-4d7a-a5f2-aaab254ba0a6/graph
(both of those are against just 100 useragents)
Here's a diff of the useragents that browscap-php parses differently than get_browser vs. the agents that my branch parses differently than get_browser (my branch on the right):
https://www.diffchecker.com/nWegCJ5L
And FWIW, get_browser (PHP 7.0.15+, anything earlier isn't worth mentioning since it's so slow) takes 113 seconds to parse the agents in the browscap test suite, so still a win there.
I have some other tests that I've done in different branches, I'll probably publish those as well just to see if maybe there's a middle ground for performance vs. accuracy.
Ultimately it's up to the maintainers which direction this should go. IMHO, it's a parser bug to prefer a shorter pattern because it doesn't contain a "*" or "?" in the first ~10 characters when there are perhaps more suitable patterns (i.e. longer) that do have one of those characters early on in the pattern.
Being more interchangeable with get_browser may also be attractive, especially since the performance improvements in recent PHP versions have made it more of a viable option than before (but still slower ;) ).
Anyhow, let me know what you think, especially if you see any improvements on the performance front.
I've been able to shave another ~10 seconds off that parse time for the browscap test suite agents, which puts this branch at about 43 seconds vs. the original's 27 seconds. Memory use was a bit lower still, but not by much.
I'll clean up and publish soon.
This branch (https://github.com/jaydiablo/browscap-php/tree/accuracy-test-dual-length-word-filter) extends on the work I did earlier to get the accuracy of the parser down to only 13 differing useragents (same as before).
It adds a word filtering step that quickly reduces the applicable pattern list by extracting the longest word (with a few exceptions like "mozilla" and "android") from the pattern at cache creation time and using strpos to check if the useragent contains that word when it's collecting the patterns from the cache. The patterns are now grouped by hash, sortLength, minLength and longest word so it's possible to ignore an entire group of patterns if they all share the same word and the user agent does not contain that word.
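The idea behind the word filter can be sketched like this. It is a Python illustration, not the branch's code: it assumes a "word" is a run of ASCII letters, it filters pattern by pattern instead of skipping whole cache groups, and the names are hypothetical. The substring test plays the role that strpos plays in the description above.

```python
import re

# The two exceptions named above; the branch may exclude more words than these.
IGNORED_WORDS = {"mozilla", "android"}


def longest_word(pattern):
    """Longest alphabetic run in the pattern, skipping the ignored words."""
    words = [w for w in re.findall(r"[a-z]+", pattern.lower())
             if w not in IGNORED_WORDS]
    return max(words, key=len) if words else ""


def candidate_patterns(ua, patterns):
    """Cheap pre-filter: keep patterns whose longest word occurs in the useragent.

    Patterns without any usable word are always kept, because the empty
    string is a substring of every useragent.
    """
    ua = ua.lower()
    return [p for p in patterns if longest_word(p) in ua]


patterns = [
    "mozilla/5.0 (*symbianos*) applewebkit/* (khtml* like gecko)*safari*",
    "*netfront/3.5*",
    "*",
]
ua = ("Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-5/30.2.004; "
      "Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/533.4 "
      "(KHTML, like Gecko) Safari/525")
print(candidate_patterns(ua, patterns))  # the netfront pattern is skipped without a regex test
```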
With this particular optimization in place I can get the parsing to just +6 seconds over the original browscap-php (~27 seconds compared to ~33 seconds), that said, the original could probably benefit from the same optimization, which would move the goal posts a bit.
I tried bumping this to two words, but that ultimately causes the patterns to spread out among more groups which increases the amount of work that the parser has to do when it's collecting the patterns (I've tried other variations too, by extracting all words > 4 characters for example, but the segmentation of the patterns costs more than the word filtering benefits it seems). Two words increased the parse time by a couple of seconds.
I'll need to benchmark this branch on a magnetic disk computer, as it may have increased file reads compared to before, which my SSD drive may be hiding the impact of.
I ran all three (get_browser, browscap-php 3.0 and my branch of browscap-php) against 50,000 unique "real world" useragents (from our logs over past 30 days).
browscap-php 3 parsed 94 differently than get_browser did.
My branch parsed 1 differently than get_browser did.
kindle
Time taken:
- get_browser: 904.566s
- browscap-php 3.0: 302.514s
- my branch of browscap-php: 327.988s
Are these useragents part of the list you posted before?
@mimmi20 No, this is a different list, the 50,000 was taken from our server logs for the last 30 days. The other ~47,000 list was taken from test suites of other parsers (including browscap's).
@jaydiablo May we close this issue?
I'm not sure, without modifying the parser (which hasn't been done), the issue will always remain, and would require constant maintenance to compare browscap-php and get_browser on new useragents as they're found.
I don't know off-hand how often this happens with current useragents, but I could look at some comparisons of recent lists that I have collected.