Commercial model returned when parsing User Agent string

Open csanclop opened this issue 5 years ago • 17 comments

Dear Matomo Team,

I have run into some strange behaviour with version 0.10 of device_detector, the Python port of the library.

When I parse the following user-agent string: Mozilla/5.0 (Linux; Android 6.0.1; SM-G532G Build/MMB29T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36

  • When I retrieve the model field, I get the commercial model name: GALAXY J2 Prime
  • What I expected to see is the raw model found in the user-agent string: SM-G532G

The following is the code I run to perform the test:

[two images of the test code, not recovered]

Thank you very much in advance!

Kind regards.

csanclop avatar May 01 '20 15:05 csanclop

It's currently not possible to return the raw model if a commercial model is defined.

sgiehl avatar May 01 '20 18:05 sgiehl

Dear @sgiehl and @mattab, would the Matomo team consider a change request on the product roadmap to add another field, named key_model, holding the raw value from the user agent (SM-G532G)? In our use case we want to identify models by this raw value. If so, could I contribute a pull request to the original Matomo project to add this feature?

csanclop avatar May 04 '20 07:05 csanclop

That wouldn't be too easy to implement. Many devices use their "commercial model" in the useragent. Some have multiple model versions that we group together under the same "commercial model", e.g. https://github.com/matomo-org/device-detector/blob/e012536928a9632efafebc01eeac0da4258b4468/regexes/device/mobiles.yml#L9759-L9760. Splitting those up would end up in a lot more detection rules.

What exactly do you need those raw model names for?

sgiehl avatar May 04 '20 11:05 sgiehl

Dear @sgiehl ,

We need the raw models because we are trying to correlate device data from the TAC databases used by telecom carriers (like http://tacdb.osmocom.org/ or https://imeidb.gsma.com/imei/index#) with the information in the useragent string. We cannot correlate and compare the data because these databases name models differently.

In the example you provided I do see a mix of raw models and commercial models in the user-agent string. So one problem is that we are dealing with unstructured data. Another problem is that the useragent is not standardized, so some vendors, like Apple, make it difficult to identify the exact iPhone model version.

Regarding the performance issue you mention: you say that splitting these regex rules one-to-one, in order to return the raw model, would lead to a lot more detection rules. But what is the overhead in terms of performance? I understand that with a single pattern the string is parsed only once, but I am not sure about the order of magnitude of the extra time cost. Even with a single pattern, the input string still has to be compared against every raw model: regex: '(?:SAMSUNG-)?(?:GT-I9500|GT-I9502|GT-I9505|SCH-I545|SCH-I959|SCH-R970|GALAXY-S4|SGH-M919N?)'

Could this substantially affect the performance of the application?
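One way to get a feel for the overhead is simply to measure it. The following is only an illustrative Python sketch, not the library's own matching code: it compares the single combined rule quoted above with the same alternatives split into one pattern each. The user agent and the function names are made up for the example, and the timings depend on the machine, so none are quoted here.

```python
import re
import timeit

# Alternatives taken from the rule quoted above.
MODELS = ["GT-I9500", "GT-I9502", "GT-I9505", "SCH-I545",
          "SCH-I959", "SCH-R970", "GALAXY-S4", "SGH-M919N?"]

# One combined rule versus one rule per raw model.
COMBINED = re.compile("(?:SAMSUNG-)?(?:" + "|".join(MODELS) + ")")
SPLIT = [re.compile("(?:SAMSUNG-)?(?:" + model + ")") for model in MODELS]

# Hypothetical user agent, for illustration only.
UA = "Mozilla/5.0 (Linux; Android 4.4.2; SCH-R970 Build/KOT49H) AppleWebKit/537.36"


def match_combined(user_agent):
    """True if the single combined pattern matches."""
    return COMBINED.search(user_agent) is not None


def match_split(user_agent):
    """True if any of the per-model patterns matches."""
    return any(pattern.search(user_agent) for pattern in SPLIT)


if __name__ == "__main__":
    for fn in (match_combined, match_split):
        seconds = timeit.timeit(lambda: fn(UA), number=20_000)
        print(fn.__name__, round(seconds, 4))
```

Both functions give the same yes/no answer; only the time per lookup differs, and a real decision would need measurements against the full rule set.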

Thank you in advance Stefan!

Kind regards.

csanclop avatar May 08 '20 21:05 csanclop

regex: '(?:SAMSUNG-)?(?:GT-I9500|GT-I9502|GT-I9505|SCH-I545|SCH-I959|SCH-R970|GALAXY-S4|SGH-M919N?)'

If you would rewrite this rule to (?:SAMSUNG-)?(GT-I9500|GT-I9502|GT-I9505|SCH-I545|SCH-I959|SCH-R970|GALAXY-S4|SGH-M919N?) you could use $1 to catch the raw model without splitting the rule.
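A minimal Python sketch of the idea (Python because the report concerns the Python port). The pattern is the rewritten rule above; the sample user agent and the `raw_model` helper are made up for illustration and are not part of the library.

```python
import re

# Rule from the thread, with the model alternation turned into capture group 1.
RULE = re.compile(
    r"(?:SAMSUNG-)?(GT-I9500|GT-I9502|GT-I9505|SCH-I545|SCH-I959|SCH-R970|GALAXY-S4|SGH-M919N?)"
)


def raw_model(user_agent):
    """Return the text captured by group 1 ($1), or None if the rule does not match."""
    match = RULE.search(user_agent)
    return match.group(1) if match else None


# Hypothetical user agent, for illustration only.
ua = "Mozilla/5.0 (Linux; Android 4.4.2; GT-I9505 Build/KOT49H) AppleWebKit/537.36"
print(raw_model(ua))  # GT-I9505
```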

mimmi20 avatar May 09 '20 10:05 mimmi20

Exactly, @mimmi20. It is not necessary to split the rule, because the raw model is available as $1. We can create a new field named 'raw_model' and assign $1 to it, while of course maintaining the current 'model' field generated from the regular expression.

The question now is: does the Matomo team think this feature would be useful to other users of the library, so that 'raw_model' could be added to the official Matomo repository? Or is it better that I develop it in a fork for now?

Thank you!

csanclop avatar May 09 '20 12:05 csanclop

I'm not against returning an additional field that contains the actual match. But I think it will be a lot of work to go through all detections and adjust them so that $1 always contains the raw model. Also, there might be cases where the regex doesn't match the full raw model, because the useragent contains additional characters that are not in the regex. For example, GT-I9500 also matches on GT-I9500A.
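The partial-match case can be reproduced with a short Python sketch. It applies a shortened version of the rule as a bare pattern and ignores any extra wrapping the library may add around its rules; the user agent and the helper name are hypothetical.

```python
import re

# Shortened version of the rule, with the model as capture group 1.
RULE = re.compile(r"(?:SAMSUNG-)?(GT-I9500|GT-I9502|GT-I9505)")


def captured_and_next(user_agent):
    """Return ($1, the character right after the match), or None without a match."""
    match = RULE.search(user_agent)
    if match is None:
        return None
    return match.group(1), user_agent[match.end():match.end() + 1]


# Hypothetical user agent whose model carries an extra suffix letter.
ua = "Mozilla/5.0 (Linux; Android 4.4.2; GT-I9500A Build/KOT49H)"
print(captured_and_next(ua))  # ('GT-I9500', 'A')
```

The captured value stops at `GT-I9500` and the trailing `A` is left out, so $1 would not be the full raw model here.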

sgiehl avatar May 09 '20 12:05 sgiehl

OK @sgiehl and @mimmi20, if I understood correctly, the actual match ($1) is currently not returned by the library (it is only an internal variable). Would it be possible to modify the code in a fork so that it returns $1, without modifying any rule or creating any new field? Is that correct?

Thank you!

csanclop avatar May 09 '20 14:05 csanclop

I have another suggestion that avoids editing thousands of regular expressions and tests:

you can create an additional method for browsers, but you will have to write your own regular expression for all browsers

example
----------- version ---- lang ------- hardware name
Android (?:[\d.]+;)\s?(?:[^;]+;)?\s?([^\.\)]+)(?: Build.+|\)) AppleWebKit

test ua Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; PBCT10 Build/OPM1.171019.011) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 MQQBrowser/8.9 Mobile Safari/537.36
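As a rough check, the example expression can be run against the test user agent in Python. This is only a sketch; the `hardware_name` helper is a made-up name and not part of the library or the prototype.

```python
import re

# The example expression from above; group 1 is the hardware name.
HARDWARE_NAME = re.compile(
    r"Android (?:[\d.]+;)\s?(?:[^;]+;)?\s?([^\.\)]+)(?: Build.+|\)) AppleWebKit"
)


def hardware_name(user_agent):
    """Return the captured hardware name, or None if the pattern does not match."""
    match = HARDWARE_NAME.search(user_agent)
    return match.group(1) if match else None


TEST_UA = (
    "Mozilla/5.0 (Linux; U; Android 8.1.0; zh-cn; PBCT10 Build/OPM1.171019.011) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.132 "
    "MQQBrowser/8.9 Mobile Safari/537.36"
)
print(hardware_name(TEST_UA))  # PBCT10
```

On the user agent from the original report, the same expression captures `SM-G532G Build/MMB29T`, because that build ID contains no dot for the character class to stop at, so the expression would need tightening before real use.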

sanchezzzhak avatar May 10 '20 21:05 sanchezzzhak

@csanclop No, you would need to go through all the regexes and check them. Many of them do not include any specific matches, or capture only a part that is used to extend the model, ...

See https://github.com/matomo-org/device-detector/blob/e012536928a9632efafebc01eeac0da4258b4468/regexes/device/mobiles.yml#L26-L29

https://github.com/matomo-org/device-detector/blob/e012536928a9632efafebc01eeac0da4258b4468/regexes/device/mobiles.yml#L424-L426

https://github.com/matomo-org/device-detector/blob/e012536928a9632efafebc01eeac0da4258b4468/regexes/device/mobiles.yml#L4867-L4868

Those, and a lot more, would need to be adjusted so that $1 returns the raw model and the other matches are used to build the model.
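To illustrate why $1 cannot simply be returned as-is, here is a small Python sketch of a rule whose capture group only supplies part of the model name. The rule, the "Acme" device and the `build_model` helper are all hypothetical; they are not copied from mobiles.yml or from the library code.

```python
import re

# Hypothetical rule in the style of the YAML device rules.
RULE = {"regex": r"Acme[ _-]?Phone[ _-]?(\d+)", "model": "Phone $1"}


def build_model(rule, user_agent):
    """Fill the $1, $2, ... placeholders of the model template from the match."""
    match = re.search(rule["regex"], user_agent, re.IGNORECASE)
    if match is None:
        return None
    model = rule["model"]
    for index, value in enumerate(match.groups(), start=1):
        model = model.replace(f"${index}", value or "")
    return model


# Hypothetical user agent, for illustration only.
ua = "Mozilla/5.0 (Linux; Android 9; Acme_Phone_7 Build/XYZ)"
print(build_model(RULE, ua))  # Phone 7
```

Here $1 is just `7`, while the raw model in the user agent is `Acme_Phone_7`; a rule like this would have to be restructured before $1 could serve as the raw model.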

sgiehl avatar May 11 '20 07:05 sgiehl

Dear @sgiehl ,

Would it be possible to consider @sanchezzzhak's suggestion, in order to avoid redoing thousands of regular expressions and tests?

Kind regards.

csanclop avatar May 11 '20 09:05 csanclop

That would mean adding regexes that match specific browser user agents. That could be done in an additional parser.

sgiehl avatar May 11 '20 12:05 sgiehl

I created a mini prototype, but it does not yet cover all possible cases: https://github.com/sanchezzzhak/device-detector/blob/6267/Parser/AliasDevice.php

fixture file https://github.com/sanchezzzhak/device-detector/blob/6267/regexes/alias_devices.yml

test class https://github.com/sanchezzzhak/device-detector/blob/6267/Tests/Parser/AliasDeviceTest.php

test fixture file https://github.com/sanchezzzhak/device-detector/blob/6267/regexes/alias_devices.yml

I would be glad to hear any ideas for a better name for the class.

sanchezzzhak avatar May 11 '20 20:05 sanchezzzhak

Thank you very much @sanchezzzhak !

This prototype is excellent! I think the name you chose for the class (alias_devices) is fine. I now have an idea of how these Matomo parsers work; this is the first time I am dealing with this library.

This is a working solution, right?

Kind regards.

csanclop avatar May 11 '20 21:05 csanclop

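You can try using it and report any problems.

// Assumes the autoloader of the prototype branch is already loaded.
use DeviceDetector\Parser\AliasDevice;

// Fall back to an empty string when the header is missing (e.g. on the CLI).
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

$parser = new AliasDevice();
$parser->setUserAgent($userAgent);
$result = $parser->parse();

// $result is an empty array, or ['name' => 'raw model name'] on a match.
var_dump($result);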

sanchezzzhak avatar May 12 '20 10:05 sanchezzzhak

This works fine, @sanchezzzhak. Thank you so much!

Kind regards.

csanclop avatar May 12 '20 14:05 csanclop

Thank you @sgiehl and @mimmi20 for helping me understand the problem, and the regexes and parsers behind the scenes.

Kind regards.

csanclop avatar May 12 '20 14:05 csanclop