uap-go icon indicating copy to clipboard operation
uap-go copied to clipboard

Regexes sorting is incorrect

Open gkalabin opened this issue 7 years ago • 7 comments

According to specification:

The list of regular-expressions regex shall be evaluated for a given user-agent string beginning with the first regex-item in the list to the last item. The first matching regex stops processing the list. Regex-matching shall be case sensitive.

Here is the proof that sorting of regexes will cause wrong detection results:

package main

import (
    "fmt"
    "log"

    "github.com/uap-go/uaparser"
)

const (
    // specificUA is matched by X and Y, X preceeds Y in regex list
    specificUA = "Opera/9.80 (VRE; Opera Mini/4.2/28.2794; U; en) Presto/2.8.119 Version/11.10"
    // broadUA is matched by Y
    broadUA    = "Opera/9.80 (Windows NT 5.1; U; ru) Presto/2.5.24 Version/10.53"
)

func main() {
    sortThreshold := 100001
    parser, err := uaparser.NewWithOptions("./uap-core/regexes.yaml", uaparser.EUserAgentLookUpMode, sortThreshold, 0, true, true)
    if err != nil {
        log.Fatal(err)
    }

    // specificUA is matched by X, everything is fine
    beforeSort := parser.Parse(specificUA).UserAgent

    // cause regexes sort by parsing broadUA many times: it will cause bubbling up of regex Y
    for i := 0; i < sortThreshold; i++ {
        parser.Parse(broadUA)
    }

    // specificUA is now matched by Y which bubbled up after sort. This causes wrong parsing results
    afterSort := parser.Parse(specificUA).UserAgent
    fmt.Printf("before sort:\t %#v\n after sort:\t %#v\n", beforeSort, afterSort)
}

Result:

$ go run test.go 
2016-08-10 18:21:37.284461949 +0300 MSK Sorting UserAgents slice
before sort:     &uaparser.UserAgent{Family:"Opera Mini", Major:"4", Minor:"2", Patch:""}
 after sort:     &uaparser.UserAgent{Family:"Opera", Major:"11", Minor:"10", Patch:""}

gkalabin avatar Aug 10 '16 15:08 gkalabin

According to the regex list there are two (or may be even more) regexes that match the 'specificUA'. At the beginning (beforeSort := parser.Parse(specificUA).UserAgent) you 'hit' the first one in the list of the regexes, but it doesn't mean there are no more regexes that can match, and the first one returned.

After it you hit some other regex, that is in "lower" index in the list with UA ,that doesn't hit the regex from the first case. This one cause the sort to "pop up" the regex which matched the broadUA to the top of the list.

After it, you are trying to find match for the specificUA once again and now the regex which matches both, specificUA and broadUA is at the top of the list and this one matches.

What's wrong here? So far I can see the correct behaviour - the more "successful" regex was poped up to the top of the list, to reduce the lookup time.

To summarize it I can see the following scenario: specificUA matches regexes A, B and C. At the beginning regex A was returned, because it was at the "higher" position than B and C. During the lookups B was popped up to the first position and now B is being returned as matches regex for specificUA.

Am I missing something?

evgenigourvitch avatar Aug 10 '16 20:08 evgenigourvitch

You're right, regex sorting works in this way. The problem here is that Opera/9.80 (VRE; Opera Mini/4.2/28.2794; U; en) Presto/2.8.119 Version/11.10 corresponds to opera mini, not opera. Before sorting, detection is correct and this UA is detected as opera mini. When list is sorted, then the detector says that it's opera, not opera mini, which in fact is incorrect

gkalabin avatar Aug 11 '16 08:08 gkalabin

This is because the regex '(Opera Mini)(?:/att)?/(\d+).(\d+)' (which is being matched at the first time) located at line 109 in the regexes.yaml and the regex (Opera)/9.80.*Version/(\d+).(\d+)(?:.(\d+))? (which is being matched for the loopped calls, and which affects the scoring) located at line 110. If you will swap them (regexes) you will get the wrong match at this call: beforeSort := parser.Parse(specificUA).UserAgent

As I described in my pull request, my code was supposed to reduce lookup time by "popping up" the most "usable" UA. And I think it does it well (especially on the high loaded cluster (4-6M requests/min))

evgenigourvitch avatar Aug 11 '16 08:08 evgenigourvitch

yes, it's exactly what sorting means. The second call result is indeed incorrect, since input user agent is opera mini, but it's being detected as opera (opera != opera mini).

gkalabin avatar Aug 11 '16 08:08 gkalabin

As a conclusion, sorting is correct, but in some cases, like described before, it can affect the regex match. Unfortunately, Golang doesn't support negative (and positive) lookbehind assertion, so it is not an option to create regex which will exclude (Opera Mini). So currently I can see two solutions:

  1. Not to use sort capability (pass false to the NewWithOptions method).
  2. Make some adjustments to the regexes to prevent this issue.

I'll try to implement some walkaround...

gkalabin, thank you for pointing at this issue.

evgenigourvitch avatar Aug 11 '16 21:08 evgenigourvitch

Cool - thanks for diving in. In the past when we've discussed tradeoffs with performance and accuracy, we've sided with accuracy and readability (e.g. we have a regex-driven yaml file that's relatively easy to edit/understand/consume in any language).

I'm open to sorting things if they don't affect readability (e.g. a change of format is out, but a transpile is OK) and result accuracy / consistency between language implementations.

:-)

On Thu, Aug 11, 2016 at 2:01 PM, evgenigourvitch [email protected] wrote:

As a conclusion, sorting is correct, but in some cases, like described before, it can affect the regex match. Unfortunately, Golang doesn't support negative (and positive) lookbehind assertion, so it is not an option to create regex which will exclude (Opera Mini). So currently I can see two solutions:

  1. Not to use sort capability (pass false to the NewWithOptions method).
  2. Make some adjustments to the regexes to prevent this issue.

I'll try to implement some walkaround...

gkalabin, thank you for pointing at this issue.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ua-parser/uap-go/issues/33#issuecomment-239291231, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfkWvyfJGHXLYjx003nmLdIMB2qGcFTks5qe42ngaJpZM4JhRJe .

elsigh avatar Aug 11 '16 22:08 elsigh