Useragent exclude doesn't work

Open jloh opened this issue 8 years ago • 1 comments

I'm using the latest version of the import_logs.py script and I can't get the --useragent-exclude flag to work.

I have log lines like this:

get.geojs.io 52.86.xxx.xx - - [24/Aug/2017:16:31:35 +1000] "GET /v1/ip/country.js HTTP/1.1" "-" 200 100 "-" "loader.io;8521b6348e76580727xxxxxxxxx" "US" "0.001"

And have a command line like this:

python import_logs.py --dry-run --url=$piwik_domain --idsite=$site_id --enable-http-errors --enable-http-redirects --enable-static --enable-bots --enable-reverse-dns --useragent-exclude='loader.io*' --token=$piwik_token --hostname=get.geojs.io --hostname=ipv4.geojs.io --hostname=ipv6.geojs.io --log-format-regex='((?P<host>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?) \S+\" \"\S+\" (?P<status>\S+) (?P<length>\S+) \"(?P<referrer>.*?)\" \"(?P<user_agent>.*?)\") \"\S+\" \"(?P<generation_time_milli>\S+)\"' $log_file

However, the user agent doesn't get excluded. I've also tried --useragent-exclude='loader.io' and --useragent-exclude='loader.io;8521b6348e76580727xxxxxxxxx', both with the same result.

My regex seems to detect the user agent correctly (https://regex101.com/r/I0s5kq/1); it just seems like the script isn't excluding it from the import?

jloh avatar Aug 24 '17 07:08 jloh

There are two things to note here:

  1. Matching is done by substring, not by regular expression. The correct value is either of the two options you tried without the asterisk (e.g. loader.io).
  2. --enable-bots changes the behaviour of the exclusion feature (see here).
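
To illustrate the first point, here is a minimal sketch of substring matching. It is not the script's actual code: the helper name matches_excluded is made up for this example, and the lowercasing is an assumption. It only shows why a value with a trailing asterisk cannot match.

```python
# Hypothetical helper, not taken from import_logs.py.
# Assumption: the comparison is case-insensitive.
USER_AGENT = "loader.io;8521b6348e76580727xxxxxxxxx"


def matches_excluded(user_agent, excluded):
    """Return True if any excluded string occurs inside the user agent."""
    ua = user_agent.lower()
    return any(s.lower() in ua for s in excluded)


# The "*" is treated as a literal character, and the agent contains no "*".
print(matches_excluded(USER_AGENT, ["loader.io*"]))  # False
# A plain substring of the agent matches.
print(matches_excluded(USER_AGENT, ["loader.io"]))   # True
# The full agent string is also a substring of itself.
print(matches_excluded(USER_AGENT, [USER_AGENT]))    # True
```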

If you remove the --enable-bots flag and use a plain substring of the agent for matching, you should get the desired result of having those hits excluded. If you only remove the asterisk and keep --enable-bots, however, the "excluded" agents should be detected as bots instead of regular users, rather than being skipped.
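
The interaction between the two flags can be sketched roughly as follows. This is a simplified model of the behaviour described in this comment, not the script's real implementation; the function classify_hit and its return values are hypothetical.

```python
# Hypothetical model of the behaviour described above, not the script's code.
def classify_hit(user_agent, excluded, enable_bots):
    """Return 'visitor', 'bot', or 'skipped' for a single log line."""
    matched = any(s in user_agent for s in excluded)
    if not matched:
        # No exclusion string matched: treated as a regular visitor.
        return "visitor"
    # A match is kept and flagged as a bot when bots are enabled,
    # and dropped from the import otherwise.
    return "bot" if enable_bots else "skipped"


agent = "loader.io;8521b6348e76580727xxxxxxxxx"

print(classify_hit(agent, ["loader.io*"], enable_bots=True))   # visitor
print(classify_hit(agent, ["loader.io"], enable_bots=True))    # bot
print(classify_hit(agent, ["loader.io"], enable_bots=False))   # skipped
```

Under these assumptions, only the last combination (plain substring, no --enable-bots) keeps the loader.io requests out of the import entirely.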

As this is probably a bit confusing, it might be worth enhancing the script to have separate flags for "this is a bot agent" and "completely ignore this agent".

mneudert avatar Aug 29 '17 18:08 mneudert