matomo-log-analytics

import_logs.py fails to detect log type with multiple IPs in first line

Open anonymous-matomo-user opened this issue 12 years ago • 2 comments

When using the X-Forwarded-For header for load-balanced sites or proxied traffic, it is possible for the web server to record multiple IPs on a line. When such a line is the first one in the log, it appears to break format detection in import_logs.py.

Broken example log:

218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"
108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "http://www.referringsite.com/news/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36"

Stack trace:

Traceback (most recent call last):
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1575, in <module>
    main()
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1539, in main
    parser.parse(filename)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1390, in parse
    format = self.detect_format(file)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1349, in detect_format
    logging.debug('Format %s is the best match', format.name)
AttributeError: 'NoneType' object has no attribute 'name'

A quick fix is to move any offending lines after a "good" line, but this is not easily automated.

The log above, reordered so the script works:

108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "http://www.referringsite.com/news/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36"
218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"

I admit my Python-fu is not excellent, but I may look into this over the weekend and try to patch the code. I believe the best option is to catch the error in detection and try the next line.
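
The "try the next line" idea could look roughly like the sketch below. This is not the code from import_logs.py: the function name detect_format_from_lines, the FORMATS table, and the simplified pattern are assumptions made for illustration. It only shows that lines matching no known format can be skipped instead of leaving the detected format as None.

```python
import re

# Simplified stand-in for a common-log-format pattern (an assumption, not the
# regex shipped in import_logs.py). It expects a single whitespace-free token
# for the client address, so "ip1, ip2" at the start of a line does not match.
COMMON_LOG = re.compile(
    r'(?P<ip>\S+)\s+\S+\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)

# Hypothetical registry of known formats: name -> compiled pattern.
FORMATS = {"common": COMMON_LOG}


def detect_format_from_lines(lines, max_lines=100):
    """Return the name of the first format that matches one of the lines.

    Lines that match no format are skipped. Returns None when nothing matches
    within max_lines, so the caller can report a clear error instead of
    failing on an attribute access.
    """
    for index, line in enumerate(lines):
        if index >= max_lines:
            break
        for name, pattern in FORMATS.items():
            if pattern.match(line):
                return name
    return None
```

A caller would still need to check for None before using the result, since a file may contain no recognizable line at all.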

Migrated from piwik/piwik#4230

anonymous-matomo-user, Oct 18 '13 16:10

Hi Guys,

I've patched my import_logs.py as follows.

My Nginx log format:

log_format vhosts '$host $http_x_forwarded_for - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';

My common log format:

_COMMON_LOG_FORMAT = (
    '(?P<ip>[0-9]*.?[0-9]*.?[0-9]*.?[0-9]*)[\,\s?\S+]*\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    '"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)

[0-9]*.?[0-9]*.?[0-9]*.?[0-9]* keeps the "," out of the <ip> regex group and avoids the strict IPv4 regex filter.

[\,\s?\S+]* allows zero or more additional IPs, so that only the first one is matched in Piwik.
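
A somewhat stricter variant of the same idea is sketched below. It is an assumption for illustration, not the patch above and not the regex shipped in import_logs.py. In Python's re syntax an unescaped . matches any character, and a bracketed set containing both \s and \S matches any single character, so this sketch escapes the dots and consumes extra addresses with an explicit comma-separated group. The names FORWARDED_LOG and parse_line are hypothetical, the first address is IPv4-only, and the pattern targets lines shaped like the samples in this issue (no leading $host field).

```python
import re

FORWARDED_LOG = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3})'   # first (client) IPv4 address
    r'(?:,\s*[0-9A-Fa-f.:]+)*'           # zero or more further addresses
    r'\s+\S+\s+(?P<userid>\S+)\s+'
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)


def parse_line(line):
    """Return the named fields of a log line as a dict, or None if no match."""
    match = FORWARDED_LOG.match(line)
    return match.groupdict() if match else None
```

With this pattern, only the first address ends up in the ip group, whether the line carries one address or several.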

I hope this is useful. Don't hesitate to tell me if my workaround is not a proper one.

AlroneRhyn, Feb 09 '17 15:02

@AlroneRhyn Would you mind creating a pull request with your change, and ideally updating our tests to show what you've improved? Thanks!

mattab, Feb 20 '17 08:02