matomo-log-analytics
import_logs.py fails to detect log type with multiple IPs in first line
When the X-Forwarded-For header is used for load-balanced sites or proxied traffic, the web server can record multiple IPs on a single line. This appears to break the log format detection in import_logs.py.
Broken example log:
218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"
108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "http://www.referringsite.com/news/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36"
Stack trace:
Traceback (most recent call last):
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1575, in <module>
    main()
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1539, in main
    parser.parse(filename)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1390, in parse
    format = self.detect_format(file)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1349, in detect_format
    logging.debug('Format %s is the best match', format.name)
AttributeError: 'NoneType' object has no attribute 'name'
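The traceback is consistent with no candidate format matching the first line, so the detected format is None. Here is a minimal sketch of why that can happen, assuming the common-log pattern expects a single whitespace-free token for the IP followed by two more fields. The regex and the names ASSUMED_COMMON and matches are illustrative assumptions, not copied from import_logs.py, and the user-agent strings are shortened:

```python
import re

# Assumed shape of a common-log pattern: one token for the IP,
# then two more whitespace-separated fields before the date.
ASSUMED_COMMON = re.compile(
    r'(?P<ip>\S+)\s+\S+\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)

single_ip = '108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "-" "Mozilla/5.0"'
multi_ip = '218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0"'

def matches(line):
    """Return True if the assumed pattern matches from the start of the line."""
    return ASSUMED_COMMON.match(line) is not None

print(matches(single_ip))  # one IP: the field count lines up
print(matches(multi_ip))   # two IPs: the extra token shifts every later field
```

With the second IP present, the line has one more field before the date than the pattern allows, so the match fails.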
While a quick fix is to move any offending lines below a "good" line, this is not easily automated.
The log above, reordered so that the script works:
108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "http://www.referringsite.com/news/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36"
218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"
I admit my Python-fu is not excellent, but I may look into it over the weekend and try to patch the code. I believe the best option is to catch the error during detection and try the next line.
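A rough sketch of that idea, assuming detection can be modelled as trying each candidate pattern against the first few lines rather than only the first one. FORMATS, detect_format and max_lines are hypothetical names for illustration and do not reflect the actual structure of import_logs.py:

```python
import re

# Assumed stand-in for the script's candidate formats (illustrative only).
FORMATS = {
    'common': re.compile(
        r'(?P<ip>\S+)\s+\S+\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
        r'"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
    ),
}

def detect_format(lines, formats=FORMATS, max_lines=100):
    """Return the name of the first format that matches one of the
    first max_lines lines, or None if nothing matches."""
    for lineno, line in enumerate(lines):
        if lineno >= max_lines:
            break
        for name, pattern in formats.items():
            if pattern.match(line):
                return name
    return None

log = [
    '218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0"',
    '108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "-" "Mozilla/5.0"',
]

detected = detect_format(log)
if detected is None:
    # Report a readable error instead of dereferencing None.
    print('Cannot detect the log format automatically')
else:
    print('Format %s is the best match' % detected)
```

This only makes detection succeed. The multi-IP lines themselves would still fail to parse unless the pattern is also changed to accept them.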
Migrated from piwik/piwik#4230
Hi guys,
I've patched my import_logs.py as follows.
My Nginx log format:
log_format  vhosts '$host $http_x_forwarded_for - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"';
My common log format:
_COMMON_LOG_FORMAT = (
    '(?P<ip>[0-9]*.?[0-9]*.?[0-9]*.?[0-9]*)[\,\s?\S+]*\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    '"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)
[0-9]*.?[0-9]*.?[0-9]*.?[0-9]* keeps the "," out of the <ip> regex group and avoids the strict IPv4 regex filter.
[\,\s?\S+]* allows zero or more additional IPs to follow, while only the first one is matched and passed to Piwik.
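A quick check of the pattern above against the two sample lines from the issue (user-agent strings shortened), as a sketch rather than a proper test for the importer. Two things are worth knowing: the unescaped "." in the <ip> group matches any character, not only a dot, and [\,\s?\S+] is a character class that accepts any character at all. STRICTER is a hypothetical IPv4-only alternative added here for comparison and is not part of the patch:

```python
import re

# The workaround pattern as posted in the comment above.
PATCHED = re.compile(
    r'(?P<ip>[0-9]*.?[0-9]*.?[0-9]*.?[0-9]*)[\,\s?\S+]*\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)

# Hypothetical tighter variant: escaped dots, and extra IPs only as ", <token>".
STRICTER = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3})(?:,\s*\S+)*\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"\S+\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\S+)\s+(?P<length>\S+)'
)

multi_ip = '218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0"'
single_ip = '108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "-" "Mozilla/5.0"'
not_an_ip = '1a2b3c4 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "-" "Mozilla/5.0"'

def first_ip(pattern, line):
    """Return the <ip> group if the pattern matches the line, else None."""
    m = pattern.match(line)
    return m.group('ip') if m else None

for line in (multi_ip, single_ip, not_an_ip):
    print(first_ip(PATCHED, line), first_ip(STRICTER, line))
```

Both patterns keep only the first IP on the multi-IP line. They differ on the third line, which the posted pattern accepts because of the unescaped dots.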
I hope this is useful. Don't hesitate to tell me if my workaround is not a proper one.
@AlroneRhyn Would you mind creating a pull request with your change, and ideally updating our tests to show what you've improved? Thanks!