goatcounter
goatcounter copied to clipboard
Apache combined log import throws many "path: must be set, must be longer than 1 characters" errors
Hi,
I am setting up my own goatcounter installation based on apache log files. The log files are the standard "combined" format. ./goatcounter import -format=combined
works fine, apart from that some (non-standard looking) requests cause import errors. Here are some examples:
13:16:32 ERROR: error processing line 4740 {error="path: must be set, must be longer than 1 characters." line="0.0.0.0 - - [08/May/2022:01:52:08 +0100] \"OPTIONS * HTTP/1.1\" 200 - \"-\" \"-\"" lineno=4740}
13:16:32 ERROR: error processing line 4750 {error="path: must be set, must be longer than 1 characters." line="0.0.0.0 - - [08/May/2022:01:52:19 +0100] \"GET %. HTTP/1.1\" 400 226 \"-\" \"-\"" lineno=4750}
13:16:32 ERROR: error processing line 4501 {error="path: must be set, must be longer than 1 characters." line="0.0.0.0 - - [07/May/2022:18:31:17 +0100] \"GET /some/slightly/longer/path.html HTTP/1.1\" 304 - \"-\" \"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"" lineno=4501}
combined is defined as LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
in /etc/httpd/conf/httpd.conf
.
Could you please investigate this?
Do you have a few example lines from your log where it fails? Otherwise I need to set up all of Apache to test this, which is a bit of work.
Thanks for getting to this - here are some example failing log lines:
0.0.0.0 - - [08/May/2022:01:52:08 +0100] "OPTIONS * HTTP/1.1" 200 - "-" "-"
0.0.0.0 - - [08/May/2022:01:52:19 +0100] "GET %. HTTP/1.1" 400 226 "-" "-"
0.0.0.0 - - [07/May/2022:18:31:17 +0100] "GET /some/slightly/longer/path.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
0.0.0.0 - - [06/Jun/2022:15:42:32 +0100] "-" 408 - "-" "-"
0.0.0.0 - - [06/Jun/2022:15:24:13 +0100] "PRI * HTTP/2.0" 400 226 "-" "-"
Thanks, there's a few different issues:
0.0.0.0 - - [08/May/2022:01:52:08 +0100] "OPTIONS * HTTP/1.1" 200 - "-" "-"
Here the response size is a -
rather than a number (the 3rd from the right), and it only checks for \d+
. It's a simple fix.
0.0.0.0 - - [08/May/2022:01:52:19 +0100] "GET %. HTTP/1.1" 400 226 "-" "-"
0.0.0.0 - - [06/Jun/2022:15:24:13 +0100] "PRI * HTTP/2.0" 400 226 "-" "-"
The path on these (%.
and *
) don't start with a /
. Not entirely sure what the %.
in there means? Or what a PRI
request is?
You probably want to filter out most non-GET requests anyway with -exclude '!method:GET'
, which should take care of that PRI
request (whatever it may be).
0.0.0.0 - - [06/Jun/2022:15:42:32 +0100] "-" 408 - "-" "-"
Here there's also no HTTP verb, path, or HTTP version; not sure what to do with this? I'm confused how/why this gets logged too 🤔
Right now fields are always mandatory and you can't really specify an optional field; the format of them is also somewhat strict on purpose, to prevent the wrong fields from accidentality being used. The problem here is that Apache treats verb + path + HTTP as a single "word" and puts -
in there, whereas GoatCounter expects them to be three different fields, so meh.
I think I'll improve the scanner a bit to reject this as an invalid line with a better error message.
0.0.0.0 - - [07/May/2022:18:31:17 +0100] "GET /some/slightly/longer/path.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This one seems to work for me? I don't get an error and see a pageview recorded in the database. It is (correctly) marked a bot though, so it doesn't show in the UI.
0.0.0.0 - - [06/Jun/2022:15:42:32 +0100] "-" 408 - "-" "-"
408 is a request timeout; so I guess what happened is that someone established a TCP connection to your server but never sent any data, hence there's nothing to log.
Those other two GET %.
and PRI *
are probably people sending junk data, maybe to scan for specific security problems in some webapps or the like. Looking at the Apache docs the %r
just logs the first request line.
Anyhow, all of these are just bad/junk requests. The error messaging on this could be improved a bit, but I'm not sure if there's anything more we can/should do with this; it intentionally filters as little as possible by default so people have the flexibility to do whatever with it; if you want to use goatcounter to track WebDav requests or do something else crazy then you can!
I could maybe add a new special -exclude junk
keyword to filter this kind of stuff though, similar to how you can already do -exclude static
to automatically exclude most static files.
I fixed the outright bugs last year, and I'm not sure anything more needs to be done as I never got a reply? Closing anyway – can always reopen if more work needs to be done.