goatcounter icon indicating copy to clipboard operation
goatcounter copied to clipboard

Apache combined log import throws many "path: must be set, must be longer than 1 characters" errors

Open watercrossing opened this issue 2 years ago • 4 comments

Hi,

I am setting up my own goatcounter installation based on apache log files. The log files are the standard "combined" format. ./goatcounter import -format=combined works fine, apart from that some (non-standard looking) requests cause import errors. Here are some examples:

13:16:32 ERROR: error processing line 4740 {error="path: must be set, must be longer than 1 characters." line="0.0.0.0 - - [08/May/2022:01:52:08 +0100] \"OPTIONS * HTTP/1.1\" 200 - \"-\" \"-\"" lineno=4740}
13:16:32 ERROR: error processing line 4750 {error="path: must be set, must be longer than 1 characters." line="0.0.0.0 - - [08/May/2022:01:52:19 +0100] \"GET %. HTTP/1.1\" 400 226 \"-\" \"-\"" lineno=4750}
13:16:32 ERROR: error processing line 4501 {error="path: must be set, must be longer than 1 characters." line="0.0.0.0 - - [07/May/2022:18:31:17 +0100] \"GET /some/slightly/longer/path.html HTTP/1.1\" 304 - \"-\" \"Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)\"" lineno=4501}

combined is defined as LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined in /etc/httpd/conf/httpd.conf.

Could you please investigate this?

watercrossing avatar May 13 '22 12:05 watercrossing

Do you have a few example lines from your log where it fails? Otherwise I need to set up all of Apache to test this, which is a bit of work.

arp242 avatar Jun 06 '22 15:06 arp242

Thanks for getting to this - here are some example failing log lines:

0.0.0.0 - - [08/May/2022:01:52:08 +0100] "OPTIONS * HTTP/1.1" 200 - "-" "-"
0.0.0.0 - - [08/May/2022:01:52:19 +0100] "GET %. HTTP/1.1" 400 226 "-" "-"
0.0.0.0 - - [07/May/2022:18:31:17 +0100] "GET /some/slightly/longer/path.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
0.0.0.0 - - [06/Jun/2022:15:42:32 +0100] "-" 408 - "-" "-"
0.0.0.0 - - [06/Jun/2022:15:24:13 +0100] "PRI * HTTP/2.0" 400 226 "-" "-"

watercrossing avatar Jun 06 '22 15:06 watercrossing

Thanks, there's a few different issues:

0.0.0.0 - - [08/May/2022:01:52:08 +0100] "OPTIONS * HTTP/1.1" 200 - "-" "-"

Here the response size is a - rather than a number (the 3rd from the right), and it only checks for \d+. It's a simple fix.

0.0.0.0 - - [08/May/2022:01:52:19 +0100] "GET %. HTTP/1.1" 400 226 "-" "-"
0.0.0.0 - - [06/Jun/2022:15:24:13 +0100] "PRI * HTTP/2.0" 400 226 "-" "-"

The path on these (%. and *) don't start with a /. Not entirely sure what the %. in there means? Or what a PRI request is?

You probably want to filter out most non-GET requests anyway with -exclude '!method:GET', which should take care of that PRI request (whatever it may be).

0.0.0.0 - - [06/Jun/2022:15:42:32 +0100] "-" 408 - "-" "-"

Here there's also no HTTP verb, path, or HTTP version; not sure what to do with this? I'm confused how/why this gets logged too 🤔

Right now fields are always mandatory and you can't really specify an optional field; the format of them is also somewhat strict on purpose, to prevent the wrong fields from accidentality being used. The problem here is that Apache treats verb + path + HTTP as a single "word" and puts - in there, whereas GoatCounter expects them to be three different fields, so meh.

I think I'll improve the scanner a bit to reject this as an invalid line with a better error message.

0.0.0.0 - - [07/May/2022:18:31:17 +0100] "GET /some/slightly/longer/path.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

This one seems to work for me? I don't get an error and see a pageview recorded in the database. It is (correctly) marked a bot though, so it doesn't show in the UI.

arp242 avatar Jun 06 '22 16:06 arp242

0.0.0.0 - - [06/Jun/2022:15:42:32 +0100] "-" 408 - "-" "-"

408 is a request timeout; so I guess what happened is that someone established a TCP connection to your server but never sent any data, hence there's nothing to log.

Those other two GET %. and PRI * are probably people sending junk data, maybe to scan for specific security problems in some webapps or the like. Looking at the Apache docs the %r just logs the first request line.

Anyhow, all of these are just bad/junk requests. The error messaging on this could be improved a bit, but I'm not sure if there's anything more we can/should do with this; it intentionally filters as little as possible by default so people have the flexibility to do whatever with it; if you want to use goatcounter to track WebDav requests or do something else crazy then you can!

I could maybe add a new special -exclude junk keyword to filter this kind of stuff though, similar to how you can already do -exclude static to automatically exclude most static files.

arp242 avatar Jun 06 '22 16:06 arp242

I fixed the outright bugs last year, and I'm not sure anything more needs to be done as I never got a reply? Closing anyway – can always reopen if more work needs to be done.

arp242 avatar Dec 15 '23 10:12 arp242