go-syslog icon indicating copy to clipboard operation
go-syslog copied to clipboard

RFC3164: square brackets in the message screw parsing of ProcID

Open dimonomid opened this issue 4 years ago • 3 comments

Parser created as follows:

		p := rfc3164.NewParser(
			rfc3164.WithYear(rfc3164.Year{YYYY: 2020}),
			rfc3164.WithRFC3339(),
		)

Example input:

<0>Mar  1 09:38:48 myhost myapp[12345]: foo [bar] baz

Output (take a look at ProcID):

{
  "Facility": 0,
  "Severity": 0,
  "Priority": 0,
  "Timestamp": "2020-03-01T09:38:48Z",
  "Hostname": "myhost",
  "Appname": "myapp",
  "ProcID": "foo [bar",
  "MsgID": null,
  "Message": "foo [bar] baz"
}

dimonomid avatar Mar 01 '20 11:03 dimonomid

@dimonomid thanks again!

russorat avatar Mar 02 '20 19:03 russorat

I also encounter a variant of this issue (within telegraf) with the following message:

Aug  2 11:54:06 hostname e Random message

Which is parsed by telegraf as:

{
  "name": "syslog",
  "fields": {
    "facility_code": 3,
    "message": "Random message",
    "procid": "servicename[48737",
    "severity_code": 5,
    "timestamp": 1690979105000000000
  },
  "tags": {
    "appname": "catsitd",
    "facility": "daemon",
    "host": "localhost",
    "hostname": "myhost",
    "severity": "notice",
    "source": "X.X.X.X"
  },
  "timestamp": 1690971905
}

Notice how the message is trimmed. It should be servicename[48737]: Random message. Either the procid field pattern is too greedy (unlikely, as it would still include the original PID otherwise), or the parsing code takes the rightmost match as the ProcID.

The relevant code seems to be in rfc3164/machine.go.rl:

# The first not alphanumeric character starts the content (usually containing a PID) part of the message part
contentval = !alnum @err(err_contentstart) >mark print* %set_content @err(err_content);

content = '[' contentval ']'; # todo(leodido) > support ':' and ' ' too. Also they have to match?

Here, contentval seems to also match any separator that would fit between the [ ] pair. I'm not familiar with this lexer though so I have no idea how to fix it.

z3bra avatar Aug 02 '23 10:08 z3bra

So I've quickly read about Ragel SM parsing, and could come up with a solution which doesn't break any test. I've also added a test case with my previous example, to confirm it's working.

In the code, the content field is apparently only used for parsing the procid. You can tell by the content = '[' contentval ']', which limits matching to what's inside the square brackets. The RFC3164 states that the CONTENT should be everything that immediately follows the TAG. In this case the ProcID is only a small part of the CONTENT, which had me confused when I first read this parser.

z3bra avatar Aug 02 '23 13:08 z3bra