loki icon indicating copy to clipboard operation
loki copied to clipboard

promtail: fix syslog udp reader blocking forever

Open unawarez opened this issue 2 years ago • 1 comments

What this PR does / why we need it:

Prior to this patch, incoming UDP syslogs would only be sent to promtail targets all at once when promtail shut down. The UDP transport internally uses io.Pipes between the main receive loop and the parser. The main loop did not close the write end of the pipe after writing the packet, so the parser blocked forever on reading. Now the main loop explicitly closes the pipe.

Which issue(s) this PR fixes:

#6772 is probably this

Special notes for your reviewer:

Checklist

  • [X] Reviewed the CONTRIBUTING.md guide
  • [ ] Documentation added (n/a)
  • [ ] Tests updated (n/a)
  • [ ] CHANGELOG.md updated
  • [ ] Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md (n/a)

unawarez avatar Sep 23 '22 00:09 unawarez

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 23 '22 00:09 CLAassistant

I'm backing this out to a draft, because it created another bug on a different machine... I think it's something about having a consistent source port for the packets, which points to something about UDPTransport.streams keeping pipes based on source address.

unawarez avatar Sep 23 '22 01:09 unawarez

OK I've now seen this work on both macOS and FreeBSD, forwarding real syslogs from the latter.

I just could not get any logs to post without closing the internal io.Pipes. I sort of wonder if it's framing-related, and if writing an octet count in front of the packet would make it work without closing...

Closing the pipes also meant they couldn't be persisted in the streams map. Without that, it might be worth directly using the non-stream-based parsers from go-syslog and dropping the internal pipes altogether for UDP...

unawarez avatar Sep 23 '22 02:09 unawarez

Thanks @unawarez for looking into this again!

Closing the pipes also meant they couldn't be persisted in the streams map. Without that, it might be worth directly using the non-stream-based parsers from go-syslog and dropping the internal pipes altogether for UDP...

That sounds like a reasonable approach to me. However, I would like to know whether this has performance implications. Writing a benchmark test would be great!

chaudum avatar Sep 23 '22 10:09 chaudum

Cool, I can try writing a benchmark test over the next couple days

unawarez avatar Sep 23 '22 17:09 unawarez

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Nov 02 '22 07:11 stale[bot]

@unawarez Can you rebase this PR? The CI is failing due to a change in the protobuf files. Also did you have any luck writing a benchmark?

MichelHollands avatar Nov 07 '22 14:11 MichelHollands

Haven’t had time unfortunately… I can definitely rebase this tonight, and if I can find a similar benchmark in this repo to start from then maybe I could put that together while I’m at it.

On Mon, Nov 7, 2022, at 08:23, Michel Hollands wrote:

@unawarez https://github.com/unawarez Can you rebase this PR? The CI is failing due to a change in the protobuf files. Also did you have any luck writing a benchmark?

— Reply to this email directly, view it on GitHub https://github.com/grafana/loki/pull/7231#issuecomment-1305690863, or unsubscribe https://github.com/notifications/unsubscribe-auth/A3HIGZERKABUPWAXFKJ5WODWHEGFDANCNFSM6AAAAAAQTRLPT4. You are receiving this because you were mentioned.Message ID: @.***>

unawarez avatar Nov 08 '22 01:11 unawarez

somehow f7fa4c4 was categorically worse, and did not fix the extra newline test failure, which main does not have

unawarez avatar Nov 08 '22 05:11 unawarez

I found there is already a Benchmark_SyslogTarget. On my machine I get this for current main, a1d46d2bcc077e75333f2b23a0bcbe08dc8dadd6:

% (cd ../../pkg/promtail/targets/syslog; go clean && go test -bench=SyslogTarget -run='^#' -count=5)
goos: darwin
goarch: arm64
pkg: github.com/grafana/loki/clients/pkg/promtail/targets/syslog
Benchmark_SyslogTarget/tcp-10         	      1	1001671917 ns/op	  46704 B/op	    437 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1001956250 ns/op	  46160 B/op	    433 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1001519250 ns/op	  46368 B/op	    431 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1001915375 ns/op	  46736 B/op	    432 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1002232375 ns/op	  45792 B/op	    429 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001507250 ns/op	  45096 B/op	    468 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001478791 ns/op	  46048 B/op	    476 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001407458 ns/op	  45416 B/op	    471 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001726917 ns/op	  45992 B/op	    477 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001358750 ns/op	  45352 B/op	    470 allocs/op
PASS
ok  	github.com/grafana/loki/clients/pkg/promtail/targets/syslog	10.718s

On this PR 94f4b16ffb2f7ef835190c7f4cfc6461686eb4b8 I get this instead:

Benchmark_SyslogTarget/tcp-10         	      1	1001681458 ns/op	  46296 B/op	    436 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1002053958 ns/op	  46736 B/op	    432 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1001664959 ns/op	  46048 B/op	    432 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1002507042 ns/op	  46112 B/op	    429 allocs/op
Benchmark_SyslogTarget/tcp-10         	      1	1001686292 ns/op	  46400 B/op	    432 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1002141375 ns/op	 161752 B/op	    802 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1002111208 ns/op	 158616 B/op	    793 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001899958 ns/op	 159288 B/op	    783 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001866958 ns/op	 161096 B/op	    798 allocs/op
Benchmark_SyslogTarget/udp-10         	      1	1001939333 ns/op	 156520 B/op	    776 allocs/op

Much worse heap activity for sure, and still has the extra newline test failure. I'll see what else I can do.

But! There's another interesting story if you run more than one iteration. That's why I was using -count=5 above. If I replace -count=5 with -benchtime=2s then even on current main a1d46d2bcc077e75333f2b23a0bcbe08dc8dadd6 the UDP benchmark errors out:

% (cd ../../pkg/promtail/targets/syslog; go clean && go test -bench=SyslogTarget -run='^#' -benchtime=2s)
goos: darwin
goarch: arm64
pkg: github.com/grafana/loki/clients/pkg/promtail/targets/syslog
Benchmark_SyslogTarget/tcp-10         	  31761	    64496 ns/op	  32235 B/op	    360 allocs/op
Benchmark_SyslogTarget/udp-10         	--- FAIL: Benchmark_SyslogTarget/udp-10
    syslogtarget_test.go:350: 
        	Error Trace:	/Users/me/shelf/loki/clients/pkg/promtail/targets/syslog/syslogtarget_test.go:350
        	           				/Users/me/shelf/loki/clients/pkg/promtail/targets/syslog/benchmark.go:193
        	           				/Users/me/shelf/loki/clients/pkg/promtail/targets/syslog/benchmark.go:334
        	           				/Users/me/shelf/loki/clients/pkg/promtail/targets/syslog/asm_arm64.s:1172
        	Error:      	Condition never satisfied
        	Test:       	Benchmark_SyslogTarget/udp
        	Messages:   	expected: 9210 got:2072
--- FAIL: Benchmark_SyslogTarget
FAIL
exit status 1
FAIL	github.com/grafana/loki/clients/pkg/promtail/targets/syslog	44.354s

Maybe it's the "udp messages never emitted" bug manifesting, or maybe it's a coincidence.

Either way I should have time tomorrow to try out the string-based parser (https://github.com/grafana/loki/pull/7231#issuecomment-1255740151) and be able to get a comparison against main using this benchmark. I have some hope that approach could get past the extra newline issue.

unawarez avatar Nov 08 '22 06:11 unawarez

I dug pretty deep into the tests trying to figure out the flakiness. I'm about to rebase and start over with a commit that only makes some changes to the unit test and benchmark for UDP. The biggest thing is I think octet counting inside UDP packets is not allowed. Octet counting is only defined in the TCP RFC 6587. The UDP RFC 5426 just says the packet contents are formatted "according to RFC 5424".

unawarez avatar Nov 09 '22 01:11 unawarez

OK, here is a comparison of all 3 commits above. Sadly -benchtime 1.1s still makes all the UDP benchmarks die on all these commits, so it's still a -count 5 comparison.

Anyway I'm inclined to think udp_octetcounting and udp_newline failures are not true problems based on RFC digging noted above. I think the extra newline failure is because the test sender is putting a newline in the packet, and now the UDP pipeline is faithfully preserving that newline.

edit: deleted giant block of outdated test & bench output

unawarez avatar Nov 09 '22 03:11 unawarez

I don't think I'll be able to keep digging, but as one more experiment with the benchtime=1.1s UDP benchmarks failing, I added a 1ms sleep after this line https://github.com/grafana/loki/blob/1f0a3c291d3e7498982a06e7f542697008af7d53/clients/pkg/promtail/targets/syslog/syslogtarget_test.go#L348 and then the benchmark passed. That makes me think the failure is from some UDP packets actually getting dropped when too many are sent at once...

unawarez avatar Nov 09 '22 05:11 unawarez

I learned more about some golang APIs and looked at syslogtarget_test again, and I'd actually like to propose another commit that tightens up the benchmark implementation. AFAICT it only had a 1s measurement resolution because it's relying on require.Eventually to block with a 1s tick. I'd like to write another fake Client impl that directly exposes a sync primitive for the caller to block on. I should have an example of that here pretty soon.

unawarez avatar Nov 10 '22 04:11 unawarez

+1 on this PR, it's a very important one if you want, like me, to use promtail+syslog for nginx, because nginx doesn't support TCP for syslog.

For more context, we are about to chose our logging component for our platform and this small bug can prevent us to use promtail, even if we already use loki for aggregation.

I really hope we'll be able to see this PR merged soon! 😇

davinkevin avatar Dec 06 '22 16:12 davinkevin

Because I left this with a bit of an infodump, I'll write a summary of where I left it:

  • The benchmark that already exists wasn't really working because it blocked on a 1-second wait for completion. The first commit is just about tightening that up, so probably that commit will be helpful no matter what. The other commits depend on making a bunch of decisions that I don't feel like I can drop in and make, so I think you could edit and/or cherry-pick some.

    One Heisenbug that still exists with the UDP benchmarks is, if it sends too many packets too fast then some packets are genuinely dropped by the OS. So the benchmark works fine up to some number of simultaneous requests and then it gets flaky. I didn't do any work toward fixing that.

  • Commit 2 is the first maybe-opinionated commit. Currently the tests and benchmarks are parameterized as {tcp newline-separated, tcp octet-counted, udp newline-separated, udp octet-counted}. I added a "udp raw" variant, which means the UDP payload exactly equals the syslog message. I think this is the only well-defined format for UDP syslogs (see mentions of RFCs some comments above). The other part of the story with these parameterized tests is: the UDP newline and UDP octet-counted variants fail on my new implementations, and tl;dr I think a valid way to fix those test failures is to remove UDP-newline and UDP-octet-counted, but I didn't do that in any commit here. I believe those UDP test variants have only been passing because of the current implementation details, and I also think the lack of a "UDP raw" flavor was how this PR's original issue was never caught. There are more details in the comments/commits.

  • The last 2 commits are 2 different approaches for fixing this, using the stream-based syslogparser or the string-based syslogparser. All the commits in this PR have test and bench output in the commit messages, but it will be a little outdated now, and unfortunately I don't have the env set up anymore to re-run them. In the end the numbers were disappointing compared to the original implementation. But they both did solve the issue. Maybe someone with real Golang experience could improve. (Looking at the syslogparser dependency's code could maybe be worthwhile too...)

Happy to explain anything, otherwise I leave it up to the maintainers to take/edit what they will :v:

unawarez avatar Dec 12 '22 02:12 unawarez

This looks good so far. CI is failing because you've got a test failure and a linting error.

MasslessParticle avatar Jan 18 '23 22:01 MasslessParticle

@unawarez would you like us to take over?

jeschkies avatar Jan 31 '23 14:01 jeschkies

@MasslessParticle @jeschkies yes, see https://github.com/grafana/loki/pull/7231#issuecomment-1345765029

unawarez avatar Jan 31 '23 19:01 unawarez

@unawarez World you grant me write access to your form so I can push to this branch? This would simplify things.

jeschkies avatar Feb 01 '23 09:02 jeschkies

I've took a closer look and I'm not sure this approach is the correct one. Seems the original issue https://github.com/grafana/loki/issues/6772 has more details.

jeschkies avatar Feb 02 '23 11:02 jeschkies

Based on the comments/activity for this one, I'm going to close it for now

MasslessParticle avatar Mar 31 '23 17:03 MasslessParticle