moby icon indicating copy to clipboard operation
moby copied to clipboard

Make log reading more robust to errors

Open cpuguy83 opened this issue 1 year ago • 7 comments

Fundamentally this change (beyond the minor refactor and tracing spans) is to not treat errors reading from log files as fatal to providing logs to the client.

The need for this change is because often log files may be corrupted either due to old bugs in dockerd or for other reasons. One specific thing we have seen seems to be caused by ext4 by default is not doing full journaling. On power loss or kernel panic this can cause the log data to be corrupted. We see this mostly on IoT devices which are often not on battery backup or frequently hard cut from power.

Instead of erroring out a stream on errors from a particular file this just skips on to the next file.

I had made a previous attempt to be a little more clever and scrub the file to try and find a place where we can start parsing again but realized this can cause serious issues and is generally error prone.

This also takes care of a long-standing todo to handle log files lazily (or at least as lazily as possible) so we only end up decompressing files just-in-time.

Unfortunately we don't have a way to return warnings to the client that we had to skip files, however with the tracing changes at least it could be seen there and in the daemon logs.

When reading logs with the `jsonfile` or `local` log drivers, any errors while trying to read or parse underlying log files will cause the rest of the file to be skipped and move to the next log file (if one exists) rather than returning an error to the client and closing the stream.
The errors are viewable in the dockerd logs and exported to traces when tracing is configured.

When reading log files, compressed log files are now only decompressed when needed rather than decompressing all files before starting the log stream.

cpuguy83 avatar Jun 14 '24 01:06 cpuguy83

The issue with Windows seems to be caused by the frequency with which time is sampled on Windows, which is by default every 15ms. We can request greater precision from windows (e.g. 1ms which is what linux has) but this has performance implications.

I looked at using mod time but on linux filesystems the precision is in ms, not ns so its not reliable (tests fail on Linux here but pass on Windows, actually).

Trying to not have to digest every log file when we compress it. Sticking the the time theme, we could add another field to our gzip header that uses a precise time that is just the time we rotated it for compression.

cpuguy83 avatar Jun 18 '24 18:06 cpuguy83

Ok I think this is ready. @corhere I put back sharedtemp as discussed.

cpuguy83 avatar Jun 26 '24 16:06 cpuguy83

=== FAIL: github.com/docker/docker/daemon/logger/loggerutils TestTailFiles/handle_corrupted_data (10.01s)
    logfile_test.go:189: timeout waiting for log message

corhere avatar Jun 28 '24 18:06 corhere

So unfortunately its not that straight forward to have forwarder manage everything and still return just a bool because logfile following needs to have different error handling semantics from tailing.

What I ended p doing here is wrapping the decoder's Decode function from tailFiles to make it so if the underlying decoder returns an error then the error is wrapped in an io.EOF, which allows for this different error handling semantics.

cpuguy83 avatar Jul 05 '24 18:07 cpuguy83

@corhere I think I've addressed everything here.

cpuguy83 avatar Jul 22 '24 19:07 cpuguy83

Hmmm, interesting panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x556232a85fdd]

goroutine 57732 [running]:
github.com/docker/docker/daemon/containerd.(*ImageService).imageSummary.func1.2(0xc0003c9b80?)
	/go/src/github.com/docker/docker/daemon/containerd/image_list.go:358 +0xfd
github.com/docker/docker/container.(*memoryStore).ApplyAll.func1(0x0?)
	/go/src/github.com/docker/docker/container/memory_store.go:77 +0x23
created by github.com/docker/docker/container.(*memoryStore).ApplyAll in goroutine 57632
	/go/src/github.com/docker/docker/container/memory_store.go:76 +0x52
[rootlesskit:child ] error: command [/usr/local/bin/dockerd-rootless.sh --debug --host unix:///go/src/github.com/docker/docker/bundles/test-integration/docker.sock --storage-driver overlayfs --pidfile bundles/test-integration/docker.pid --userland-proxy=true --experimental] exited: exit status 2

Unrelated to this PR, but what's strange here this stack trace matches the code from master, but this PR is based off a master at the time where this code was not present yet.

github.com/docker/docker/daemon/containerd.(*ImageService).imageSummary.func1.2(0xc0003c9b80?) /go/src/github.com/docker/docker/daemon/containerd/image_list.go:358 +0xfd

Code from this PR: https://github.com/moby/moby/blob/6d941222173a11ed25c3a5daa75c21e03fe28ea4/daemon/containerd/image_list.go#L356-L358

Code from master: https://github.com/moby/moby/blob/d8f079deb6d9faf6540b756baec42ede172cd53e/daemon/containerd/image_list.go#L356-L358

vvoland avatar Aug 23 '24 09:08 vvoland

Added changelog description.

cpuguy83 avatar Aug 26 '24 17:08 cpuguy83

Opened backport PR here: https://github.com/moby/moby/pull/48842

cpuguy83 avatar Nov 09 '24 17:11 cpuguy83

Cross-referencing, in case this patch would make its way in other branches;

  • https://github.com/moby/moby/issues/49075
  • https://github.com/moby/moby/pull/49078

thaJeztah avatar Dec 12 '24 13:12 thaJeztah