Sysbox does not clean up `/var/lib/sysbox/docker/`

Open shinji62 opened this issue 3 years ago • 24 comments

Hi @ctalledo and @rodnymolina, I am using sysbox on EKS 1.21, installed with the installer provided for 0.5.2. After some time we discovered that our nodes were filling up with old containers that were never cleaned up.

I checked the running containers using `runc list`, which returned 20 of them, but when I checked `/var/lib/sysbox/docker/` I found 770 directories.

Not sure what is wrong, but something is really weird.

This will quickly become a blocker for us because pods are deleted and created quite frequently for CI/CD.

shinji62 avatar Jul 12 '22 09:07 shinji62

After debugging via Slack with @shinji62, it seems the problem is that under some conditions sysbox misses the fact that the container was removed, and as a result does not remove the corresponding dir under `/var/lib/sysbox/docker`. It's not clear why this happens, but it's not frequent (otherwise we would have seen this issue previously). The sysbox test suite does not reproduce the issue.

Sysbox performs the detection of the container's rootfs removal via the fsnotify golang library. The sysbox-mgr component is responsible for doing this detection.

Looking at the fsnotify repo, seems there are some conditions under which the library may miss file/directory removal events. For example: https://github.com/fsnotify/fsnotify/issues/404

@shinji62 and I are experimenting with a test-build of sysbox-mgr that contains a potential fix for the problem. The fix is simple: rather than simply looking for fsnotify file/directory "REMOVE" events to check if the container's rootfs has been removed, sysbox-mgr checks if the rootfs is present (using lstat()) on every fsnotify event.

If this fix does not work, I have another fix in mind but it's a bit more complex to implement.

ctalledo avatar Jul 20 '22 01:07 ctalledo

The referenced fsnotify issue shouldn't be relevant in this context. Rename and remove events are not dropped, but other types of events may be dropped if the file has been removed before the event has been sent to the user.

This behavior will likely change in the near future.

My guess is that there's some circumstance under which the remove event isn't being processed correctly. I looked briefly at your usage of fsnotify, and didn't see any clear issues.

horahoradev avatar Jul 23 '22 22:07 horahoradev

> My guess is that there's some circumstance under which the remove event isn't being processed correctly. I looked briefly at your usage of fsnotify, and didn't see any clear issues.

Thanks @horahoradev; one thing I noticed reviewing the sysbox-mgr code is that the goroutine that processes the fsnotify event channel was then performing an action that in some scenarios could potentially take several seconds to complete, before it read from the fsnotify event channel again. I don't know if this could be causing fsnotify events to be dropped (i.e., if the frequency of such events is much higher than the rate at which they are being processed), but to avoid problems I created a local fix that dispatches the slow action in a separate goroutine. @shinji62 will be testing this soon I think.

ctalledo avatar Jul 24 '22 02:07 ctalledo