atop
broken atopacct blocks atop indefinitely
Hi,
when something goes wrong in atopacct, it keeps holding a system-wide semaphore, which causes subsequent calls to atop to stall indefinitely in:
getuid() = 1000
setresuid(-1, 1000, -1) = 0
semtimedop(1, [{0, -1, SEM_UNDO}, {1, -1, SEM_UNDO}], 2, NULL
This happens when the Debian package is installed on an s390x system. Unfortunately, I don't have root on that system and therefore cannot see what atopacct does when it happens. The other architectures Debian builds for are fine.
Therefore, this issue has two parts:
- atopacct should not block the semaphore on s390x systems
- atop itself should time out and terminate with a meaningful error message if it cannot obtain the semaphore
Due to this issue, atop will be removed from Debian testing next week.
Have you tried clearing the atopacct state to resolve the issue? Something like: mv /var/run/pacct_shadow.d{,.orig} && systemctl start atopacct
The main problem is that I don't see this behavior on any box I have immediate shell access to. I cannot try anything there short of writing a test case, building that test case into an official package, and uploading this package to Debian. I'd really like to avoid that.
The real showstopper is that atop waits indefinitely and silently for the semaphore until the test is aborted with a timeout. As I wrote in the original bug report, we have two problems there that should both be addressed.
Marc
Part 2 of the issue has been solved: atop times out after waiting 3 seconds for the semaphore and then continues without process accounting.
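For readers following along, a bounded wait of this kind can be sketched with semtimedop() and a non-NULL timeout. This is only an illustration, not the code used in atop: the helper name claim_with_timeout is hypothetical, the two operations simply mirror the ones visible in the strace output above, and the sketch assumes Linux with glibc.

```c
#define _GNU_SOURCE             /* semtimedop() is a Linux/glibc extension */
#include <errno.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper (not atop's actual function): try to decrement
 * semaphores 0 and 1 of the set `semid`, waiting at most `seconds`.
 * Returns 0 on success, 1 on timeout, -1 on any other error.
 */
int claim_with_timeout(int semid, time_t seconds)
{
    struct sembuf ops[2] = {
        { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO },
        { .sem_num = 1, .sem_op = -1, .sem_flg = SEM_UNDO },
    };
    struct timespec ts = { .tv_sec = seconds, .tv_nsec = 0 };

    if (semtimedop(semid, ops, 2, &ts) == 0)
        return 0;               /* both semaphores claimed */

    if (errno == EAGAIN) {      /* the time limit expired */
        fprintf(stderr, "semaphore not available after %ld s, "
                        "continuing without process accounting\n",
                        (long)seconds);
        return 1;
    }

    perror("semtimedop");       /* e.g. EINTR, EINVAL, EIDRM */
    return -1;
}
```

A caller would pass 3 as the second argument to get the 3-second limit described above and fall back to running without process accounting when the helper returns 1.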
I do not understand part 1 of the issue: between claiming the semaphore in atopacctd and releasing it, there are no blocking calls. Even if atopacctd were to terminate after claiming the semaphore, the SEM_UNDO flag takes care of releasing the semaphore automatically.
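The SEM_UNDO behavior described here can be checked in isolation with a small test program. The sketch below is not taken from atopacctd. The function name value_after_child_exit and the union semun_arg are made up for the example, and it assumes a Linux system with System V IPC available. A child process claims a semaphore and exits without releasing it, and the parent then reads the semaphore value.

```c
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The caller has to define the semctl() argument union itself. */
union semun_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Hypothetical test helper: create a semaphore with value 1, let a child
 * claim it with the given flags and exit without releasing it, then
 * return the semaphore value seen by the parent (-1 on error).
 */
int value_after_child_exit(short flags)
{
    int semid = semget(IPC_PRIVATE, 1, 0600);
    if (semid < 0)
        return -1;

    union semun_arg arg = { .val = 1 };
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        struct sembuf claim = { .sem_num = 0, .sem_op = -1, .sem_flg = flags };
        semop(semid, &claim, 1);
        _exit(0);               /* exit while still holding the semaphore */
    }

    waitpid(pid, NULL, 0);      /* kernel undo processing runs at child exit */
    int val = semctl(semid, 0, GETVAL);
    semctl(semid, 0, IPC_RMID); /* remove the test semaphore set */
    return val;
}
```

With SEM_UNDO the value should be back at 1 after the child has exited. With flags set to 0 it should stay at 0, which is the kind of stale claim that would block later callers.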
Is it possible for you to gain root privileges on the test system to issue a system call trace with strace to see where atopacctd blocks?
I currently don't even have shell access to the (only) test box that shows the behavior. I'm trying to find out whether atop 2.8.1 passes the test, as the timing is really tight to get atop back into Debian testing (Debian is planning to freeze). I apologize for not having prioritized this properly.