
Possible Linux kernel lock contention when running multiple judgedaemons per machine

Open taoky opened this issue 1 year ago • 3 comments

Description of the problem

When rejudging a large contest, or when receiving many submissions for problems with many testcases, some submissions can take much longer in wall time than in CPU time. With a short timelimit overshoot, these submissions might be judged as TLE even if they are correct.

This is what actually happened in a recent ICPC Asia Regional Contest (with ~350 teams and an easy problem with 50 testcases). After a lot of time spent bisecting the kernel and debugging, it turned out that lock contention on two global locks (shrinker_rwsem and cgroup_mutex) in kernels < 6.3 can, under heavy load, block kernel operations such as cgroup operations and page fault handling inside a memory cgroup for several seconds.

(This is fixed (or alleviated) after kernel commit https://github.com/torvalds/linux/commit/da27f796a832122ee533c7685438dad1c4e338dd)
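For admins who want a quick check of their judgehosts, here is a minimal sketch that compares the running kernel release against 6.3, the version named above. The helper names (`parse_kernel_release`, `may_be_affected`) are made up for illustration and are not part of DOMjudge. The version number is only a rough indicator, since distribution kernels may carry backported patches.

```python
import platform
import re

# Assumption taken from the issue text: kernels before 6.3 may be affected.
FIXED_IN = (6, 3)


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(release: str) -> bool:
    """Return True if the release is older than the version named in the issue."""
    return parse_kernel_release(release) < FIXED_IN


if __name__ == "__main__":
    release = platform.release()
    try:
        affected = may_be_affected(release)
    except ValueError:
        print(f"Could not parse kernel release {release!r}")
    else:
        if affected:
            print(f"Kernel {release} is older than 6.3; lock contention is possible")
        else:
            print(f"Kernel {release} is 6.3 or newer")
```

Running this on each judgehost before a contest gives a first hint; it does not replace a load test like the one described under "Steps to reproduce".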

Although judgedaemon (runguard) cannot "fix" this issue in code, mentioning the kernel issue in the documentation could be helpful for server admins.

Your environment

  • DOMjudge/Webserver: any compatible version
  • OS: Ubuntu 22.04 with kernel 5.15 (default) or 6.2 (latest generic kernel in jammy repo)
  • Tested in a KVM guest with 32 cores and 21 or 30 judgedaemons, and on a bare-metal server with 2 CPUs (40 cores) and 21 judgedaemons.

Steps to reproduce

Submit a correct solution many times at once, for example (fish shell syntax):

for i in $(seq 1 1000); ~/Downloads/domjudge-8.2.2/submit/submit --url http://localhost:12345/ --contest test -y G.cpp; end

And wait for it to be done.

Expected behaviour

Reasonable judgehost system load, and no submission takes a wall time much longer than its CPU time.

Actual behaviour

Judgehost system load >= 2 * the number of judgedaemons. With timelimit overshoot set to 1s|10%, some submissions are judged as TLE even though they use very little CPU time. Judging is also very slow.
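To make the arithmetic concrete, here is a small sketch of how an overshoot setting such as 1s|10% turns into a hard wall-time limit. It assumes that `|` selects the maximum of the two parts and that a run is killed once its wall time exceeds the timelimit plus the overshoot; both are my reading of the setting, not taken from the DOMjudge source. The function names are hypothetical and only this subset of the syntax is handled.

```python
def overshoot_seconds(timelimit: float, spec: str) -> float:
    """Evaluate an overshoot spec such as '1s', '10%' or '1s|10%'.

    Assumption: '|' means "maximum of the parts". Other separators
    are deliberately not handled in this sketch.
    """
    def part(text: str) -> float:
        text = text.strip()
        if text.endswith("s"):
            return float(text[:-1])                      # absolute seconds
        if text.endswith("%"):
            return timelimit * float(text[:-1]) / 100.0  # share of the timelimit
        raise ValueError(f"unsupported overshoot part: {text!r}")

    return max(part(p) for p in spec.split("|"))


def hard_wall_limit(timelimit: float, spec: str) -> float:
    """Wall time after which a run would be killed, under the assumptions above."""
    return timelimit + overshoot_seconds(timelimit, spec)


def killed_by_wall_time(wall_time: float, timelimit: float, spec: str) -> bool:
    """True if the run's wall time exceeds the assumed hard limit."""
    return wall_time > hard_wall_limit(timelimit, spec)
```

With a 2 s timelimit, `hard_wall_limit(2.0, "1s|10%")` is 3 s under these assumptions, so a run that needs only a fraction of a second of CPU time but stalls in the kernel for several seconds would exceed it.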

Any other information that you want to share?

https://github.com/DOMjudge/domjudge/pull/2157 mentions that "the call cgroup_delete_cgroup_ext did sometimes hang for multiple seconds". I'm afraid that the rejudging for this contest may need to be double-checked to ensure that no correct solutions were judged as TLE...
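Such a double check could start from a filter like the sketch below, which picks out runs that received a TLE verdict although their CPU time stayed within the timelimit. The `Run` record, its field names, and the verdict string "timelimit" are hypothetical placeholders; the real data would have to be exported from the contest's judging results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one testcase run; not a DOMjudge data structure."""
    submission_id: int
    verdict: str      # assumed verdict label, e.g. "correct" or "timelimit"
    cpu_time: float   # seconds
    wall_time: float  # seconds


def suspicious_tle(runs: list[Run], timelimit: float) -> list[Run]:
    """Return TLE runs whose CPU time was within the limit.

    These are candidates for manual review or rejudging, not proof
    that the verdict was wrong.
    """
    return [
        run for run in runs
        if run.verdict == "timelimit" and run.cpu_time <= timelimit
    ]


if __name__ == "__main__":
    sample = [
        Run(1, "correct", 0.30, 0.35),
        Run(2, "timelimit", 0.31, 3.40),  # little CPU time, long wall time
        Run(3, "timelimit", 2.90, 3.10),  # genuinely slow solution
    ]
    for run in suspicious_tle(sample, timelimit=2.0):
        print(run.submission_id, run.cpu_time, run.wall_time)
```

Any submission flagged this way would still need to be rejudged on an unloaded judgehost before changing its verdict.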

If you are interested in this specific kernel issue, I have also written a blog post (in Simplified Chinese) to explain it to the contestants affected in this regional contest, and to server admins running later contests.

taoky avatar Dec 06 '23 07:12 taoky

Thanks a lot for this big write-up. We normally advise against running many judgehosts on one machine (since there will always be some overhead), but it might indeed be worth mentioning this explicitly.

nickygerritsen avatar Dec 06 '23 08:12 nickygerritsen

Since you mentioned that disabling CLONE_NEWIPC would fix this issue, how about using seccomp to restrict IPC-related syscalls rather than creating an IPC namespace?

summershrimp avatar Dec 06 '23 09:12 summershrimp

> Since you mentioned that disabling CLONE_NEWIPC would fix this issue, how about using seccomp to restrict IPC-related syscalls rather than creating an IPC namespace?

Theoretically yes, but it would be a bit difficult to list all IPC-related syscalls, and the potential side effects of using seccomp are unknown.

taoky avatar Dec 06 '23 13:12 taoky