
Change Select JudgeServerLogic

Open rrutwik opened this issue 3 years ago • 7 comments

Change Select JudgeServerLogic to prevent deadlock.

servers = JudgeServer.objects.select_for_update().filter(is_disabled=False).order_by("task_number")
servers = [s for s in servers if s.status == "normal"]
for server in servers:  # throws a deadlock error if another thread changes task_number and the order changes

rrutwik avatar Nov 17 '21 07:11 rrutwik

How is the deadlock caused?

Beichi-CHs avatar Nov 20 '21 08:11 Beichi-CHs

This query, followed by the loop over its results (for server in servers:):

JudgeServer.objects.select_for_update().filter(is_disabled=False).order_by("task_number")

It will result in a deadlock if the order changes because some other thread changed task_number.

rrutwik avatar Nov 20 '21 08:11 rrutwik
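
One way to read the explanation above: each worker locks rows in the order its own query returned them, and that order can differ between workers when task_number changes concurrently. The pure-Python sketch below involves no Django and no database; the function name can_deadlock and the server ids are hypothetical. It only checks whether two lock orders disagree on some pair of rows, which is the usual precondition for a lock-order deadlock.

```python
from itertools import combinations


def can_deadlock(order_a, order_b):
    """Return True if two lock-acquisition orders disagree on some pair of rows."""
    position_in_b = {row: i for i, row in enumerate(order_b)}
    # Rows both workers will lock, listed in worker A's order.
    shared = [row for row in order_a if row in position_in_b]
    # combinations() keeps A's order, so x is always locked before y by A.
    for x, y in combinations(shared, 2):
        if position_in_b[x] > position_in_b[y]:
            # B locks y before x: each worker can end up holding one row
            # while waiting for the other.
            return True
    return False


# Worker A saw the servers sorted by task_number as 1, 2, 3.
# Worker B ran the same query after task_number changed and saw 2, 1, 3.
print(can_deadlock([1, 2, 3], [2, 1, 3]))  # True
print(can_deadlock([1, 2, 3], [1, 2, 3]))  # False
```

Ordering by a key that never changes, such as the primary key, makes every worker lock rows in the same sequence. Whether that fits this scheduler is a separate question.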

filter(is_disabled=False, last_heartbeat__gt=health_time).annotate(percent=ExpressionWrapper((1.0000 * F('task_number')) / F('cpu_core'), output_field=FloatField())).order_by("percent")

Will this select the judge server that is the fastest and has the fewest tasks?

Beichi-CHs avatar Nov 20 '21 08:11 Beichi-CHs

Yes, the one with the lowest ratio of current total tasks to cores.

rrutwik avatar Nov 20 '21 12:11 rrutwik
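
To make the ratio concrete, here is a minimal pure-Python sketch of the same selection rule, outside the ORM. The ServerInfo class, the pick_least_loaded function, and the sample numbers are hypothetical; only the field names task_number and cpu_core come from the query above.

```python
from dataclasses import dataclass


@dataclass
class ServerInfo:
    name: str
    task_number: int  # tasks currently assigned to the server
    cpu_core: int     # number of CPU cores on the server


def pick_least_loaded(servers):
    """Return the server with the lowest task_number / cpu_core ratio, or None."""
    # Skip servers reporting zero cores to avoid dividing by zero.
    candidates = [s for s in servers if s.cpu_core > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.task_number / s.cpu_core)


servers = [
    ServerInfo("a", task_number=4, cpu_core=8),  # ratio 0.5
    ServerInfo("b", task_number=3, cpu_core=4),  # ratio 0.75
    ServerInfo("c", task_number=1, cpu_core=1),  # ratio 1.0
]
print(pick_least_loaded(servers).name)  # a
```

Server "a" wins despite having the most tasks, because the rule compares load per core rather than the raw task count.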

Thank you for your contribution, but I still do not understand the reason for the deadlock. Could you give me more information about it, for example the database deadlock log, the Django error log, and how to reproduce the bug?

virusdefender avatar Nov 21 '21 12:11 virusdefender

There was a whole error log in the dramatiq logs, but I guess it stores only the 10 most recent files. Here is a log from the gunicorn logs, which is caused by this:

DETAIL: Process 2904 waits for ExclusiveLock on tuple (2,32) of relation 16635 of database 16384; blocked by process 2834.
Process 2834 waits for ShareLock on transaction 26060303; blocked by process 1859.
Process 1859 waits for ShareLock on transaction 26060317; blocked by process 2656.
Process 2656 waits for AccessExclusiveLock on tuple (2,32) of relation 16635 of database 16384; blocked by process 2904.
HINT: See server log for query details.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
psycopg2.extensions.TransactionRollbackError: deadlock detected

To reproduce this bug, you need 3 to 4 judge servers (8 cores) and to flood them with 30 to 60 submissions per second. You will be able to find this error in the dramatiq logs. [2 judge servers would be okay, I guess, but 3-4 would be much better, as ordering by task_number will then throw more errors.]

This may help: https://stackoverflow.com/a/42731706

rrutwik avatar Nov 21 '21 13:11 rrutwik
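
The DETAIL message in the log describes a cycle of processes waiting on each other. As a rough illustration, the sketch below extracts the "waits for ... blocked by" pairs from that message with a regular expression and follows them back to the starting process. The helper names wait_edges and find_cycle are hypothetical; the DETAIL text is copied from the log.

```python
import re

DETAIL = (
    "Process 2904 waits for ExclusiveLock on tuple (2,32) of relation 16635 "
    "of database 16384; blocked by process 2834. "
    "Process 2834 waits for ShareLock on transaction 26060303; "
    "blocked by process 1859. "
    "Process 1859 waits for ShareLock on transaction 26060317; "
    "blocked by process 2656. "
    "Process 2656 waits for AccessExclusiveLock on tuple (2,32) of relation "
    "16635 of database 16384; blocked by process 2904."
)


def wait_edges(detail):
    """Map each waiting process id to the process id that blocks it."""
    pairs = re.findall(
        r"Process (\d+) waits for .*?blocked by process (\d+)",
        detail,
        flags=re.S,
    )
    return {int(waiter): int(blocker) for waiter, blocker in pairs}


def find_cycle(edges, start):
    """Follow blockers from start; return the cycle if it leads back to start."""
    path = []
    node = start
    while node in edges and node not in path:
        path.append(node)
        node = edges[node]
    return path if node == start else []


edges = wait_edges(DETAIL)
print(find_cycle(edges, 2904))  # [2904, 2834, 1859, 2656]
```

Four processes are involved, and two of them (2904 and 2656) wait on the same tuple (2,32).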

@virusdefender Hi, any updates on this PR?

rrutwik avatar Jul 23 '22 13:07 rrutwik