OnlineJudge
Change Select JudgeServerLogic
Change the JudgeServer selection logic to prevent a deadlock.
servers = JudgeServer.objects.select_for_update().filter(is_disabled=False).order_by("task_number")
servers = [s for s in servers if s.status == "normal"]
for server in servers: => this throws a deadlock error if the row order changes because another thread updated task_number in the meantime.
How is the deadlock caused?
This query => JudgeServer.objects.select_for_update().filter(is_disabled=False).order_by("task_number") [followed by: for server in servers: ]
results in a deadlock if the row order changes because another thread updated task_number in the meantime.
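The ordering problem described above can be illustrated with a small pure-Python model. This is only a sketch under assumptions: `Row`, `lock_order`, and `can_deadlock` are hypothetical names, and the model assumes each worker locks rows one by one in the order its own query returned them. It does not talk to a database.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    task_number: int


def lock_order(rows, key):
    """Return row ids in the order a worker would lock them (hypothetical model)."""
    return [r.id for r in sorted(rows, key=key)]


def can_deadlock(order_a, order_b):
    """True if two workers take at least one pair of rows in opposite order."""
    pos_b = {row_id: i for i, row_id in enumerate(order_b)}
    shared = [row_id for row_id in order_a if row_id in pos_b]
    return any(
        pos_b[x] > pos_b[y]
        for i, x in enumerate(shared)
        for y in shared[i + 1:]
    )


# Worker A sorts the rows before another thread bumps task_number on row 1.
snapshot_a = [Row(id=1, task_number=2), Row(id=2, task_number=5)]
# Worker B sorts the same rows after the update.
snapshot_b = [Row(id=1, task_number=6), Row(id=2, task_number=5)]

by_tasks_a = lock_order(snapshot_a, key=lambda r: r.task_number)
by_tasks_b = lock_order(snapshot_b, key=lambda r: r.task_number)
by_id_a = lock_order(snapshot_a, key=lambda r: r.id)
by_id_b = lock_order(snapshot_b, key=lambda r: r.id)

print(can_deadlock(by_tasks_a, by_tasks_b))
print(can_deadlock(by_id_a, by_id_b))
```

Sorting by a column that other threads keep changing lets two workers see opposite orders, while sorting by an immutable key such as the primary key keeps the order the same for everyone. Locking fewer rows per transaction is another way to shrink the window.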
filter(is_disabled=False, last_heartbeat__gt=health_time).annotate(percent=ExpressionWrapper((1.0000 * F('task_number')) / F('cpu_core'), output_field=FloatField())).order_by("percent")
So this will select the judge server that is the fastest and has the fewest tasks?
Yes, the one with the lowest ratio of current total tasks to CPU cores.
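For readers who want to see the selection rule outside the ORM, here is a minimal pure-Python sketch of the same idea: skip disabled or stale servers, then take the one with the lowest task_number / cpu_core ratio. The `Server` dataclass and `pick_server` function are hypothetical stand-ins for illustration, not code from this PR, and the sketch assumes cpu_core is always positive.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Server:
    name: str
    task_number: int
    cpu_core: int  # assumed to be > 0
    last_heartbeat: datetime
    is_disabled: bool = False


def pick_server(servers: List[Server], health_time: datetime) -> Optional[Server]:
    """Return the healthy server with the lowest tasks-per-core ratio, or None."""
    candidates = [
        s for s in servers
        if not s.is_disabled and s.last_heartbeat > health_time
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.task_number / s.cpu_core)


now = datetime(2020, 1, 1, 12, 0, 0)
health_time = now - timedelta(seconds=6)  # hypothetical heartbeat cutoff

servers = [
    Server("a", task_number=4, cpu_core=8, last_heartbeat=now),  # ratio 0.5
    Server("b", task_number=2, cpu_core=2, last_heartbeat=now),  # ratio 1.0
    Server("c", task_number=0, cpu_core=4,
           last_heartbeat=now - timedelta(minutes=5)),  # stale, skipped
]

chosen = pick_server(servers, health_time)
```

Server "b" has fewer tasks than "a" in absolute terms, but "a" has more spare capacity per core, which is why the ratio is used rather than the raw task count.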
Thank you for your contribution, but I still do not understand the reason for the deadlock. Could you give me more information about it? For example: the database deadlock log, the Django error log, or how to reproduce the bug.
There was a whole error log in the dramatiq logs, but I guess it stores only the 10 most recent files.
Here is a log from the gunicorn logs, which is caused by this.
DETAIL: Process 2904 waits for ExclusiveLock on tuple (2,32) of relation 16635 of database 16384; blocked by process 2834.
Process 2834 waits for ShareLock on transaction 26060303; blocked by process 1859.
Process 1859 waits for ShareLock on transaction 26060317; blocked by process 2656.
Process 2656 waits for AccessExclusiveLock on tuple (2,32) of relation 16635 of database 16384; blocked by process 2904.
HINT: See server log for query details.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/db/backends/utils.py", line 85, in _execute
    return self.cursor.execute(sql, params)
psycopg2.extensions.TransactionRollbackError: deadlock detected
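The DETAIL lines above describe a cycle of four processes waiting on each other. A small sketch like the following can make that cycle explicit by parsing the "Process X waits for ...; blocked by process Y" pairs from the log. The helper names `wait_for_graph` and `find_cycle` are hypothetical, and the regular expression is written only against the wording seen in this log.

```python
import re

# DETAIL text copied from the log above.
DETAIL = (
    "Process 2904 waits for ExclusiveLock on tuple (2,32) of relation 16635 "
    "of database 16384; blocked by process 2834. "
    "Process 2834 waits for ShareLock on transaction 26060303; "
    "blocked by process 1859. "
    "Process 1859 waits for ShareLock on transaction 26060317; "
    "blocked by process 2656. "
    "Process 2656 waits for AccessExclusiveLock on tuple (2,32) of relation "
    "16635 of database 16384; blocked by process 2904."
)

WAIT_RE = re.compile(r"Process (\d+) waits for [^;]*; blocked by process (\d+)")


def wait_for_graph(detail):
    """Map each waiting process id to the process id that blocks it."""
    return {int(waiter): int(blocker) for waiter, blocker in WAIT_RE.findall(detail)}


def find_cycle(graph, start):
    """Follow blocker links from start; return the cycle if it leads back to start."""
    path = []
    node = start
    while node in graph and node not in path:
        path.append(node)
        node = graph[node]
    return path if node == start else []


graph = wait_for_graph(DETAIL)
cycle = find_cycle(graph, 2904)
print(" -> ".join(str(pid) for pid in cycle + cycle[:1]))
```

Two of the four processes are waiting on the same tuple (2,32), which is consistent with several workers contending for the same JudgeServer row.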
To reproduce this bug, you need 3 to 4 (8-core) judge servers, and you need to flood them with 30 to 60 submissions per second. You will then find this error in the dramatiq logs. [2 judge servers would probably be enough, I guess, but 3 to 4 would be much better, because ordering by task_number will then trigger the error more often.]
This may help: https://stackoverflow.com/a/42731706
@virusdefender Hi, any updates on this PR?