integreat-cms
integreat-cms copied to clipboard
Race conditions in tree operations causing Internal Server Errors
Describe the Bug
When content forms are submitted several times in succession too quickly, the same translation version can be created multiple times.
Steps to Reproduce
Send edit requests multiple times after another (for me, it didn't work to push the "save" button multiple times, but I could simulate this beaviour and we know that it already happened in the production system, so somehow this problem definitely exists)
Maybe, this is a combination of autosave and the real saving process?
Expected Behavior
There should only exist exactly one database object for each translation version.
Actual Behavior
In some cases, a version can be created multiple times, which breaks the CMS functionality in multiple places.
As a first approach we could deactivate the form buttons after they got clicked. That won't really hurt and we can eliminate this possible source for the problem.
Can you specify that a bit more that we can priorize it in our service team?
Can you specify that a bit more that we can priorize it in our service team?
We already released a quick fix so no duplicate versions are created and an internal server error is shown instead. If you received no complaints from cities that they get errors when trying to save pages, this does not seem to be a big problem anymore. But we can also evaluate our log files to estimate how frequently this error occurs.
However, this is probably the same problem that created the two Landshuts yesterday. As this error always requires some manual cleanup I would prioritize it rather high.
For reference: This is the corresponding issue in our tree library which could have an impact on this: https://github.com/django-treebeard/django-treebeard/issues/53
One example traceback during link replacement:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 70, in view
return self.dispatch(request, *args, **kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 98, in dispatch
return handler(request, *args, **kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/views/linkcheck/linkcheck_list_view.py", line 179, in post
new_translation.save()
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/models/abstract_content_translation.py", line 470, in save
super().save(*args, **kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
self.save_base(using=using, force_insert=force_insert,
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
updated = self._save_table(
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
return manager._insert(
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
return query.get_compiler(using=using).execute_sql(returning_fields)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
cursor.execute(sql, params)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
result = self._no_monkey.execute(self, sql, params)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL: Key (page_id, language_id, version)=(9652, 1, 3) already exists.
Traceback in page form:
super().save(*args, **kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
self.save_base(using=using, force_insert=force_insert,
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
updated = self._save_table(
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
return manager._insert(
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
return query.get_compiler(using=using).execute_sql(returning_fields)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
cursor.execute(sql, params)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
result = self._no_monkey.execute(self, sql, params)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL: Key (page_id, language_id, version)=(15753, 3, 18) already exists.
As far as I understand there are two things that go wrong:
- Treebeard does ignore the atomic view. It does not use the Django ORM but directly interacts with the database. It could honor the atomic view decorator and the issue would be solved.
- Treebeard itself does not use a transaction for the multiple database updates it performs.
Is this correct?
Yes, however converting the library to using the Django ORM is probably unrealistic (even for a large upstream contribution), and I don't know how a custom mutex would be implemented here... maybe the same way as django-linkcheck would suffice?
https://github.com/DjangoAdminHackers/django-linkcheck/blob/8cb97b165b1d26f498896c8cda61f26d98f7a8ec/linkcheck/init.py#L7
maybe the same way as django-linkcheck would suffice?
True, this would probably solve most issues that do currently arise. There are some edge cases (crashing processes, full hard disk, ...) for which a real transaction would be even better. But these are probably extremely rare in comparison to those race conditions.
I grepped the logs today. We have only 12 of these messages in 14 days. I think it is okay to remove the "urgent" label.
Removing the urgent label gave us bad karma it seems: in the last week we had broken tree structures in the database. It seems that changes in the tree structure were aborted right in the middle of a change. While there was no error in the application log, the solution seems to be the same as this issue here: we need transactions in Tree Beard.
As mentioned in https://github.com/digitalfabrik/integreat-cms/pull/2082#issuecomment-1438535513, I'm pretty sure this is not completely resolved yet and needs further attention.
Okay, maybe I have to stop referencing the issue in commits that I force push ;-)
As mentioned in #2082 (comment), I'm pretty sure this is not completely resolved yet and needs further attention.
To make sure I understand what you're getting at: While the move()
method is called whenever a page is saved, the page object is additionally being saved by the Django ORM. That means it touches the lft
, rgt
& tree_id
values, potentially at the same time when another request/process has the class lock acquired. That would basically mean we need to acquire the lock when calling the save()
method?
I think the culprit is the somewhat confusing MoveNodeForm
which is also instantiated in the normal page form.
This also calls e.g. the model methods add_root()
/add_child()
/add_sibling()
, even sometimes when the position of a page is not changed. (Because it always uses sibling nodes to remember the position of a page, a move from "left sibling of second sub-page" to "first child of parent page" causes the database queries to be executed, even if they do not change the position. But still, they have the potential of breaking something when running concurrently.)
So there are places where raw SQL queries are executed besides save()
and move()
, e.g. here.
Another argument why this should reside in the library rather than in our CMS :sweat_smile:
I'm not completely sure which place would make the most sense here, we could e.g. overwrite the save()
-method of our CustomTreeNodeForm
? This will however cause dead locks in some cases, because the move()
method is called somewhere in the sub-steps of the form's save()
method... :cry:
Is there a way of checking whether the lock is already acquired by the currently executing code? Probably not, right? :see_no_evil:
So maybe we have to rigorously duplicate the mutex code for every model method which touches the database?
Right. I expect we should see at least a decrease in our error rate with the current solution. If that is the case I'd be confident enough in the solution to create an upstream PR that fixes the problem for all queries.
Quoting from the other issue:
Now that we have the mutex, I think we can definitely say that we can barely see a difference to before. We currently have 18 duplicate key violations in our log files. AFAICT this is roughly the same amount we had before we introduced the locking mechanism. I think we should remove the mutex and need to rethink the cause of the duplicate key violations.
I'm actually not sure anymore, I could also imagine that page moving was never really the problem and most of the race conditions come from normal page form saving (as described above)... If that's the case, then the lack of reduction of errors does not necessarily mean that django-db-mutex is not working, but maybe only that we're using it in the wrong places. Before discarding that library completely, we could also try to wrap all the node model methods into the mutex as well.
Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables
https://github.com/django-treebeard/django-treebeard/issues/278 could be relevant to solve this issue.
Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables
Personally, I dislike for the idea of creating tables per region. That happened in WP and always was tedious to work with.