integreat-cms icon indicating copy to clipboard operation
integreat-cms copied to clipboard

Race conditions in tree operations causing Internal Server Errors

Open timobrembeck opened this issue 2 years ago • 21 comments

Describe the Bug

When content forms are submitted several times in succession too quickly, the same translation version can be created multiple times.

Steps to Reproduce

Send edit requests multiple times after another (for me, it didn't work to push the "save" button multiple times, but I could simulate this beaviour and we know that it already happened in the production system, so somehow this problem definitely exists)

Maybe, this is a combination of autosave and the real saving process?

Expected Behavior

There should only exist exactly one database object for each translation version.

Actual Behavior

In some cases, a version can be created multiple times, which breaks the CMS functionality in multiple places.

timobrembeck avatar Jun 01 '22 11:06 timobrembeck

As a first approach we could deactivate the form buttons after they got clicked. That won't really hurt and we can eliminate this possible source for the problem.

ulliholtgrave avatar Jun 29 '22 12:06 ulliholtgrave

Can you specify that a bit more that we can priorize it in our service team?

dkehne avatar Jul 20 '22 09:07 dkehne

Can you specify that a bit more that we can priorize it in our service team?

We already released a quick fix so no duplicate versions are created and an internal server error is shown instead. If you received no complaints from cities that they get errors when trying to save pages, this does not seem to be a big problem anymore. But we can also evaluate our log files to estimate how frequently this error occurs.

timobrembeck avatar Jul 20 '22 09:07 timobrembeck

However, this is probably the same problem that created the two Landshuts yesterday. As this error always requires some manual cleanup I would prioritize it rather high.

ulliholtgrave avatar Jul 20 '22 09:07 ulliholtgrave

For reference: This is the corresponding issue in our tree library which could have an impact on this: https://github.com/django-treebeard/django-treebeard/issues/53

timobrembeck avatar Sep 17 '22 09:09 timobrembeck

One example traceback during link replacement:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 98, in dispatch
    return handler(request, *args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/views/linkcheck/linkcheck_list_view.py", line 179, in post
    new_translation.save()
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/models/abstract_content_translation.py", line 470, in save
    super().save(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
    updated = self._save_table(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
    return manager._insert(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
    result = self._no_monkey.execute(self, sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL:  Key (page_id, language_id, version)=(9652, 1, 3) already exists.

Traceback in page form:

    super().save(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
    updated = self._save_table(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
    return manager._insert(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
    result = self._no_monkey.execute(self, sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL:  Key (page_id, language_id, version)=(15753, 3, 18) already exists.

timobrembeck avatar Sep 17 '22 13:09 timobrembeck

As far as I understand there are two things that go wrong:

  • Treebeard does ignore the atomic view. It does not use the Django ORM but directly interacts with the database. It could honor the atomic view decorator and the issue would be solved.
  • Treebeard itself does not use a transaction for the multiple database updates it performs.

Is this correct?

svenseeberg avatar Sep 17 '22 13:09 svenseeberg

Yes, however converting the library to using the Django ORM is probably unrealistic (even for a large upstream contribution), and I don't know how a custom mutex would be implemented here... maybe the same way as django-linkcheck would suffice?

https://github.com/DjangoAdminHackers/django-linkcheck/blob/8cb97b165b1d26f498896c8cda61f26d98f7a8ec/linkcheck/init.py#L7

timobrembeck avatar Sep 17 '22 13:09 timobrembeck

maybe the same way as django-linkcheck would suffice?

True, this would probably solve most issues that do currently arise. There are some edge cases (crashing processes, full hard disk, ...) for which a real transaction would be even better. But these are probably extremely rare in comparison to those race conditions.

svenseeberg avatar Sep 17 '22 15:09 svenseeberg

I grepped the logs today. We have only 12 of these messages in 14 days. I think it is okay to remove the "urgent" label.

svenseeberg avatar Dec 06 '22 08:12 svenseeberg

Removing the urgent label gave us bad karma it seems: in the last week we had broken tree structures in the database. It seems that changes in the tree structure were aborted right in the middle of a change. While there was no error in the application log, the solution seems to be the same as this issue here: we need transactions in Tree Beard.

svenseeberg avatar Dec 14 '22 11:12 svenseeberg

As mentioned in https://github.com/digitalfabrik/integreat-cms/pull/2082#issuecomment-1438535513, I'm pretty sure this is not completely resolved yet and needs further attention.

timobrembeck avatar Feb 21 '23 14:02 timobrembeck

Okay, maybe I have to stop referencing the issue in commits that I force push ;-)

svenseeberg avatar Feb 21 '23 14:02 svenseeberg

As mentioned in #2082 (comment), I'm pretty sure this is not completely resolved yet and needs further attention.

To make sure I understand what you're getting at: While the move() method is called whenever a page is saved, the page object is additionally being saved by the Django ORM. That means it touches the lft, rgt & tree_id values, potentially at the same time when another request/process has the class lock acquired. That would basically mean we need to acquire the lock when calling the save() method?

svenseeberg avatar Feb 22 '23 07:02 svenseeberg

I think the culprit is the somewhat confusing MoveNodeForm which is also instantiated in the normal page form. This also calls e.g. the model methods add_root()/add_child()/add_sibling(), even sometimes when the position of a page is not changed. (Because it always uses sibling nodes to remember the position of a page, a move from "left sibling of second sub-page" to "first child of parent page" causes the database queries to be executed, even if they do not change the position. But still, they have the potential of breaking something when running concurrently.)

So there are places where raw SQL queries are executed besides save() and move(), e.g. here. Another argument why this should reside in the library rather than in our CMS :sweat_smile:

I'm not completely sure which place would make the most sense here, we could e.g. overwrite the save()-method of our CustomTreeNodeForm? This will however cause dead locks in some cases, because the move() method is called somewhere in the sub-steps of the form's save() method... :cry:

Is there a way of checking whether the lock is already acquired by the currently executing code? Probably not, right? :see_no_evil:

So maybe we have to rigorously duplicate the mutex code for every model method which touches the database?

timobrembeck avatar Feb 22 '23 13:02 timobrembeck

Right. I expect we should see at least a decrease in our error rate with the current solution. If that is the case I'd be confident enough in the solution to create an upstream PR that fixes the problem for all queries.

svenseeberg avatar Feb 22 '23 14:02 svenseeberg

Quoting from the other issue:

Now that we have the mutex, I think we can definitely say that we can barely see a difference to before. We currently have 18 duplicate key violations in our log files. AFAICT this is roughly the same amount we had before we introduced the locking mechanism. I think we should remove the mutex and need to rethink the cause of the duplicate key violations.

I'm actually not sure anymore, I could also imagine that page moving was never really the problem and most of the race conditions come from normal page form saving (as described above)... If that's the case, then the lack of reduction of errors does not necessarily mean that django-db-mutex is not working, but maybe only that we're using it in the wrong places. Before discarding that library completely, we could also try to wrap all the node model methods into the mutex as well.

timobrembeck avatar Mar 26 '23 21:03 timobrembeck

Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables

timobrembeck avatar May 08 '23 14:05 timobrembeck

https://github.com/django-treebeard/django-treebeard/issues/278 could be relevant to solve this issue.

svenseeberg avatar Sep 26 '23 16:09 svenseeberg

Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables

Personally, I dislike for the idea of creating tables per region. That happened in WP and always was tedious to work with.

svenseeberg avatar Sep 27 '23 07:09 svenseeberg