Deleting events causes database corruption
Deleting events - whether you do it via the purge_history admin API, the delete room admin API, or by enabling retention on your server - has the potential to corrupt the database.
In short, the delete operations can race against other operations (notably event persistence), leaving the database in an inconsistent state. The main reason is that the delete operations do not invalidate the in-memory caches, but in general there are a bunch of race conditions.
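For concreteness, the purge_history admin API mentioned above is invoked roughly like this (a minimal sketch using Python requests; the endpoint and body fields follow the Synapse admin API docs, while the homeserver URL, access token, room ID and cutoff are placeholders):

```python
import time
import requests

# Placeholders: point these at your own homeserver and an admin access token.
HOMESERVER = "https://matrix.example.org"
ADMIN_TOKEN = "YOUR_ADMIN_ACCESS_TOKEN"
ROOM_ID = "!someroom:example.org"

# Ask Synapse to purge history older than 90 days in this room.
# delete_local_events=False (the default) keeps events sent by local users.
resp = requests.post(
    f"{HOMESERVER}/_synapse/admin/v1/purge_history/{ROOM_ID}",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={
        "delete_local_events": False,
        "purge_up_to_ts": int((time.time() - 90 * 24 * 3600) * 1000),
    },
)
resp.raise_for_status()
print(resp.json())  # returns a purge_id that can be polled for status
```

It is exactly this kind of purge running concurrently with event persistence that the races below are about.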
Examples include:
- https://github.com/matrix-org/synapse/issues/9864
- https://github.com/matrix-org/synapse/issues/11521
- non-existent event-forward-extremities (https://github.com/matrix-org/synapse/issues/12507#issuecomment-1142058611), or outlier forward extremities (#13026)
Other related issues:
- https://github.com/matrix-org/synapse/issues/11738
- https://github.com/matrix-org/synapse/issues/12880 (though I don't think that's database corruption, it's just stuff not working)
Is it a solution to delete the entire cache before deleting? And would that make sense?
Is this related to?
- #12821
- #3364
> Is it a solution to delete the entire cache before deleting? And would that make sense?
Sadly that's not sufficient, since the cache might be immediately repopulated before the delete actually happens.
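To illustrate: a concurrent read can repopulate the cache between the invalidation and the row deletion, leaving the cache serving an event that no longer exists in the database. A minimal sketch of the interleaving (purely illustrative, not Synapse's actual code paths):

```python
# Stand-ins for the events table and the in-memory event cache.
db = {"$event_a": {"type": "m.room.message"}}
cache: dict[str, dict] = {}

def get_event(event_id):
    # Reader path: consult the cache, fall back to the DB and repopulate.
    if event_id not in cache and event_id in db:
        cache[event_id] = db[event_id]
    return cache.get(event_id)

def purge_event(event_id):
    # Purge path: invalidate the cache first, then delete the row.
    cache.pop(event_id, None)
    get_event(event_id)     # <- a badly timed concurrent read lands here
    db.pop(event_id, None)  # the row is deleted, but the cache was refilled

purge_event("$event_a")
print("$event_a" in db)     # False: the row is gone
print("$event_a" in cache)  # True: the cache still serves the deleted event
```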
> Is this related to?
I wouldn't say so. Those are the history purge deleting less stuff than they should; they do not cause inconsistency between Synapse and the database and cannot cause incorrect data in the database.
> the delete operations can race against other operations (notably event persistence)
Incidentally: I think purging a complete room with no local members should be safe from races, but there are still cache coherency bugs (#11521).
I am a homeserver admin affected by this - the database is ever-growing and deleting messages older than 90 days is a requirement (we are a privacy-aware NGO). Is there anything I can do to help get this issue resolved?
Something like https://github.com/matrix-org/synapse/issues/13916 would help with correctly invalidating the `EventsWorkerStore._get_event_cache` and `EventsWorkerStore._event_ref` caches upon fully purging a room. Currently, those caches aren't keyed by room ID, and thus invalidating all event entries related to a room in those caches is not possible.
This leads to coherency bugs such as https://github.com/matrix-org/synapse/issues/11521.
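A sketch of what room-keyed invalidation could look like (illustrative only; these are not the actual EventsWorkerStore structures): keep a secondary index from room ID to the event IDs currently cached for it, so that purging a room can drop all of its entries.

```python
from collections import defaultdict

# Illustrative only: an event-ID-keyed cache plus a room_id -> event_ids
# index, which is what makes per-room invalidation possible.
event_cache: dict[str, dict] = {}
room_index: defaultdict[str, set[str]] = defaultdict(set)

def cache_event(room_id: str, event_id: str, event: dict) -> None:
    event_cache[event_id] = event
    room_index[room_id].add(event_id)

def invalidate_room(room_id: str) -> None:
    # With only an event-ID-keyed cache this operation isn't possible;
    # the extra index lets us drop exactly the entries for this room.
    for event_id in room_index.pop(room_id, set()):
        event_cache.pop(event_id, None)
```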
We had a massive increase in the database over the last 15 days, about 50 GB. It is now over 102 GB, with about 200 users (20-30 of whom really use it).
We had retention enabled (before the massive increase) to keep users' history small and save storage in the long run.
Now we are having trouble getting rid of some rooms, especially the first three:
```
          room_id          |  num_rows
---------------------------+-----------
 Matrix HQ:matrix.org      | 400673076
 BlackArch:matrix.org      |   7139888
 Arch Linux:archlinux.org  |   4855957
```
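(For anyone wanting to reproduce this breakdown, a grouping query over the events table along these lines should give similar numbers - a sketch assuming a PostgreSQL database, with placeholder connection details:)

```python
import psycopg2

# Placeholder DSN: adjust for your own Synapse database.
conn = psycopg2.connect("dbname=synapse user=synapse_user host=localhost")
with conn, conn.cursor() as cur:
    # Rough per-room row counts in the events table, largest first.
    cur.execute(
        """
        SELECT room_id, count(*) AS num_rows
        FROM events
        GROUP BY room_id
        ORDER BY num_rows DESC
        LIMIT 10
        """
    )
    for room_id, num_rows in cur.fetchall():
        print(room_id, num_rows)
```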
The #matrix:matrix.org room is deleted and no longer listed via the admin API, but it is still in the database. Any chance to get rid of those few rows? :)
Is this problem related to this issue here? We have blocked the Matrix room for our users until we can solve the problem.
A little off-topic, but I wonder how matrix.org handles this massive amount of data? The constantly increasing storage requirement is the main reason I don't "idle" in huge public rooms using my HS account. I really hope the core team prioritizes this issue, because after many years the storage requirement is starting to get out of hand :cry:
> The #matrix:matrix.org room is deleted and no longer listed via the admin API, but it is still in the database.
@Dan-Sun How exactly did you delete the room?
@DMRobertson
I tried API v1 and v2 (also using `force_purge`), and both told me that there were still users in the room who could not be kicked out. One of them was a deactivated user. So I reactivated that user and tried to leave the room manually and via the API, which ended with an internal server error.
I asked Travis to try to kick the user for me. I don't know what actually happened in the end, but the user was no longer in the room and I ran API v2 again. It finished the process quickly. And that was really about it. Right now I've been running the API v2 command every 2 seconds since yesterday. Sometimes it shows that it is still purging, but the row count doesn't change.
So far we have 7 rooms which are still in the database but are no longer listed via the API.
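(For reference, the v2 delete room API being re-run here looks roughly like this - a sketch using Python requests, with the endpoint and the delete_status polling taken from the Synapse admin API docs, and the homeserver URL, token and room ID as placeholders:)

```python
import requests

HOMESERVER = "https://matrix.example.org"   # placeholder
ADMIN_TOKEN = "YOUR_ADMIN_ACCESS_TOKEN"     # placeholder
ROOM_ID = "!someroom:example.org"           # placeholder
HEADERS = {"Authorization": f"Bearer {ADMIN_TOKEN}"}

# Kick local users, block the room and purge it from the database;
# force_purge removes it even if some local users cannot be kicked.
resp = requests.delete(
    f"{HOMESERVER}/_synapse/admin/v2/rooms/{ROOM_ID}",
    headers=HEADERS,
    json={"block": True, "purge": True, "force_purge": True},
)
resp.raise_for_status()
delete_id = resp.json()["delete_id"]

# The v2 API is asynchronous, so poll the status rather than re-running it.
status = requests.get(
    f"{HOMESERVER}/_synapse/admin/v2/rooms/{ROOM_ID}/delete_status",
    headers=HEADERS,
)
print(delete_id, status.json())
```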
Leaving API error
```
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
callback_return = await self._async_render(request)
File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 512, in _async_render
callback_return = await raw_callback_return
File "/usr/lib/python3.10/site-packages/synapse/rest/client/room.py", line 995, in on_POST
await self.room_member_handler.update_membership(
File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
result = await self.update_membership_locked(
File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
return await self._local_membership_update(
File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
result_event = await self.event_creation_handler.handle_new_client_event(
File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
r = await func(self, *args, **kwargs)
File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
result, _ = await make_deferred_yieldable(
File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
result = context.run(gen.send, result)
File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
event = await self.persist_and_notify_client_events(
File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
) = await self._storage_controllers.persistence.persist_events(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
raise dfe.subFailure.value from None
File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
result = context.run(
File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
return await self._event_persist_queue.add_to_queue(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
res = await make_deferred_yieldable(end_item.deferred.observe())
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
ret = await self._per_item_callback(room_id, item.task)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
return await self._persist_event_batch(room_id, task)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
res = await self._get_new_state_after_events(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
old_state_groups = {
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$9n5rfTl0fNUdVLztun-SdVdWd_zSYjKhgR60p7SUxQs'
```
Delete room error
```
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/synapse/handlers/room.py", line 1879, in shutdown_room
_, stream_id = await self.room_member_handler.update_membership(
File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
result = await self.update_membership_locked(
File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
return await self._local_membership_update(
File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
result_event = await self.event_creation_handler.handle_new_client_event(
File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
r = await func(self, *args, **kwargs)
File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
result, _ = await make_deferred_yieldable(
File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
result = context.run(gen.send, result)
File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
event = await self.persist_and_notify_client_events(
File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
) = await self._storage_controllers.persistence.persist_events(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
raise dfe.subFailure.value from None
File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
result = context.run(
File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
return await self._event_persist_queue.add_to_queue(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
res = await make_deferred_yieldable(end_item.deferred.observe())
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
ret = await self._per_item_callback(room_id, item.task)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
return await self._persist_event_batch(room_id, task)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
res = await self._get_new_state_after_events(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
old_state_groups = {
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: 'KEY'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/synapse/handlers/pagination.py", line 677, in _shutdown_and_purge_room
raise SynapseError(
synapse.api.errors.SynapseError: 400: Users are still joined to this room
```
At the moment we're having these errors:
```
2022-10-25 01:49:48,161 - synapse.federation.federation_server - 1216 - ERROR - _process_incoming_pdus_in_room_inner-5694 - Failed to handle PDU $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 276, in on_receive_pdu
await self._get_missing_events_for_pdu(
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 752, in _get_missing_events_for_pdu
await self._process_pulled_events(origin, missing_events, backfilled=False)
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 800, in _process_pulled_events
await self._process_pulled_event(origin, ev, backfilled=backfilled)
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 875, in _process_pulled_event
await self._process_received_pdu(
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 1281, in _process_received_pdu
await self._run_push_actions_and_persist_event(event, context, backfilled)
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2125, in _run_push_actions_and_persist_event
await self.persist_events_and_notify(
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2186, in persist_events_and_notify
) = await self._storage_controllers.persistence.persist_events(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
raise dfe.subFailure.value from None
File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
result = context.run(
File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
return await self._event_persist_queue.add_to_queue(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
res = await make_deferred_yieldable(end_item.deferred.observe())
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
ret = await self._per_item_callback(room_id, item.task)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
return await self._persist_event_batch(room_id, task)
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
res = await self._get_new_state_after_events(
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
old_state_groups = {
File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/synapse/federation/federation_server.py", line 1206, in _process_incoming_pdus_in_room_inner
await self._federation_event_handler.on_receive_pdu(
File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 280, in on_receive_pdu
raise Exception(
Exception: Error fetching missing prev_events for $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'
```
And a lot of these:
```
2022-10-25 01:56:11,270 - synapse.handlers.device - 1129 - WARNING - _maybe_retry_device_resync-114 - Failed to handle device list update for @USER: 403: Forbidden
2022-10-25 01:56:11,388 - synapse.http.matrixfederationclient - 709 - WARNING - federation_transaction_transmission_loop-254477 - {PUT-O-2497351} [otherServer] Request failed: PUT matrix://otherServer/_matrix/federation/v1/send/1666633335267: HttpResponseException('403: Forbidden')
```
Dunno if that would be technically possible, but if we could have a maintenance mode (which takes the above into account and so stops almost all communication, federation, etc.), the retention tasks could be done safely, no?
I would gladly accept such regular "downtimes" to avoid the issues mentioned, but disabling retention is impossible for me.
I think we can mitigate this a lot by taking the following steps:
- [x] Drop all caches related to events when we purge history. We should also purge state caches when we delete a room. See #15609 for a WIP PR.
- [ ] Before inserting into the `event_forward_extremities`/`event_backward_extremities`/etc. tables, check that the events still exist in the DB (by doing another select). We already do this when inserting new state groups, for example. (See the sketch after this list.)
  - We should ideally go through the list of tables that have an event ID column and do the same thing there.
  - If the check fails we should raise an error. This will likely cause us to drop the events entirely, which for clients means returning an error and for federation means dropping them.
  - These checks could also take the form of adding a foreign key constraint to the relevant columns.
- [x] #15751
- [x] Add a cross-worker read/write style lock to ensure that we don't purge history while: 1) creating/persisting an event, 2) backfilling, or 3) handling new inbound federation events. Note that the cross-worker locks are best effort, so they need to be combined with the check above to be safe. This would take two steps:
  - [x] Add support for waiting on locks to the current mechanism. The easiest way is to sit in a loop that tries to acquire the lock every e.g. 100ms, and add a hook to the notifier/replication to notify when locks are dropped.
  - [x] Add support for read/write locks (trying to ensure reads are fast). This would likely need a different table to the existing lock mechanism.
- [x] Fix `POST /read_markers` to handle the case where existing read markers point to deleted events. We try to read the event to stop clients from replacing read markers with older events, so it's probably safe to assume that if we no longer have the existing event then it's sufficiently old that we can skip the check. #15464
- [x] Fix push after purging history #12880. There may be a similar issue with calculating notification counts.
- [x] https://github.com/matrix-org/sytest/pull/1361 - Already done, we just need to add a regression test
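A minimal sketch of the existence check described in the second item above (illustrative only, not Synapse's actual persistence code; it assumes a psycopg2 cursor and the standard events / event_forward_extremities tables):

```python
def insert_forward_extremities(cur, room_id: str, event_ids: set[str]) -> None:
    """Record new forward extremities, refusing to reference purged events.

    Illustrative only: re-checking existence in the same transaction as the
    insert is what protects against a concurrent purge having deleted the
    rows since the events were last read from a (possibly stale) cache.
    """
    cur.execute(
        "SELECT event_id FROM events WHERE event_id = ANY(%s)",
        (list(event_ids),),
    )
    existing = {row[0] for row in cur.fetchall()}

    missing = event_ids - existing
    if missing:
        # Per the checklist: raise, which drops the offending events entirely.
        raise RuntimeError(f"events no longer exist in the DB: {missing}")

    for event_id in existing:
        cur.execute(
            "INSERT INTO event_forward_extremities (room_id, event_id) "
            "VALUES (%s, %s) ON CONFLICT DO NOTHING",
            (room_id, event_id),
        )
```

The alternative in the last sub-item is to let the database enforce the same invariant with a foreign key from event_forward_extremities.event_id to events.event_id, which appears to be the constraint showing up in the ForeignKeyViolation reported later in this thread.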
Note: there are likely more races here that we don't know about, and even if there weren't, I don't think implementing all the above steps will necessarily fix all the known brokenness. It should massively reduce the frequency of the problems, though.
Hi,
I understand that the change in #15609 will help in terms of reducing the race conditions on the DB (correct me if I am wrong).
When upgrading to Synapse 1.83.0, we faced a situation where the `DELETE` statements on the `events` table were blocked, whereas that was not the case with Synapse 1.59.0.
Having in mind that the purge jobs can cause database corruption, we are disabling the retention policy.
Do you advise that we can re-activate the retention policy after #15609?
I just stumbled across this issue this morning, when I had some outages on my server. Is there anything I can do to find the affected rooms and/or to fix the database again - except restoring a backup?
Instead of deleting from the DB, could events just be marked as purged and filtered out when accessed through an API call? "Deactivated" events, something similar to deactivated users. Pagination should stop once it reaches a purged event in a timeline.
@toshanmugaraj what you're describing are effectively "redacted" events. These are events that maintain their position in the room's Directed Acyclic Graph (DAG), yet the actual message content and other sensitive data have been removed.
Taking this approach to removing data indeed sidesteps the problems in this issue, but this issue is about deleting events entirely -- something you may still want to do to save on disk space. The message retention feature in Synapse does currently delete events entirely. That makes sense for the space-saving benefit, but it is overkill if you're just looking to remove sensitive data from your database.
There is currently work being done by the team to close this issue, meaning that even deleting an event entirely shouldn't cause database corruption. It's a difficult thing to test for success however, as the problem is mainly caused by race conditions.
@samuel-p It's difficult to give general advice on the subject, as everyone's case is different. You can try deleting the affected room entirely, then restarting your server to clear caches. You can then try to rejoin the room (if it was a room with other homeservers in it). Any power levels your users had in the room will remain intact.
I have this issue, and I see no way to fix it in the above.
As @DMRobertson announced earlier this month:
> In other news, for the last few releases Erik has been working to fix database corruption problems relating to room retention and purging historical events. He put out a call for testing in the Synapse Announcements room last Friday. It is not risk-free, but if anyone is feeling bold and is prepared to handle potential data loss, we would appreciate testing and feedback.
If you're following this issue then do please give it a test and let us know if you have issues.
@erikjohnston said:
> I'm fairly confident that the bugs we know about have been fixed
>
> the problem is all the races we don't know about
>
> OTOH, if no one is complaining then hopefully that is good enough
How to test?
I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed:
```
synapse_1 | 2023-10-16 08:31:45,686 - synapse.federation.federation_server - 1258 - ERROR - _process_incoming_pdus_in_room_inner-558 - Failed to handle PDU $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4
synapse_1 | Traceback (most recent call last):
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 285, in on_receive_pdu
synapse_1 | await self._get_missing_events_for_pdu(
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 804, in _get_missing_events_for_pdu
synapse_1 | await self._process_pulled_events(origin, missing_events, backfilled=False)
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 937, in _process_pulled_events
synapse_1 | await _process_new_pulled_events(fresh_events)
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 894, in _process_new_pulled_events
synapse_1 | await self._process_pulled_event(origin, ev, backfilled=backfilled)
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1000, in _process_pulled_event
synapse_1 | await self._process_received_pdu(
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1441, in _process_received_pdu
synapse_1 | await self._run_push_actions_and_persist_event(event, context, backfilled)
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2215, in _run_push_actions_and_persist_event
synapse_1 | await self.persist_events_and_notify(
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2276, in persist_events_and_notify
synapse_1 | ) = await self._storage_controllers.persistence.persist_events(
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1 | return await func(*args, **kwargs)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 421, in persist_events
synapse_1 | ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/util/async_helpers.py", line 307, in yieldable_gather_results
synapse_1 | raise dfe.subFailure.value from None
synapse_1 | File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1993, in _inlineCallbacks
synapse_1 | result = context.run(
synapse_1 | ^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
synapse_1 | return g.throw(self.type, self.value, self.tb)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 416, in enqueue
synapse_1 | return await self._event_persist_queue.add_to_queue(
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 239, in add_to_queue
synapse_1 | res = await make_deferred_yieldable(end_item.deferred.observe())
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 282, in handle_queue_loop
synapse_1 | ret = await self._per_item_callback(room_id, item.task)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 362, in _process_event_persist_queue_task
synapse_1 | return await self._persist_event_batch(room_id, task)
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 680, in _persist_event_batch
synapse_1 | res = await self._get_new_state_after_events(
synapse_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 880, in _get_new_state_after_events
synapse_1 | old_state_groups = {
synapse_1 | ^
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 881, in <setcomp>
synapse_1 | event_id_to_state_group[evid] for evid in old_latest_event_ids
synapse_1 | ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
synapse_1 | KeyError: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'
synapse_1 |
synapse_1 | The above exception was the direct cause of the following exception:
synapse_1 |
synapse_1 | Traceback (most recent call last):
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1248, in _process_incoming_pdus_in_room_inner
synapse_1 | await self._federation_event_handler.on_receive_pdu(
synapse_1 | File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 289, in on_receive_pdu
synapse_1 | raise Exception(
synapse_1 | Exception: Error fetching missing prev_events for $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'
```
I also got the missing prev_events error after purge_history API calls. I'm running Synapse 1.93.0, so it seems the problem still exists. I posted a message in Synapse Admins, but I guess it got buried among other messages.
The first error in the log after the history purge is:
```
2023-10-08 12:39:42,836 - synapse.federation.federation_server - 1260 - ERROR - _process_incoming_pdus_in_room_inner-10514 - Failed to handle PDU $VLE4qs4mVWxD2Wup11xH0lqXINR9nb8CFo_Oys4GpgM
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1250, in _process_incoming_pdus_in_room_inner
await self._federation_event_handler.on_receive_pdu(
File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 328, in on_receive_pdu
await self._process_received_pdu(origin, pdu, context)
...
File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 205, in commit
self.conn.commit()
psycopg2.errors.ForeignKeyViolation: insert or update on table "event_forward_extremities" violates foreign key constraint "event_forward_extremities_event_id"
DETAIL: Key (event_id)=($0WM-Av1vnFc_yJFA-X_d3MWqJY5mpW0mAomWXtkcm8w) is not present in table "events".
```
Maybe it is time to reopen this issue.
> I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed…
I remedied my problem by force-removing the affected room. There is a certain chance that this particular event is a leftover from a time when the issue wasn't yet fixed.
I'll let you know if it crops up again.
It just cropped up again, after the usual database maintenance.