
Deleting events causes database corruption

Open · richvdh opened this issue 2 years ago · 4 comments

Deleting events - whether you do it via the purge_history admin API, the delete room admin API, or by enabling retention on your server - has the potential to corrupt the database.

In short, the delete operations can race against other operations (notably event persistence), leading the database to get into an inconsistent state. The main reason is that the delete operations do not invalidate the in-memory caches, but more generally there are a number of race conditions (a toy sketch of this kind of race follows the examples below).

Examples include:

  • https://github.com/matrix-org/synapse/issues/9864
  • https://github.com/matrix-org/synapse/issues/11521
  • non-existent event-forward-extremities (https://github.com/matrix-org/synapse/issues/12507#issuecomment-1142058611), or outlier forward extremities (#13026)
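
To make the shape of that race concrete, here is a toy sketch. It is illustrative only - the fake "events table", the cache, and the coroutines are made up and bear no relation to Synapse's actual code: a purge deletes the row, but a previously populated in-memory cache keeps serving the event afterwards.

import asyncio

# Illustrative only: a fake events "table" and an in-memory event cache.
db = {"$event_a": {"content": "hello"}}
cache = {}

async def read_event(event_id):
    # Serve from the cache if possible, otherwise read the DB and cache it.
    if event_id in cache:
        return cache[event_id]
    event = db.get(event_id)
    if event is not None:
        cache[event_id] = event
    return event

async def purge(event_id):
    # The purge deletes the database row but never touches `cache`.
    db.pop(event_id, None)

async def main():
    await read_event("$event_a")         # populates the cache
    await purge("$event_a")              # row is gone, cache entry survives
    # Code that consults the cache still "sees" the event even though the
    # events table no longer has it, so anything derived from it (forward
    # extremities, state, ...) can end up referencing rows that don't exist.
    print(await read_event("$event_a"))  # -> {'content': 'hello'}

asyncio.run(main())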

richvdh avatar Aug 08 '22 12:08 richvdh

Other related issues:

  • https://github.com/matrix-org/synapse/issues/11738
  • https://github.com/matrix-org/synapse/issues/12880 (though I don't think that's database corruption, it's just stuff not working)

richvdh avatar Aug 08 '22 12:08 richvdh

Is it a solution to delete the entire cache before deleting? And would that make sense?

dklimpel avatar Aug 08 '22 12:08 dklimpel

Is this related to these?

  • #12821
  • #3364

dklimpel avatar Aug 08 '22 13:08 dklimpel

Is it a solution to delete the entire cache before deleting? And would that make sense?

Sadly that's not sufficient, since the cache might be immediately repopulated before the delete actually happens.

Is this related to these?

I wouldn't say so. Those are cases of the history purge deleting less than it should; they don't cause inconsistency between Synapse and the database, and can't put incorrect data into the database.

richvdh avatar Aug 08 '22 18:08 richvdh

the delete operations can race against other operations (notably event persistence)

Incidentally: I think purging a complete room with no local members should be safe from races, but there are still cache coherency bugs (#11521).

richvdh avatar Aug 15 '22 18:08 richvdh

I am a homeserver admin affected by this - the database is ever-growing and deleting messages older than 90 days is a requirement (privacy-aware NGO). Anything I can do to help get this issue resolved?

rettichschnidi avatar Aug 18 '22 13:08 rettichschnidi

Something like https://github.com/matrix-org/synapse/issues/13916 would help in the case of correctly invalidating the EventsWorkerStore._get_event_cache and EventsWorkerStore._event_ref caches upon fully purging a room. Currently, those caches aren't keyed by room ID, and thus invalidating all event entries related to a room in those caches is not possible.

This leads to coherency bugs such as https://github.com/matrix-org/synapse/issues/11521.
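
For illustration, a toy version of what #13916 asks for might look like the following - a hypothetical cache class (names invented, not Synapse's actual _get_event_cache) that keeps a secondary room_id -> event_ids index so an entire room can be invalidated at once:

from collections import defaultdict

class RoomIndexedEventCache:
    """Toy event cache that can drop every entry for a given room.

    Alongside the main event_id -> event mapping we keep a secondary
    room_id -> {event_ids} index, so purging a room can invalidate all of
    its cached events without scanning the whole cache.
    """

    def __init__(self):
        self._events = {}                    # event_id -> event
        self._room_index = defaultdict(set)  # room_id  -> {event_id, ...}

    def set(self, room_id, event_id, event):
        self._events[event_id] = event
        self._room_index[room_id].add(event_id)

    def get(self, event_id):
        return self._events.get(event_id)

    def invalidate_room(self, room_id):
        # Called after a room purge: drop every cached event for the room.
        for event_id in self._room_index.pop(room_id, set()):
            self._events.pop(event_id, None)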

anoadragon453 avatar Oct 18 '22 08:10 anoadragon453

We had a massive increase in the database in the last 15 days, about 50 GB. Now it's over 102 GB with about 200 users (20-30 really use it).

We had retention enabled (before the massive increase) to minimize history for users and end up with more free storage.

Now we are having trouble getting rid of some rooms, especially the first three:

           room_id           |  num_rows
-----------------------------+------------
 Matrix HQ:matrix.org        |  400673076
 BlackArch:matrix.org        |    7139888
 Arch Linux:archlinux.org    |    4855957

The #matrix:matrix.org room is deleted and not listed via the admin API anymore but is still in the database. Any chance to get rid of the few rows? :)

Is this problem related to this issue here? We have blocked the Matrix HQ room for our users until we can solve the problem.

Dan-Sun avatar Oct 28 '22 01:10 Dan-Sun

A little off-topic, but I wonder how matrix.org handles this massive amount of data? The constantly increasing storage requirement is the main reason I don't "idle" in huge public rooms with my HS account. I really hope the core team prioritizes this issue, because after many years the storage requirement is starting to get out of hand :cry:

4nd3r avatar Oct 28 '22 10:10 4nd3r

The #matrix:matrix.org room is deleted and not listed via the admin API anymore but is still in the database.

@Dan-Sun How exactly did you delete the room?

DMRobertson avatar Oct 28 '22 10:10 DMRobertson

@DMRobertson I tried API v1 and v2 (also using force_purge), and both told me there were still users in that room who could not be kicked out. One of them was a deactivated user, so I reactivated that user and tried to leave the room manually and via the API, which ended with an internal server error. I asked Travis to try to kick the user for me. I don't know what actually happened in the end, but the user was no longer in the room and I ran API v2 again; it quickly finished the process. And that was really about it. I've been running the API v2 command every 2 seconds since yesterday. Sometimes it shows that it is still purging, but the row count doesn't change.

So far we have 7 rooms which are still in the database but not listed via the API.

Leaving API error

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 512, in _async_render
    callback_return = await raw_callback_return
  File "/usr/lib/python3.10/site-packages/synapse/rest/client/room.py", line 995, in on_POST
    await self.room_member_handler.update_membership(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
    result = await self.update_membership_locked(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
    return await self._local_membership_update(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
    result_event = await self.event_creation_handler.handle_new_client_event(
  File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
    event = await self.persist_and_notify_client_events(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$9n5rfTl0fNUdVLztun-SdVdWd_zSYjKhgR60p7SUxQs'

Delete room error

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room.py", line 1879, in shutdown_room
    _, stream_id = await self.room_member_handler.update_membership(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
    result = await self.update_membership_locked(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
    return await self._local_membership_update(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
    result_event = await self.event_creation_handler.handle_new_client_event(
  File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
    event = await self.persist_and_notify_client_events(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: 'KEY'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/pagination.py", line 677, in _shutdown_and_purge_room
    raise SynapseError(
synapse.api.errors.SynapseError: 400: Users are still joined to this room

At the moment we're having these errors:

2022-10-25 01:49:48,161 - synapse.federation.federation_server - 1216 - ERROR - _process_incoming_pdus_in_room_inner-5694 - Failed to handle PDU $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY
Traceback (most recent call last):
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 276, in on_receive_pdu
       await self._get_missing_events_for_pdu(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 752, in _get_missing_events_for_pdu
       await self._process_pulled_events(origin, missing_events, backfilled=False)
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 800, in _process_pulled_events
       await self._process_pulled_event(origin, ev, backfilled=backfilled)
    File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 875, in _process_pulled_event 
       await self._process_received_pdu(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 1281, in _process_received_pdu
       await self._run_push_actions_and_persist_event(event, context, backfilled)
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2125, in _run_push_actions_and_persist_event
       await self.persist_events_and_notify(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2186, in persist_events_and_notify
       ) = await self._storage_controllers.persistence.persist_events(  
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
       ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
     File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
       raise dfe.subFailure.value from None
     File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
       result = context.run(
     File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
       return g.throw(self.type, self.value, self.tb)
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
       return await self._event_persist_queue.add_to_queue(
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
       res = await make_deferred_yieldable(end_item.deferred.observe())
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
       ret = await self._per_item_callback(room_id, item.task)
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
       return await self._persist_event_batch(room_id, task)
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
       res = await self._get_new_state_after_events(
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
       old_state_groups = {
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
       event_id_to_state_group[evid] for evid in old_latest_event_ids
   KeyError: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/lib/python3.10/site-packages/synapse/federation/federation_server.py", line 1206, in _process_incoming_pdus_in_room_inner
       await self._federation_event_handler.on_receive_pdu(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 280, in on_receive_pdu
       raise Exception(
   Exception: Error fetching missing prev_events for $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'

And a lot of these:

2022-10-25 01:56:11,270 - synapse.handlers.device - 1129 - WARNING - _maybe_retry_device_resync-114 - Failed to handle device list update for @USER: 403: Forbidden
2022-10-25 01:56:11,388 - synapse.http.matrixfederationclient - 709 - WARNING - federation_transaction_transmission_loop-254477 - {PUT-O-2497351} [otherServer] Request failed: PUT matrix://otherServer/_matrix/federation/v1/send/1666633335267: HttpResponseException('403: Forbidden')

Dan-Sun avatar Oct 28 '22 19:10 Dan-Sun

Dunno if that would be technically possible, but if we could have a maintenance mode (which takes the above into account and so stops almost all communication, federation, etc.), the retention tasks could be done safely, no?

I would gladly accept these regular "downtimes" to avoid the mentioned issues, but disabling retention is impossible for me.

steadfasterX avatar Feb 02 '23 11:02 steadfasterX

I think we can mitigate this a lot by taking the following steps:

  • [x] Drop all caches related to events when we purge history. We should also purge state caches when we delete a room. c.f. #15609 for WIP PR.
  • [ ] Before inserting into the event_forward_extremities/event_backward_extremities/etc tables, check that the events still exist in the DB (by doing another select); we already do this when inserting new state groups, for example (see the sketch just after this list).
    • We should ideally go through the list of tables that have an event ID column and do the same thing there.
    • If the check fails we should raise an error. This will likely cause us to drop the events entirely: for clients that means returning an error, for federation it means dropping them.
    • These checks could also take the form of adding a foreign key constraint to the relevant columns
    • [x] #15751
  • [x] Add a cross-worker read/write style lock to ensure that we don't purge history while: 1) creating/persisting an event, 2) backfilling, 3) handling new inbound federation events. Note the cross-worker locks are best effort, so they need to be combined with the check above to be safe. This would take two steps:
    • [x] Add support for waiting on locks to the current mechanism. The easiest way is to sit in a loop that tries to acquire the lock every e.g. 100ms, and add a hook to the notifier/replication to notify when locks are dropped (a polling sketch appears at the end of this comment).
    • [x] Add support for read/write locks (trying to ensure reads are fast). This would likely need a different table to the existing lock mechanism.
  • [x] Fix POST /read_markers to handle the case where existing read markers point to deleted events. We try and read the event to stop clients from replacing read markers with older events, so it's probably safe to assume that if we no longer have the existing event then it's sufficiently old that we can skip the check. #15464
  • [x] Fix push after purging history #12880. There may be a similar issue with calculating notification counts.
    • [x] https://github.com/matrix-org/sytest/pull/1361 - Already done, we just need to add a regression test
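
As a rough sketch of the second item above (the existence check before inserting extremities): the table name matches Synapse's schema, but the helper, its signature, and the sqlite-style placeholders are purely illustrative, not Synapse's internal storage API. The foreign-key variant mentioned in the same item (visible later in this thread as the event_forward_extremities_event_id constraint) makes the database enforce the same invariant itself.

def insert_forward_extremities(txn, room_id, event_ids):
    """Insert forward extremities, refusing events that no longer exist.

    `txn` is assumed to be a DB-API cursor running in the same transaction
    as the insert, so a purge committed in between cannot slip through.
    """
    if not event_ids:
        return

    # Re-check which of the candidate events are still present.
    placeholders = ", ".join("?" for _ in event_ids)
    txn.execute(
        f"SELECT event_id FROM events WHERE event_id IN ({placeholders})",
        event_ids,
    )
    existing = {row[0] for row in txn.fetchall()}

    missing = set(event_ids) - existing
    if missing:
        # The events were purged underneath us: raise (and so drop the
        # incoming events) rather than insert dangling extremities.
        raise RuntimeError(f"events no longer in the DB: {missing}")

    txn.executemany(
        "INSERT INTO event_forward_extremities (room_id, event_id) VALUES (?, ?)",
        [(room_id, event_id) for event_id in event_ids],
    )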

Note: there are likely more races here that we don't know about, and even if there weren't, I don't think implementing all the above steps will necessarily fix all the known brokenness. It should massively reduce the frequency of the problems though.
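
And a minimal sketch of the "wait for the lock by retrying" approach from the lock item above, using asyncio purely for brevity. Synapse's real cross-worker locks are database-backed and Twisted-based; `try_acquire` here is a stand-in for whatever attempts to take the lock and returns None if another worker currently holds it.

import asyncio

async def acquire_lock_with_wait(try_acquire, poll_interval=0.1):
    """Keep retrying `try_acquire` (e.g. every 100ms) until it succeeds.

    `try_acquire` is an async callable returning a lock object on success
    or None if the lock is currently held elsewhere. A replication hook
    that pings waiters when locks are released could replace the sleep.
    """
    while True:
        lock = await try_acquire()
        if lock is not None:
            return lock
        await asyncio.sleep(poll_interval)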

erikjohnston avatar May 17 '23 08:05 erikjohnston

Hi,

I understand the change in #15609 will help in terms of reducing the race conditions on the DB (correct me if I am wrong).

When upgrading to Synapse 1.83.0, we faced a situation where DELETE statements on the events table were blocked, which was not the case before with Synapse 1.59.0.

Bearing in mind that the purge jobs can cause database corruption, we are disabling the retention policy.

Do you advise that we can re-activate the retention policy after #15609?

mcalinghee avatar Jun 08 '23 13:06 mcalinghee

I just stumbled across this issue this morning, when I had some outages on my server. Is there anything I can do to find the affected rooms and/or to fix the database again - except restoring a backup?

samuel-p avatar Jul 15 '23 09:07 samuel-p

Instead of deleting from the DB, could events just be marked as purged and filtered out when accessed through API calls? "Deactivated" events, something similar to deactivated users. Pagination should stop once it reaches a purged event in a timeline.

toshanmugaraj avatar Aug 01 '23 07:08 toshanmugaraj

@toshanmugaraj what you're describing are effectively "redacted" events. These are events that maintain their position in the room's Directed Acyclic Graph (DAG), yet the actual message content and other sensitive data have been removed.

Taking this approach to removing data indeed sidesteps the problems in this issue, but this issue is talking about deleting events entirely -- something you may still want to do to save on disk space. The message retention feature in Synapse does currently delete events entirely. That makes sense for the space-saving benefit, but is overkill if you're just looking to remove sensitive data from your database.

There is currently work being done by the team to close this issue, meaning that even deleting an event entirely shouldn't cause database corruption. It's a difficult thing to test for success however, as the problem is mainly caused by race conditions.

@samuel-p It's difficult to give general advice on the subject, as everyone's case is different. You can try deleting the affected room entirely, then restarting your server to clear caches. You can then try to rejoin the room (if it was a room with other homeservers in it). Any power levels your users had in the room will remain intact.

anoadragon453 avatar Aug 01 '23 10:08 anoadragon453

I have this issue, and I see no way to fix it in the above.

ktpx avatar Aug 25 '23 23:08 ktpx

As @DMRobertson announced earlier this month:

In other news, for the last few releases Erik has been working to fix database corruption problems relating to room retention and purging historical events. He put out a call for testing in the Synapse Announcements room last Friday. It is not risk-free, but if anyone is feeling bold and is prepared to handle potential data loss, we would appreciate testing and feedback.

If you're following this issue then do please give it a test and let us know if you have issues.


@erikjohnston said:

I'm fairly confident that the bugs we know about have been fixed

the problem is all the races we don't know about

OTOH, if no one is complaining then hopefully that is good enough

clokep avatar Sep 25 '23 17:09 clokep

how to test?

steadfasterX avatar Sep 26 '23 07:09 steadfasterX

I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed:

synapse_1   | 2023-10-16 08:31:45,686 - synapse.federation.federation_server - 1258 - ERROR - _process_incoming_pdus_in_room_inner-558 - Failed to handle PDU $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4
synapse_1   | Traceback (most recent call last):
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 285, in on_receive_pdu
synapse_1   |     await self._get_missing_events_for_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 804, in _get_missing_events_for_pdu
synapse_1   |     await self._process_pulled_events(origin, missing_events, backfilled=False)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 937, in _process_pulled_events
synapse_1   |     await _process_new_pulled_events(fresh_events)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 894, in _process_new_pulled_events
synapse_1   |     await self._process_pulled_event(origin, ev, backfilled=backfilled)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1000, in _process_pulled_event
synapse_1   |     await self._process_received_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1441, in _process_received_pdu
synapse_1   |     await self._run_push_actions_and_persist_event(event, context, backfilled)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2215, in _run_push_actions_and_persist_event
synapse_1   |     await self.persist_events_and_notify(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2276, in persist_events_and_notify
synapse_1   |     ) = await self._storage_controllers.persistence.persist_events(
synapse_1   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 421, in persist_events
synapse_1   |     ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
synapse_1   |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/util/async_helpers.py", line 307, in yieldable_gather_results
synapse_1   |     raise dfe.subFailure.value from None
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1993, in _inlineCallbacks
synapse_1   |     result = context.run(
synapse_1   |              ^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
synapse_1   |     return g.throw(self.type, self.value, self.tb)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 416, in enqueue
synapse_1   |     return await self._event_persist_queue.add_to_queue(
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 239, in add_to_queue
synapse_1   |     res = await make_deferred_yieldable(end_item.deferred.observe())
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 282, in handle_queue_loop
synapse_1   |     ret = await self._per_item_callback(room_id, item.task)
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 362, in _process_event_persist_queue_task
synapse_1   |     return await self._persist_event_batch(room_id, task)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 680, in _persist_event_batch
synapse_1   |     res = await self._get_new_state_after_events(
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 880, in _get_new_state_after_events
synapse_1   |     old_state_groups = {
synapse_1   |                        ^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 881, in <setcomp>
synapse_1   |     event_id_to_state_group[evid] for evid in old_latest_event_ids
synapse_1   |     ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
synapse_1   | KeyError: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'
synapse_1   | 
synapse_1   | The above exception was the direct cause of the following exception:
synapse_1   | 
synapse_1   | Traceback (most recent call last):
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1248, in _process_incoming_pdus_in_room_inner
synapse_1   |     await self._federation_event_handler.on_receive_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 289, in on_receive_pdu
synapse_1   |     raise Exception(
synapse_1   | Exception: Error fetching missing prev_events for $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'

schildbach avatar Oct 16 '23 08:10 schildbach

I also got the missing prev_events error after purge history API calls. Running Synapse 1.93.0. It seems the problem still exists. I posted a message in Synapse Admins, but I guess it got buried in other messages.

The first error in the log after the purge history is:

2023-10-08 12:39:42,836 - synapse.federation.federation_server - 1260 - ERROR - _process_incoming_pdus_in_room_inner-10514 - Failed to handle PDU $VLE4qs4mVWxD2Wup11xH0lqXINR9nb8CFo_Oys4GpgM
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1250, in _process_incoming_pdus_in_room_inner
    await self._federation_event_handler.on_receive_pdu(
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 328, in on_receive_pdu
    await self._process_received_pdu(origin, pdu, context)
...
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 205, in commit
    self.conn.commit()
psycopg2.errors.ForeignKeyViolation: insert or update on table "event_forward_extremities" violates foreign key constraint "event_forward_extremities_event_id"
DETAIL:  Key (event_id)=($0WM-Av1vnFc_yJFA-X_d3MWqJY5mpW0mAomWXtkcm8w) is not present in table "events".

Maybe it is time to reopen this issue.

Wenri avatar Oct 18 '23 04:10 Wenri

I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed…

I remedied my problem by force-removing the affected room. There is a certain chance this particular event is a left-over from a time when the issue wasn't yet fixed.

I'll let you know if it crops up again.

schildbach avatar Oct 18 '23 10:10 schildbach

It just cropped up again, after the usual database maintenance.

schildbach avatar Dec 03 '23 23:12 schildbach