
Split out mutable event content from event cache into new caches that are keyed by room ID


In the hopes of fixing https://github.com/matrix-org/synapse/issues/11521 and paving the way towards an immutable external event cache (https://github.com/matrix-org/synapse/issues/2123), a new architecture for the event cache is proposed.

The current state

Currently there are a couple of separate data structures related to caching event contents in memory in Synapse (see the code for fetching an event from the cache/DB here); a simplified sketch of their shapes follows the list:

  • EventsWorkerStore._get_event_cache - An instance of AsyncLruCache implemented as a map of EventID -> (EventBase, redacted EventBase | None).
    • This cache is populated after fetching an event from the database.
    • Entries in this cache are invalidated when an event is deleted from the database (in most cases, c.f. #11521), redacted or marked as rejected.
    • Entries in this cache are invalidated when the size limit of the LruCache is reached.
  • EventsWorkerStore._event_ref - A WeakValueDictionary which serves as a single point of reference for EventBase objects in memory, ensuring that we don't end up with multiple, unnecessary copies of a single EventBase in memory.
    • This data structure is populated after fetching an event from the database.
    • Because this is a WeakValueDictionary, entries in this cache are invalidated when all other references to the EventBase in an entry are gone.
    • Entries in this cache are invalidated when an event is deleted from the database (in most cases, c.f. #11521), redacted or marked as rejected.
    • Entries in this cache are not invalidated when an entry is evicted from EventsWorkerStore._get_event_cache, as something else may still be processing the event, even if it's been removed from that cache.
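
For illustration, here is a heavily simplified sketch of the shape of these two structures (plain Python stand-ins, not the actual Synapse definitions):

from typing import Dict, Optional, Tuple
from weakref import WeakValueDictionary

# _get_event_cache: event ID -> (event, redacted copy if any); really an AsyncLruCache.
_get_event_cache: Dict[str, Tuple["EventBase", Optional["EventBase"]]] = {}

# _event_ref: event ID -> EventBase; entries vanish once nothing else holds the event.
_event_ref: "WeakValueDictionary[str, EventBase]" = WeakValueDictionary()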

What's the problem?

See https://github.com/matrix-org/synapse/issues/11521; because each of these caches is keyed by EventID alone, it becomes tricky to invalidate them when all you have is a RoomID (i.e. when purging a room completely). We could query all known events for a room from the database, but that may result in millions of events. Ideally we'd have some map of RoomID -> EventID which only covers the events that are actually currently held in memory. We could then use that to invalidate both of these caches.

Additionally, as _get_event_cache contains mutable EventCacheEntry objects (comprising an EventBase and an optional redacted EventBase), they need to be invalidated whenever an event is redacted or marked as rejected. These statuses can differ per homeserver, so removing this component from the cache entries opens up avenues for multiple homeservers to share the same, immutable event cache.

Proposal

After speaking with @erikjohnston, we (mostly Erik :) came up with the following idea:

  • EventsWorkerStore._get_event_cache would simply become a map of EventID -> EventBase.
  • We add a separate cache which is a nested map of RoomID -> EventID -> {rejected_status: bool, redacted_event_content: Optional[EventBase]} (roughly sketched after this list).
    • Entries are added to this map when an event is pulled from the database. We know the RoomID at this point.
    • Entries are not invalidated from this map when an entry in EventsWorkerStore._get_event_cache is invalidated due to hitting the cache size limit.
    • This does mean that we'll need to know the RoomID when querying for rejected/redacted status though... But we can get that from the event cache?
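
A rough sketch of the proposed shapes, using plain dicts and an illustrative EventMetadata class (names and types here are not final):

from typing import Dict, Optional

import attr


@attr.s(slots=True, auto_attribs=True)
class EventMetadata:
    rejected_status: bool
    redacted_event_content: Optional["EventBase"]


# _get_event_cache becomes a plain event ID -> EventBase map (still an LruCache in practice).
_get_event_cache: Dict[str, "EventBase"] = {}

# New cache: room ID -> event ID -> mutable metadata for that event.
_event_metadata: Dict[str, Dict[str, EventMetadata]] = {}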

The beauty of this is that we no longer need to invalidate the _get_event_cache at all (unless the size limit is hit)! Even in the room purge use case! How? Here are some examples of using this system:

Fetch EventID A which is not in-memory

  1. Some calling function asks for EventID A.
  2. This does not exist in _get_event_cache (nor the other caches) so we query the database. The event and related metadata are fetched from the DB (event_json, redactions, rejections) and both the _get_event_cache and event metadata cache are populated.
  3. Return information from the database.

Fetch EventID A which is in-memory

  1. Some calling function asks for EventID A.
  2. This already exists in _get_event_cache, and presumably the metadata cache. We take the RoomID from the EventBase in the _get_event_cache and query the event metadata cache.
  3. Return information from both caches.

EventID A is in-memory but the event has been purged

  1. Some calling function asks for EventID A.
  2. This already exists in _get_event_cache. We take the RoomID from the EventBase in the cache and query the event metadata cache - but uh oh, there's no matching entry in the metadata cache! The event must have been purged.
  3. We invalidate the entry in the event cache as well and return None.

Thus when purging a room, we only need to purge entries in the metadata cache (which we can easily do by RoomID due to the metadata cache's structure). Entries in the _get_event_cache and _event_ref will be invalidated as they are fetched.
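
Assuming the illustrative _event_metadata map sketched above, purging a room would then look roughly like:

def purge_room_from_caches(room_id: str) -> None:
    # Dropping the room's sub-map invalidates every cached event in that room at once.
    # Stale entries left in _get_event_cache / _event_ref are evicted lazily, the next
    # time they are fetched and the missing metadata entry is noticed.
    _event_metadata.pop(room_id, None)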

I'm curious for thoughts on whether this sounds reasonable from other members of the Synapse team + cc @Fizzadar.

anoadragon453 avatar Sep 26 '22 19:09 anoadragon453

Questions:

  • You refer to the "metadata caches": is this the proposed new cache, or something else?
  • Are you proposing to retain EventsWorkerStore._event_ref or not?
  • Under your third use case "We invalidate the entry in the event cache as well": by event cache, you mean _get_event_cache? But you've said we never invalidate that?
  • It would be helpful to spell out the exact semantics of the properties of your new cache, especially redacted_event_content. I think you intend for it to be the same as that in EventCacheEntry, but that's not very well documented either.
  • Are you assuming that calling functions will also pass in a Room ID? Or are you proposing to look up in _get_event_cache (or the db) before looking in your new cache? In any case, walking through the different steps in each use case more explicitly would be helpful.
  • Your use cases talk about whether the event is in memory or not, but given you now have two caches with different expiry semantics, it's possible for an event to be in one cache and not the other. It might be better to think in terms of the different combinations here.
  • Could you work through a couple of other use cases; in particular the things that cause invalidation in the caches (redactions, rejections, purges)?

richvdh avatar Sep 28 '22 14:09 richvdh

I've finished a second write up which should hopefully be clearer and answer the above questions.

Proposal

Currently we have both a _get_event_cache LruCache of type dict of EventID -> EventCacheEntry and an _event_ref WeakValueDictionary of EventID -> EventBase. EventCacheEntry is defined as:

@attr.s(slots=True, auto_attribs=True)
class EventCacheEntry:
    event: EventBase
    redacted_event: Optional[EventBase]

Today _get_event_cache tracks both whether the event was redacted (EventCacheEntry.redacted_event) and whether it was rejected (EventCacheEntry.event.rejected_reason). I propose to modify the type of _get_event_cache to be an LruCache of type dict of simply EventID -> EventBase. Entries will now only need to be invalidated if the cache is full/entries expire or if the event changes in the database (it is censored, expires or is purged).

Expiration of an event occurs when the undocumented enable_ephemeral_messages config option is enabled (it's an implementation of MSC2228, self-destructing messages). "Censoring" an event happens after the event has been redacted and the specified redaction_retention_period completes. When an event expires or is censored, it is pruned in the database (we should probably use one term for both of these cases). Redaction and rejected status of an event will no longer be tracked by _get_event_cache. This will instead be recorded in a separate cache.

Therefore, we define a new cache: _event_metadata_cache. It is a double-nested dictionary of RoomID -> EventID -> EventMetadataCacheEntry. This dictionary has no inherent size or time-based invalidation strategies (it's not an instance of LruCache).

EventMetadataCacheEntry looks like:

@attr.s(slots=True, auto_attribs=True)
class EventMetadataCacheEntry:
    redacted_event: Optional[EventBase]
    rejected_reason: Optional[str]

If redacted_event is not None, then the associated event has been redacted. Likewise, if rejected_reason is not None, then the associated event was rejected.
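
For concreteness, a minimal sketch of how _event_metadata_cache might be declared and accessed, using the EventMetadataCacheEntry class above (the helper names are illustrative, not actual Synapse APIs):

from typing import Dict, Optional

# Plain nested dicts for illustration; room ID -> event ID -> entry.
_event_metadata_cache: Dict[str, Dict[str, EventMetadataCacheEntry]] = {}


def get_event_metadata(room_id: str, event_id: str) -> Optional[EventMetadataCacheEntry]:
    # Returns None if the event is unknown to the cache (e.g. it has been purged).
    return _event_metadata_cache.get(room_id, {}).get(event_id)


def upsert_event_metadata(
    room_id: str, event_id: str, entry: EventMetadataCacheEntry
) -> None:
    # Add or replace the metadata entry for an event.
    _event_metadata_cache.setdefault(room_id, {})[event_id] = entry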

Invalidation scenarios

In each invalidation scenario today we have direct access to the Room ID; therefore, we can invalidate all three caches when an event changes: _event_ref, _get_event_cache and _event_metadata_cache.

Event redaction

  • An event is marked for redaction.
  • This happens in PersistEventsStore._persist_events_txn where we have the EventBase and thus access to the Room ID.
  • We currently invalidate the entry in _get_event_cache and _event_ref as they contain redaction metadata.
  • After this proposal we "upsert" (invalidate and then re-add) the _event_metadata_cache entry for this event, setting EventMetadataCacheEntry.redacted_event (a sketch follows this list).
    • Note that we can't just invalidate the entry in the _event_metadata_cache, else we'd then have an event in the _get_event_cache or _event_ref and not in _event_metadata_cache. That should only happen in the case of an event purged by a room purge!
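
In terms of the illustrative helpers above, the redaction upsert might look something like this (again a sketch, not actual Synapse code):

def on_event_redacted(room_id: str, event_id: str, redacted_event: "EventBase") -> None:
    # Upsert rather than invalidate: the entry must stay present so a later hit in
    # _get_event_cache / _event_ref isn't mistaken for a purged event.
    existing = get_event_metadata(room_id, event_id)
    upsert_event_metadata(
        room_id,
        event_id,
        EventMetadataCacheEntry(
            redacted_event=redacted_event,
            rejected_reason=existing.rejected_reason if existing else None,
        ),
    )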

Event expiration or censorship

  • Event "expiration" only occurs when the undocumented enable_ephemeral_messages feature is enabled. The feature prunes the actual event in the database when it expires (this is NOT redaction). The expiration timestamp is listed in the event's content when sent.
    • This occurs in MessageHandler._expire_event where we have the EventBase, and thus the Room ID.
  • Event "censorship" occurs regularly and is driven by the configured redaction_retention_period config option. When an event has been redacted and the configured time has passed, the event will be "censored", aka pruned in the database.
    • This occurs in CensorEventsStore._censor_redactions where we have the EventBase, and thus the Room ID.
  • Since the event body has changed in the database, currently we invalidate the entries in _event_ref and _get_event_cache.
  • After this proposal, we invalidate the entry in _event_metadata_cache as well, even though the event has not been redacted or rejected: it may otherwise never be invalidated and thus take up memory.

Event rejection (and un-rejection)

  • An event is marked as rejected, or no longer rejected when it was previously.
  • This happens in StateGroupWorkerStore._update_state_for_partial_state_event_txn, where we have the EventBase, and thus the Room ID.
  • After this proposal we "upsert" (invalidate and then re-add) the _event_metadata_cache entry for this event, setting EventMetadataCacheEntry.rejected_reason.

Persisting a new event

  • When we persist a new event, we invalidate the event ID in _get_event_cache and _event_ref. If the event is a redaction, we additionally invalidate the event ID that it redacts.
    • Note: ~~For the event being persisted, I'm not actually sure why we invalidate the _get_event_cache and _event_ref in this instance.~~ An event may transition from an outlier to "ex-outlier". For example, when receiving an invite member event, that event is an outlier (we don't have the rest of the DAG to verify it). When we join the room, that event transitions to an ex-outlier. In this case, we update the "outlier" bit in the event's internal metadata, then invalidate the relevant event cache entry.
  • This occurs in PersistEventsStore._persist_events_txn.
  • After this proposal, upon a redaction, we no longer need to invalidate the _get_event_cache and _event_ref of the event that was redacted. Instead, we only need to upsert the record in the _event_metadata_cache. We have access to the room ID of the redacted event to do this.

Partial room purge

  • The room is partially purged from the database, meaning a range of event IDs are purged. This is handled by PurgeEventsStore._purge_history_txn, and both the room ID and event IDs are known.
  • We already invalidate the _get_event_cache and _event_ref here, and could do the same with _event_metadata_cache. As the code notes though, there are race conditions. I'm not convinced that invalidating the cache by room ID helps with that at all.

_get_event_cache (an LruCache) reaches a size or time limit that triggers invalidation

In this case, entries are intentionally left in _event_ref, as the events themselves have not changed. They will eventually be invalidated from _event_ref when no strong references to the event still exist.

However, entries will be left in _event_metadata_cache, which could balloon. This is currently an unsolved problem. Perhaps when an entry is removed from _event_ref through a lack of strong references, that could trigger an invalidation in _event_metadata_cache as well?
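
One illustrative way to wire that up (purely a sketch; it ignores the existing cache machinery) would be to register a weakref callback when the event is added to _event_ref, so the metadata entry is dropped once the event itself is garbage collected:

import weakref


def track_event(room_id: str, event_id: str, event: "EventBase") -> None:
    _event_ref[event_id] = event

    def _drop_metadata() -> None:
        room_map = _event_metadata_cache.get(room_id)
        if room_map is not None:
            room_map.pop(event_id, None)
            if not room_map:
                del _event_metadata_cache[room_id]

    # Fires once no strong references to the event remain anywhere.
    weakref.finalize(event, _drop_metadata)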

Full room purge (new)

  • The room is purged from the database.
  • The room ID is used to invalidate the entry in the top-level dictionary of _event_metadata_cache.
  • When an event is fetched from the _get_event_cache or _event_ref, we extract the room ID and check whether the event is also listed in the _event_metadata_cache. If not, we assume that it has been cleared intentionally and return None (rather than treat it as a cache miss, thus assuming it is also not in the database).

Retrieving an event

Let's take a look at the new flow for retrieving an event from the cache/db:

graph TD;
    A[Calling Function]-->B[look up event ID in _get_event_cache];
    B-->|entry found|C[Check _event_metadata_cache with event.room_id];
    B-->|entry not found|D[look up event ID in _event_ref];
    C-->|entry found|J[populate caches as necessary and return event];
    C-->|entry not found|E[invalidate entry in _get_event_cache and _event_ref];
    D-->|entry found|C;
    D-->|entry not found|F[look up event in the database];
    E-->H[return None];
    F-->|entry found|G[Add entries to _get_event_cache, _event_ref, _event_metadata_cache];
    F-->|entry not found|I[return None];
    G-->K[return event]

in text form:

  • When a calling method fetches an event:
    • attempt to fetch the event from
      • the _get_event_cache or _event_ref cache, if found then
        • using event.room_id, check if an entry for the room ID and event ID exists in _event_metadata_cache
          • if so, populate caches as necessary (populate _get_event_cache if the event was found in _event_ref) and return the (potentially redacted) cached EventBase.
          • if not, invalidate the entry from _get_event_cache and _event_ref and return None.
      • otherwise check the database, if found then
        • insert an entry into _get_event_cache and _event_ref
        • using event.room_id, insert an entry into _event_metadata_cache
        • return the (potentially redacted) EventBase.
      • otherwise return None.

Note that to (lazily) invalidate an event in _get_event_cache and _event_ref, it is the corresponding entry in _event_metadata_cache that needs to be invalidated.
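
The same flow as an illustrative sketch (reusing the structures and helpers from the earlier sketches; fetch_event_from_db is a hypothetical stand-in for the existing database path):

from typing import Optional


async def get_event(event_id: str) -> Optional["EventBase"]:
    # 1. Try the in-memory caches first.
    event = _get_event_cache.get(event_id) or _event_ref.get(event_id)
    if event is not None:
        metadata = get_event_metadata(event.room_id, event_id)
        if metadata is None:
            # No metadata entry: the event must have been purged. Drop the stale
            # copies and behave as if the event does not exist.
            _get_event_cache.pop(event_id, None)
            _event_ref.pop(event_id, None)
            return None
        _get_event_cache[event_id] = event  # re-populate if we only found it in _event_ref
        return metadata.redacted_event or event

    # 2. Fall back to the database.
    row = await fetch_event_from_db(event_id)  # hypothetical helper
    if row is None:
        return None
    _get_event_cache[event_id] = row.event
    _event_ref[event_id] = row.event
    upsert_event_metadata(
        row.event.room_id,
        event_id,
        EventMetadataCacheEntry(
            redacted_event=row.redacted_event, rejected_reason=row.rejected_reason
        ),
    )
    return row.redacted_event or row.event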


Does this sound like it could work? The benefit is a _get_event_cache that does not need to be invalidated upon changing the redaction or rejection status of an event. This aids the case of sharing a cache between homeservers (though all homeservers would need to configure redaction_retention_period identically). I'm not hoping to solve that problem here, just help lay the groundwork for it.

The real goal is to allow removing all events in a room from the _get_event_cache and _event_ref, without any loss of functionality/theoretical performance.

anoadragon453 avatar Oct 04 '22 14:10 anoadragon453

Note: For the event being persisted, I'm not actually sure why we invalidate the _get_event_cache and _event_ref in this instance. Perhaps "just in case"?

One reason is that an event can go from outlier to non-outlier (de-outlier). See just below, where we update that outlier status and then invalidate_get_event_cache_after_txn runs after the txn.

https://github.com/matrix-org/synapse/blob/2c237debd3476bcc45a76e360b0cb33032b23045/synapse/storage/databases/main/events.py#L451-L455

MadLittleMods avatar Oct 06 '22 19:10 MadLittleMods

@MadLittleMods Ah right! I missed that it would be running after the txn, and thus after everything below it had run as well. And thanks for pointing out another invalidation case!

anoadragon453 avatar Oct 07 '22 09:10 anoadragon453

It does beg the question of whether we should also store outlier status in the proposed _event_metadata_cache, instead of the event cache.

When we transition an event from outlier to ex-outlier (PersistEventsStore._update_outliers_txn), we don't overwrite the contents of the event in the database - we only update the "outlier" bit in its internal metadata (which requires invalidating _get_event_cache and _event_ref).

Perhaps we could store outlier status in _event_metadata_cache as well then?

anoadragon453 avatar Oct 07 '22 09:10 anoadragon453

If the retrieval is transparent then that sounds fine 🤷

MadLittleMods avatar Oct 07 '22 16:10 MadLittleMods

@richvdh and I spoke out of band to discuss this. Out of that meeting came a couple of simple changes that will resolve https://github.com/matrix-org/synapse/issues/11521: https://github.com/matrix-org/synapse/pull/14161 and https://github.com/matrix-org/synapse/pull/14164. Assuming those merge, that will unblock me for https://github.com/matrix-org/synapse/issues/4720, which was the current motivation for this change.

Those PRs don't allow correctly invalidating the event caches when purging a room though - they just eliminate some of the symptoms. To fix the cause, we still need the ability to invalidate the event caches by room ID.

Some notes on augmenting the above proposal as a result of our chat (full minutes):

  • We should invalidate entries in the event cache immediately when purging them from a room. That way things that access the event caches directly won't get confused (ok, well maybe we can just stop doing that. Also making sure we check locally means that we can have an external event cache shared with other homeservers)
  • Above we proposed to keep rejected_reason as a field on EventBase, as it is convenient to use to tie together an event and its rejected state when passing the event around. This does mean that an event in the get_event_cache may have a rejected_reason field that's out of sync with rejected_reason stored in event_metadata_cache. If something reaches into the _get_event_cache without consulting event_metadata_cache, then it may get an incorrect/outdated value for rejected_reason.
    • Also note that typically we consider EventBase objects immutable. If we update the rejected_reason field on an EventBase, we'd typically make a new EventBase and replace the old one.
    • Also also note that we track rejections in a table called rejections, but we've been slowly moving this value directly into the events table (via the events_populate_state_key_rejections background update), which makes having multiple homeservers share the same events table harder (noted in https://github.com/matrix-org/synapse/issues/11496#issuecomment-1273581202).
    • It was proposed to not have rejected_reason tracked in event_metadata_cache after all, and instead just keep it in EventBase.
  • Keep redaction status in event_metadata_cache.
  • Note that not all situations require looking up redaction status (sending an event over federation - you just want the requested event and don't care if it's been redacted).

In the end, we may just want to consider having a separate dict of Room ID -> Event ID to fix the problems with invalidating the event caches by room ID. And then a separate amount of work after that to address multi-homeserver caching use cases.
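
A minimal sketch of that simpler alternative (names illustrative): keep a side index of which cached event IDs belong to each room, and walk it when the room is purged.

from typing import Dict, Set

# Side index of the event IDs currently held in the event caches, per room.
_room_to_event_ids: Dict[str, Set[str]] = {}


def note_cached_event(room_id: str, event_id: str) -> None:
    # Called whenever an event is added to _get_event_cache / _event_ref.
    _room_to_event_ids.setdefault(room_id, set()).add(event_id)


def invalidate_room(room_id: str) -> None:
    # Called when a room is purged: invalidate only the events we actually hold in memory.
    for event_id in _room_to_event_ids.pop(room_id, set()):
        _get_event_cache.pop(event_id, None)
        _event_ref.pop(event_id, None)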

anoadragon453 avatar Oct 13 '22 12:10 anoadragon453

What does "multiple homeservers" mean in this context (and other references like "multi-homeserver", "shared with other homeservers")? How can I share a cache with multiple homeservers? Multiple workers?

MadLittleMods avatar Oct 13 '22 17:10 MadLittleMods

If you think of EMS' use case, where you have many homeservers together in a kubernetes cluster, it'd be nice if all of those homeservers could share a single external event store. And on top of that an external event cache. That way you don't duplicate that information across every one of your hundreds of servers.

It should be feasible as long as the data you store is consistent across all homeservers. Event content is; whether the event has been rejected is not necessarily. The latter is currently stored in the event caches (and now the events table in the database), so moving that to a single store that's shared amongst homeservers is trickier. You also need to think about access controls for events - but if you trust all of the homeservers in your cluster you may be able to get away with just doing that homeserver-side.

anoadragon453 avatar Oct 14 '22 12:10 anoadragon453