matrix-rust-sdk icon indicating copy to clipboard operation
matrix-rust-sdk copied to clipboard

[meta] Event Cache should directly be integrated inside the sync response flow

Open Hywan opened this issue 4 months ago • 4 comments

Right now, when a sync response is received, it goes in the sync response flow. It starts at Client::receive_sync_response. At the end, it emits a RoomUpdates on a broadcast channel. On the other side of this channel, EventCache is listening.

Why this decoupled design? If the Event Cache is slow, it's not blocking the sync response, so the whole sync loop (!).

It brings some disadvantages though:

  • When an error happens in the Event Cache, it's harder to trace it to the sync response
    • It means we can't “retry” a sync iteration, or do something “proper” with the sync data
  • When Event Cache has performance issues (it happened recently due to regressions), we don't see it immediately because it runs in parallel, and it's harder to correctly connect them to the real reasons
  • It takes more memory, so more energy, than it should:
    • Everything that goes in a channel must be cloned, so all events are duplicated

Proposal:

  • Event Cache is plugged directly inside the sync response flow
    • We can apply the updates on the Event Cache with a parallel iterator per room (hello rayon) to tackle the performance constraint (if the multiple locks allow that, we need to investigate that anyway)
    • We can remove the EventCache::listen_task
      • One less task!
      • Less states in case of channel has lagged or is closed
  • Remove this channel

Advantages:

  • Event cache errors can bubble up to the sync response flow
  • We can decide to retry a sync if we detect an error
  • It saves memory, so energy
  • No magic. Straightforward and trivial flow!

Hywan avatar Aug 07 '25 09:08 Hywan

Thanks for opening this issue after our live discussion!

We can remove the EventCache::listen_task

I wonder if we should keep the EventCache enablement optional. As it happens, it's getting used in more and more places, so other components might start to depend on it by default (e.g. search). But enabling it by default would cause some overhead on sync processing too.

We can apply the updates on the Event Cache with a parallel iterator per room (hello rayon) to tackle the performance constraint (if the multiple locks allow that, we need to investigate that anyway)

Note: I've tried that in the past, and it was a slowdown. Didn't have time to investigate why, but I suspected Some™ lock contention.

bnjbvr avatar Aug 07 '25 10:08 bnjbvr

Note: I've tried that in the past, and it was a slowdown. Didn't have time to investigate why, but I suspected Some™ lock contention.

That's why I was mentioning locks here 😉.

Hywan avatar Aug 07 '25 11:08 Hywan

Re. removing the broadcast channel.. It was added before the event cache was a thing, and I think it's public? So probably good to check with downstreams before removing it.

jplatte avatar Aug 07 '25 11:08 jplatte

@jplatte Sure. Maybe the mechanism will stay, but the Event Cache won't listen to it anymore.

Hywan avatar Aug 11 '25 08:08 Hywan