
Regression in federation sender instance calls to get joined profile

Open Fizzadar opened this issue 2 years ago • 3 comments

The v1.61.1 release dramatically reduced federation sender calls to the _get_joined_profile_from_event_id cache (see: https://github.com/matrix-org/synapse/pull/12955#issuecomment-1162800438). The calls seem to have returned in 1.63.1; the jump below is when we upgraded from 1.61.1 -> 1.63.1:

[Screenshot 2022-08-03 at 10 06 19]

Fizzadar · Aug 03 '22 09:08

It would appear that matrix.org has also seen something like that:

[images]

I don't currently know of any relevant changes that led to this.

This is currently at roughly 1% of DB time on the federation sender, so I'm also not sure it's urgent.

reivilibre · Aug 03 '22 10:08

Out of interest - how many federation senders are running on matrix.org? I wonder if something is weird about our setup; we have 3x of them each doing ~1k/s on this cache.

Fizzadar · Aug 12 '22 19:08

I've spent a while looking into this.

On matrix.org there is a notable uptick in miss rate on 12th July (coinciding with a deployment of these changes, roughly 1.62.0+ to 1.63.0rc1-), followed by a second uptick on 22nd July (coinciding with a deployment of these changes, roughly equivalent to v1.63.0 to v1.64.0rc1). However, both increases are an order of magnitude smaller than those reported by @Fizzadar (300 requests/second vs 2000). So: bad news for @Fizzadar in that it looks like this gets even worse for 1.64.0.

The federation senders primarily use this lookup to determine which servers they should send events to. Normally, the event creators are expected to precache the list of servers in redis, so we'd only expect to see requests to _get_joined_profile_from_event_id where a server list was not found in the redis cache; indeed, we can see a corresponding uptick in cache misses on the get_joined_hosts external cache:

[image]
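For context, a rough sketch of the split being described (the in-memory EXTERNAL_CACHE and helpers like compute_joined_hosts are illustrative stand-ins, not Synapse's actual API): the event creator works out the joined hosts and writes them to the external cache, and the federation sender only falls back to the expensive recomputation when that entry is missing.

```python
# Illustrative sketch only: names and structure are simplified stand-ins,
# not Synapse's actual implementation.
import asyncio
from typing import Dict, Optional, Set

EXTERNAL_CACHE: Dict[str, Set[str]] = {}  # stands in for the redis external cache


async def compute_joined_hosts(room_id: str, event_id: str) -> Set[str]:
    # Stand-in for the expensive path: in reality this walks the room's
    # membership (one _get_joined_profile_from_event_id-style lookup per
    # member) to work out which remote servers should receive the event.
    return {"matrix.org", "example.com"}


async def creator_precache(room_id: str, event_id: str) -> None:
    # Event-creation side: compute the host list up front and stash it in
    # the external cache so the federation senders never have to.
    EXTERNAL_CACHE[event_id] = await compute_joined_hosts(room_id, event_id)


async def sender_get_hosts(room_id: str, event_id: str) -> Set[str]:
    # Federation-sender side: normally a cheap external-cache hit; only fall
    # back to the expensive recomputation when the entry is missing.
    hosts: Optional[Set[str]] = EXTERNAL_CACHE.get(event_id)
    if hosts is None:
        hosts = await compute_joined_hosts(room_id, event_id)
    return hosts


async def main() -> None:
    await creator_precache("!room:example.com", "$precached")
    print(await sender_get_hosts("!room:example.com", "$precached"))      # hit
    print(await sender_get_hosts("!room:example.com", "$not_precached"))  # miss


asyncio.run(main())
```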

I've not been able to find any obvious reason why we might be seeing more cache misses here. In particular, I'm not seeing any change to the rate of cache population, and given it's a redis cache, I think it's unlikely there will have been any changes to the expiry characteristics. What's particularly odd is that the event_to_prev_state_group hit rate is unchanged, and the two are supposedly populated at pretty much the same time.

It might be instructive to add some logging to report which events are seeing a get_joined_hosts cache miss, and see if we can find any pattern.
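A minimal sketch of the kind of diagnostic logging being suggested here (the function shape and names are assumptions for illustration, not the actual Synapse code):

```python
# Sketch of the suggested diagnostic logging; not the actual Synapse code.
import logging
from typing import Dict, Optional, Set

logger = logging.getLogger(__name__)


def lookup_joined_hosts(
    cache: Dict[str, Set[str]], room_id: str, event_id: str
) -> Optional[Set[str]]:
    hosts = cache.get(event_id)
    if hosts is None:
        # Record which room/event missed, so the logs can show whether misses
        # cluster on particular (e.g. very large) rooms or on locally-created
        # events that raced their own precaching.
        logger.info(
            "get_joined_hosts cache miss: room=%s event=%s", room_id, event_id
        )
    return hosts
```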

@Fizzadar it would be interesting to know if you also see a corresponding pattern in cache misses for the get_joined_hosts external cache. Note that you'll probably have to update the Grafana dashboard, as I just added an "External cache miss rate" panel.

richvdh · Aug 13 '22 21:08

> Out of interest - how many federation senders are running on matrix.org? I wonder if something is weird about our setup; we have 3x of them each doing ~1k/s on this cache.

Sorry, missed this. matrix.org apparently has 20 federation senders, handling a total of around 100 outgoing federation PDUs per second. But note that the code that hits get_joined_hosts runs before the work is sharded (i.e., each of the federation senders does the lookup for every event).
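To illustrate why that multiplies the load, a simplified sketch (the hash-based sharding here is illustrative, not Synapse's exact scheme): every sender resolves the full joined-hosts list for each event and only afterwards keeps the destinations belonging to its own shard.

```python
# Simplified illustration of why every federation sender pays for the lookup:
# each instance resolves the full joined-hosts list for an event and only then
# keeps the destinations in its own shard. Sharding scheme is illustrative.
from typing import Set


def resolve_joined_hosts(event_id: str) -> Set[str]:
    # Stand-in for the get_joined_hosts lookup (external cache, or the
    # expensive fallback on a miss).
    return {f"server{i}.example" for i in range(10)}


def handle_event(event_id: str, instance_index: int, num_instances: int) -> Set[str]:
    all_hosts = resolve_joined_hosts(event_id)  # done by *every* sender
    # Sharding happens only afterwards, so with N senders an expensive miss
    # is paid N times per event rather than once.
    return {h for h in all_hosts if hash(h) % num_instances == instance_index}


# Three senders (as in the setup described above) each handling the same event:
shards = [handle_event("$event1", i, 3) for i in range(3)]
print(shards)
```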

richvdh · Aug 15 '22 13:08

@richvdh sorry for the delay - some interesting charts from our end. Misses (the bump is the 1.64 rollout):

[Screenshot 2022-08-17 at 15 39 14]

Hits for the same period (so we're at a ~99% hit rate):

[Screenshot 2022-08-17 at 15 39 29]

And the joined profile cache hits:

[Screenshot 2022-08-17 at 15 39 42]

What seems crazy to me is that we're talking about ~1 cache miss per second on get_joined_hosts resulting in ~3-4k requests back to the joined profile query. As far as I know our Synapse has no changes within the federation logic. This feels like we're missing the cache for one or two huge rooms (#matrix:matrix.org perhaps?), which is resulting in the profile requests.
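That ratio is plausible if the fallback is per-member. A sketch of the fan-out (names are illustrative; it assumes the fallback does one profile-style lookup per joined member), showing how a single miss on a room with tens of thousands of members becomes thousands of lookups:

```python
# Illustration of the fan-out: one get_joined_hosts miss on a huge room means
# one per-member lookup for every joined user. Names are illustrative.
from typing import Iterable, Set


def profile_lookup(member_event_id: str) -> str:
    # Stand-in for a _get_joined_profile_from_event_id-style lookup: returns
    # the member's server name so it can be added to the destination set.
    return member_event_id.split(":", 1)[1]


def joined_hosts_fallback(member_event_ids: Iterable[str]) -> Set[str]:
    # A single cache miss on a room like #matrix:matrix.org (tens of
    # thousands of joined members) triggers this loop once per member,
    # which is how ~1 miss/s can become thousands of lookups/s.
    return {profile_lookup(eid) for eid in member_event_ids}


members = [f"$member{i}:server{i % 500}.example" for i in range(30_000)]
print(len(joined_hosts_fallback(members)))  # 500 distinct hosts, 30k lookups
```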

I've not had a chance yet to poke into the changes you linked, which could be interesting; I wonder why this is getting missed...

Fizzadar · Aug 17 '22 14:08

Next steps: as per Rich's https://github.com/matrix-org/synapse/issues/13444#issuecomment-1214223590 let's add some log lines to figure out where this is coming from.

erikjohnston · Aug 18 '22 15:08

We're certainly seeing quite a lot of "Unexpectedly did not have cached destinations" lines in the matrix.org logs.

I managed to track a couple down in jaeger. In both cases what appears to have happened is that the call to get_joined_hosts (https://github.com/matrix-org/synapse/blob/v1.68.0rc1/synapse/handlers/message.py#L1468) has taken so long (hundreds or even thousands of milliseconds) that the event has been persisted, and the federation sender has picked it up, before the calculation completes.
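A toy reproduction of the race being described (timings and helper names are invented for illustration): if the creator's joined-hosts computation is slower than persisting the event, the sender wakes up first, finds nothing in the external cache, and falls back.

```python
# Toy illustration of the race: the creator's slow get_joined_hosts call is
# still running when the sender picks up the freshly-persisted event, so the
# sender sees an external-cache miss. Timings and names are invented.
import asyncio
from typing import Dict, Set

EXTERNAL_CACHE: Dict[str, Set[str]] = {}


async def creator(event_id: str) -> None:
    # Persist the event quickly...
    await asyncio.sleep(0.01)
    print("event persisted, sender notified")
    # ...but the joined-hosts calculation takes hundreds of milliseconds,
    # so the cache entry lands too late.
    await asyncio.sleep(0.5)
    EXTERNAL_CACHE[event_id] = {"matrix.org"}


async def sender(event_id: str) -> None:
    await asyncio.sleep(0.05)  # picks the event up shortly after persistence
    if event_id not in EXTERNAL_CACHE:
        print("Unexpectedly did not have cached destinations")  # falls back


async def main() -> None:
    await asyncio.gather(creator("$event1"), sender("$event1"))


asyncio.run(main())
```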

I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

richvdh · Sep 22 '22 22:09

> I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as https://github.com/matrix-org/synapse/pull/13573 and https://github.com/matrix-org/synapse/pull/13569.

except both of those significantly postdate the arrival of this issue :/

richvdh · Sep 22 '22 22:09

Apologies for the very long update on this one - we discovered a quick hack that removed 99% of hits to this function in our setup, since the federation senders don't otherwise ignore bridged events (https://github.com/beeper/synapse/commit/6a1e423aa247cbb3646618a488487cdb2ceb744a). This doesn't address the increase noticed above, though, so I believe this issue still remains.

Fizzadar · Oct 10 '22 07:10

> I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

> except both of those significantly postdate the arrival of this issue :/

It's also plausible that we've sped up event persistence sufficiently that, in large rooms, it is now faster than doing get_joined_hosts? :shrug:

erikjohnston · Oct 10 '22 15:10

> It's also plausible that we've sped up event persistence sufficiently that, in large rooms, it is now faster than doing get_joined_hosts?

Possibly...

Is it possible that we now avoid pulling all of the state out of the database during the event-creation path... right up until we attempt to cache_joined_hosts_for_event, when we'll suddenly need to pull the entire membership list out of the database?
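A sketch of that hypothesis (purely illustrative; the lazy-loading shape is an assumption about the behaviour being described, not the actual code): if earlier event-creation steps only touch a handful of state entries, the full membership list is never loaded until the joined-hosts precaching step forces it out of the database all at once.

```python
# Illustrative sketch of the hypothesis: earlier event-creation steps only
# touch a few state entries, so the first full membership fetch (and hence
# the big DB hit) lands on the joined-hosts precaching step. Not actual code.
from typing import Dict, Set, Tuple

StateKey = Tuple[str, str]  # (event type, state_key)


def db_fetch_one(room_id: str, key: StateKey) -> str:
    return f"{key[0]} for {room_id}"  # stand-in for a single-row DB query


def db_fetch_all_members(room_id: str) -> Set[str]:
    return {f"@user{i}:example.com" for i in range(50_000)}  # stand-in


class LazyRoomState:
    def __init__(self, room_id: str) -> None:
        self.room_id = room_id
        self._cache: Dict[StateKey, str] = {}

    def get_entry(self, key: StateKey) -> str:
        # Cheap: fetch (and cache) a single state entry on demand.
        if key not in self._cache:
            self._cache[key] = db_fetch_one(self.room_id, key)
        return self._cache[key]

    def get_all_memberships(self) -> Set[str]:
        # Expensive: the first caller to need *every* m.room.member entry
        # pays for pulling the whole membership list out of the database.
        return db_fetch_all_members(self.room_id)


state = LazyRoomState("!bigroom:example.com")
state.get_entry(("m.room.power_levels", ""))  # cheap, single row
print(len(state.get_all_memberships()))       # expensive full fetch
```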

richvdh · Oct 10 '22 16:10