
Regression in federation sender instance calls to get joined profile

Open Fizzadar opened this issue 2 years ago • 3 comments

The v1.61.1 release dramatically reduced federation sender calls to the _get_joined_profile_from_event_id cache (see: https://github.com/matrix-org/synapse/pull/12955#issuecomment-1162800438). The calls seem to have returned in 1.63.1; the jump below is when we upgraded from 1.61.1 -> 1.63.1:

[Screenshot 2022-08-03 at 10 06 19]

Fizzadar · Aug 03 '22 09:08

It would appear that matrix.org has also seen something like that:

[images]

I don't currently know of any relevant changes that led to this.

This is currently at roughly 1% of DB time on the federation sender, so I'm also not sure it's urgent.

reivilibre · Aug 03 '22 10:08

Out of interest - how many federation senders are running on matrix.org? I wonder if something is weird about our setup; we have 3x of them each doing ~1k/s on this cache.

Fizzadar · Aug 12 '22 19:08

I've spent a while looking into this.

On matrix.org there is a notable uptick in miss rate on 12th July (coinciding with a deployment of these changes, roughly 1.62.0+ to 1.63.0rc1-), followed by a second uptick on 22nd July (coinciding with a deployment of these changes, roughly equivalent to v1.63.0 to v1.64.0rc1). However, both increases are an order of magnitude smaller than those reported by @Fizzadar (300 requests/second vs 2000). So: bad news for @Fizzadar in that it looks like this gets even worse for 1.64.0.

The federation senders primarily use this lookup to determine which servers they should send events to. Normally, the event creators are expected to precache the list of servers in redis, so we'd only expect to see requests to _get_joined_profile_from_event_id where a server list was not found in the redis cache; indeed, we can see a corresponding uptick in cache misses on the get_joined_hosts external cache:

[image]
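For context, a rough sketch of the split being described (the in-memory EXTERNAL_CACHE and helpers like compute_joined_hosts are illustrative stand-ins, not Synapse's actual API): the event creator works out the joined hosts and writes them to the external cache, and the federation sender only falls back to the expensive recomputation when that entry is missing.

```python
# Illustrative sketch only: names and structure are simplified stand-ins,
# not Synapse's actual implementation.
import asyncio
from typing import Dict, Optional, Set

EXTERNAL_CACHE: Dict[str, Set[str]] = {}  # stands in for the redis external cache


async def compute_joined_hosts(room_id: str, event_id: str) -> Set[str]:
    # Stand-in for the expensive path: in reality this walks the room's
    # membership (one _get_joined_profile_from_event_id-style lookup per
    # member) to work out which remote servers should receive the event.
    return {"matrix.org", "example.com"}


async def creator_precache(room_id: str, event_id: str) -> None:
    # Event-creation side: compute the host list up front and stash it in
    # the external cache so the federation senders never have to.
    EXTERNAL_CACHE[event_id] = await compute_joined_hosts(room_id, event_id)


async def sender_get_hosts(room_id: str, event_id: str) -> Set[str]:
    # Federation-sender side: normally a cheap external-cache hit; only fall
    # back to the expensive recomputation when the entry is missing.
    hosts: Optional[Set[str]] = EXTERNAL_CACHE.get(event_id)
    if hosts is None:
        hosts = await compute_joined_hosts(room_id, event_id)
    return hosts


async def main() -> None:
    await creator_precache("!room:example.com", "$precached")
    print(await sender_get_hosts("!room:example.com", "$precached"))      # hit
    print(await sender_get_hosts("!room:example.com", "$not_precached"))  # miss


asyncio.run(main())
```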

I've not been able to find any obvious reason why we might be seeing more cache misses here. In particular, I'm not seeing any change to the rate of cache population, and given it's a redis cache, I think it's unlikely there will have been any changes to the expiry characteristics. What's particularly odd is that the event_to_prev_state_group hit rate is unchanged, and the two are supposedly populated at pretty much the same time.

It might be instructive to add some logging to report which events are seeing a get_joined_hosts cache miss, and see if we can find any pattern.
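A minimal sketch of the kind of diagnostic logging being suggested here (the function shape and names are assumptions for illustration, not the actual Synapse code):

```python
# Sketch of the suggested diagnostic logging; not the actual Synapse code.
import logging
from typing import Dict, Optional, Set

logger = logging.getLogger(__name__)


def lookup_joined_hosts(
    cache: Dict[str, Set[str]], room_id: str, event_id: str
) -> Optional[Set[str]]:
    hosts = cache.get(event_id)
    if hosts is None:
        # Record which room/event missed, so the logs can show whether misses
        # cluster on particular (e.g. very large) rooms or on locally-created
        # events that raced their own precaching.
        logger.info(
            "get_joined_hosts cache miss: room=%s event=%s", room_id, event_id
        )
    return hosts
```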

@Fizzadar it would be interesting to know if you also see a corresponding pattern in cache misses for the get_joined_hosts external cache. Note that you'll probably have to update the Grafana dashboard, as I just added an "External cache miss rate" panel.

richvdh · Aug 13 '22 21:08

> Out of interest - how many federation senders are running on matrix.org? I wonder if something is weird about our setup; we have 3x of them each doing ~1k/s on this cache.

Sorry, missed this. matrix.org apparently has 20 federation senders, handling a total of around 100 outgoing federation PDUs per second. But note that the code that hits get_joined_hosts runs before the work is sharded (i.e., each of the federation senders does the lookup for every event).
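To illustrate why that multiplies the load, a simplified sketch (the hash-based sharding here is illustrative, not Synapse's exact scheme): every sender resolves the full joined-hosts list for each event and only afterwards keeps the destinations belonging to its own shard.

```python
# Simplified illustration of why every federation sender pays for the lookup:
# each instance resolves the full joined-hosts list for an event and only then
# keeps the destinations in its own shard. Sharding scheme is illustrative.
from typing import Set


def resolve_joined_hosts(event_id: str) -> Set[str]:
    # Stand-in for the get_joined_hosts lookup (external cache, or the
    # expensive fallback on a miss).
    return {f"server{i}.example" for i in range(10)}


def handle_event(event_id: str, instance_index: int, num_instances: int) -> Set[str]:
    all_hosts = resolve_joined_hosts(event_id)  # done by *every* sender
    # Sharding happens only afterwards, so with N senders an expensive miss
    # is paid N times per event rather than once.
    return {h for h in all_hosts if hash(h) % num_instances == instance_index}


# Three senders (as in the setup described above) each handling the same event:
shards = [handle_event("$event1", i, 3) for i in range(3)]
print(shards)
```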

richvdh · Aug 15 '22 13:08

@richvdh sorry for the delay - some interesting charts from our end. Misses (the bump is the 1.64 rollout):

[Screenshot 2022-08-17 at 15 39 14]

Hits for the same period (so we're at a ~99% hit rate):

[Screenshot 2022-08-17 at 15 39 29]

And the joined profile cache hits:

[Screenshot 2022-08-17 at 15 39 42]

What seems crazy to me is that we're talking about ~1 cache miss per second on get_joined_hosts resulting in ~3-4k requests back to the joined profile query. As far as I know our Synapse has no changes within the federation logic. This feels like we're missing the cache for one or two huge rooms (#matrix:matrix.org perhaps?), which is resulting in the profile requests.
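That ratio is plausible if the fallback is per-member. A sketch of the fan-out (names are illustrative; it assumes the fallback does one profile-style lookup per joined member), showing how a single miss on a room with tens of thousands of members becomes thousands of lookups:

```python
# Illustration of the fan-out: one get_joined_hosts miss on a huge room means
# one per-member lookup for every joined user. Names are illustrative.
from typing import Iterable, Set


def profile_lookup(member_event_id: str) -> str:
    # Stand-in for a _get_joined_profile_from_event_id-style lookup: returns
    # the member's server name so it can be added to the destination set.
    return member_event_id.split(":", 1)[1]


def joined_hosts_fallback(member_event_ids: Iterable[str]) -> Set[str]:
    # A single cache miss on a room like #matrix:matrix.org (tens of
    # thousands of joined members) triggers this loop once per member,
    # which is how ~1 miss/s can become thousands of lookups/s.
    return {profile_lookup(eid) for eid in member_event_ids}


members = [f"$member{i}:server{i % 500}.example" for i in range(30_000)]
print(len(joined_hosts_fallback(members)))  # 500 distinct hosts, 30k lookups
```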

I've not had a chance yet to poke into the changes you linked, which could be interesting; I wonder why this is getting missed...

Fizzadar · Aug 17 '22 14:08

Next steps: as per Rich's https://github.com/matrix-org/synapse/issues/13444#issuecomment-1214223590 let's add some log lines to figure out where this is coming from.

erikjohnston · Aug 18 '22 15:08

We're certainly seeing quite a lot of "Unexpectedly did not have cached destinations" lines in the matrix.org logs.

I managed to track a couple down in jaeger. In both cases what appears to have happened is that the call to get_joined_hosts (https://github.com/matrix-org/synapse/blob/v1.68.0rc1/synapse/handlers/message.py#L1468) has taken so long (hundreds or even thousands of milliseconds) that the event has been persisted, and the federation sender has picked it up, before the calculation completes.
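A toy reproduction of the race being described (timings and helper names are invented for illustration): if the creator's joined-hosts computation is slower than persisting the event, the sender wakes up first, finds nothing in the external cache, and falls back.

```python
# Toy illustration of the race: the creator's slow get_joined_hosts call is
# still running when the sender picks up the freshly-persisted event, so the
# sender sees an external-cache miss. Timings and names are invented.
import asyncio
from typing import Dict, Set

EXTERNAL_CACHE: Dict[str, Set[str]] = {}


async def creator(event_id: str) -> None:
    # Persist the event quickly...
    await asyncio.sleep(0.01)
    print("event persisted, sender notified")
    # ...but the joined-hosts calculation takes hundreds of milliseconds,
    # so the cache entry lands too late.
    await asyncio.sleep(0.5)
    EXTERNAL_CACHE[event_id] = {"matrix.org"}


async def sender(event_id: str) -> None:
    await asyncio.sleep(0.05)  # picks the event up shortly after persistence
    if event_id not in EXTERNAL_CACHE:
        print("Unexpectedly did not have cached destinations")  # falls back


async def main() -> None:
    await asyncio.gather(creator("$event1"), sender("$event1"))


asyncio.run(main())
```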

I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

richvdh · Sep 22 '22 22:09

> I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as https://github.com/matrix-org/synapse/pull/13573 and https://github.com/matrix-org/synapse/pull/13569.

except both of those significantly postdate the arrival of this issue :/

richvdh · Sep 22 '22 22:09

Apologies for the very long update on this one - we discovered a quick hack that removed 99% of hits to this function in our setup, since the federation senders don't otherwise ignore bridged events (https://github.com/beeper/synapse/commit/6a1e423aa247cbb3646618a488487cdb2ceb744a). This doesn't address the increase noticed above, though, so I believe this issue still remains.

Fizzadar · Oct 10 '22 07:10

> I wonder if we have somehow damaged the performance of get_joined_hosts. I note that there have been a few recent PRs in that area such as #13573 and #13569.

> except both of those significantly postdate the arrival of this issue :/

It's also plausible that we've sped up event persistence sufficiently that, in large rooms, it is now faster than doing get_joined_hosts? :shrug:

erikjohnston · Oct 10 '22 15:10

> It's also plausible that we've sped up event persistence sufficiently that, in large rooms, it is now faster than doing get_joined_hosts?

Possibly...

Is it possible that we now avoid pulling all of the state out of the database during the event-creation path... right up until we attempt to cache_joined_hosts_for_event, when we'll suddenly need to pull the entire membership list out of the database?
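A sketch of that hypothesis (purely illustrative; the lazy-loading shape is an assumption about the behaviour being described, not the actual code): if earlier event-creation steps only touch a handful of state entries, the full membership list is never loaded until the joined-hosts precaching step forces it out of the database all at once.

```python
# Illustrative sketch of the hypothesis: earlier event-creation steps only
# touch a few state entries, so the first full membership fetch (and hence
# the big DB hit) lands on the joined-hosts precaching step. Not actual code.
from typing import Dict, Set, Tuple

StateKey = Tuple[str, str]  # (event type, state_key)


def db_fetch_one(room_id: str, key: StateKey) -> str:
    return f"{key[0]} for {room_id}"  # stand-in for a single-row DB query


def db_fetch_all_members(room_id: str) -> Set[str]:
    return {f"@user{i}:example.com" for i in range(50_000)}  # stand-in


class LazyRoomState:
    def __init__(self, room_id: str) -> None:
        self.room_id = room_id
        self._cache: Dict[StateKey, str] = {}

    def get_entry(self, key: StateKey) -> str:
        # Cheap: fetch (and cache) a single state entry on demand.
        if key not in self._cache:
            self._cache[key] = db_fetch_one(self.room_id, key)
        return self._cache[key]

    def get_all_memberships(self) -> Set[str]:
        # Expensive: the first caller to need *every* m.room.member entry
        # pays for pulling the whole membership list out of the database.
        return db_fetch_all_members(self.room_id)


state = LazyRoomState("!bigroom:example.com")
state.get_entry(("m.room.power_levels", ""))  # cheap, single row
print(len(state.get_all_memberships()))       # expensive full fetch
```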

richvdh · Oct 10 '22 16:10