synapse icon indicating copy to clipboard operation
synapse copied to clipboard

Perf improvement to getting auth chains

Open erikjohnston opened this issue 1 year ago • 7 comments

Let's see if this helps.

--

Regression introduced in #17044

Fixes: #17129

Replaces: #17154

erikjohnston avatar May 08 '24 15:05 erikjohnston

@erikjohnston I applied this patch on 1.106.0 and unfortunately it does not really improve. Very long get_auth_chain_difference_chains are back:

Bildschirmfoto 2024-05-09 um 16 39 14

csett86 avatar May 09 '24 14:05 csett86

The old code also removed the chains that _materialize adds to the chains from the chains_to_fetch. Is this behavior covered again?

heftig avatar May 15 '24 23:05 heftig

We deployed this on top of v1.114.0 yesterday and it seems to fix the issue. On the most affected server, our transaction times for get_auth_chain_difference_chains are between 10 seconds and 2 minutes instead of quickly increasing to hours.

Is that a normal range? That's about the same as we measured after rolling back from v1.106.0 to v1.104.0, but all other transaction times are in the milliseconds range.

strutztm avatar Sep 04 '24 06:09 strutztm

We've been testing this in EMS. There were encouraging results for 3 hosts suffering badly from https://github.com/element-hq/synapse/issues/17129 that benefited from this patch. However, further testing with more hosts was not as successful. Several hosts seem to be performing worse, once grinding to a halt with the RDS instance totally pinning itself on CPU. Removing the patch immediately resolved the RDS CPU being hammered.

Screenshot-20241003114819-1751x529

Host names in the linked internal issue.

jaywink avatar Oct 03 '24 09:10 jaywink

I've struggled now for hours with an issue probably related #17470 and/or #17129 with the (currently) latest version 1.122.0.

postgres spent >80% of cpu time in doing some WITH RECURSIVE links(chain_id) queries, but this being at the end just SELECTs, even after hours of debugging I could not understand where the writes come from.

Digging around the github issues I have found this and applying the patch fixed it immediately. So still an issue and might be worth having another look?

it's too late for me (4am), happy that it works now. Maybe with more sleep I'll can read through the issues and place there a better report of my findings.

image

alangecker avatar Jan 26 '25 03:01 alangecker

Any blockers or perceived risks remaining for this PR?

3nprob avatar May 16 '25 00:05 3nprob

Caching doesn't actually work.

if cache:

Will always be "falsey" when it's empty so it never triggers

realtyem avatar May 16 '25 12:05 realtyem