                        Perf improvement to getting auth chains
Let's see if this helps.
--
Regression introduced in #17044
Fixes: #17129
Replaces: #17154
@erikjohnston I applied this patch on 1.106.0 and unfortunately it does not really help: very long get_auth_chain_difference_chains transactions are back.
The old code also removed from chains_to_fetch the chains that _materialize adds to chains. Is this behavior still covered?
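For illustration, this is roughly the behavior in question (a hypothetical sketch with made-up names and values, not the actual Synapse implementation):

```python
# Hypothetical sketch, not the real Synapse code: once a _materialize step has
# added chains to `chains`, the old code also dropped those chains from
# `chains_to_fetch` so they would not be fetched and walked again later.
chains = {1: 5}                    # chain_id -> highest sequence number seen
chains_to_fetch = {2, 3, 4}        # chains still queued to be fetched

newly_materialized = {2: 7, 3: 1}  # chains a _materialize call just added
chains.update(newly_materialized)
chains_to_fetch.difference_update(newly_materialized)  # old behavior

print(chains_to_fetch)             # {4}
```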
We deployed this on top of v1.114.0 yesterday and it seems to fix the issue. On the most affected server, our transaction times for get_auth_chain_difference_chains are between 10 seconds and 2 minutes instead of quickly increasing to hours.
Is that a normal range? That's about the same as we measured after rolling back from v1.106.0 to v1.104.0, but all other transaction times are in the millisecond range.
We've been testing this in EMS. There were encouraging results for 3 hosts suffering badly from https://github.com/element-hq/synapse/issues/17129, which clearly benefited from this patch. However, further testing with more hosts was not as successful. Several hosts seem to be performing worse, at one point grinding to a halt with the RDS instance completely pinned on CPU. Removing the patch immediately stopped the RDS CPU from being hammered.
Host names in the linked internal issue.
I've now struggled for hours with an issue probably related to #17470 and/or #17129, on the (currently) latest version, 1.122.0.
Postgres spent >80% of its CPU time on WITH RECURSIVE links(chain_id) queries, but since these are in the end just SELECTs, even after hours of debugging I could not understand where the writes were coming from.
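For reference, this is roughly the shape of query that was eating the CPU (a toy, self-contained illustration using SQLite and an invented links table, not Synapse's actual schema or query):

```python
# Toy illustration of a "WITH RECURSIVE links(chain_id) ..." query: it walks
# auth chain links transitively from a starting chain. It is a pure SELECT,
# which is why the apparent write activity was so confusing.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE event_auth_chain_links (
        origin_chain_id INTEGER,
        target_chain_id INTEGER
    );
    INSERT INTO event_auth_chain_links VALUES (1, 2), (2, 3), (3, 4);
    """
)

rows = conn.execute(
    """
    WITH RECURSIVE links(chain_id) AS (
        VALUES(?)                       -- the chain we start from
        UNION
        SELECT target_chain_id
        FROM event_auth_chain_links
        INNER JOIN links ON links.chain_id = origin_chain_id
    )
    SELECT chain_id FROM links
    """,
    (1,),
).fetchall()

print(sorted(rows))  # [(1,), (2,), (3,), (4,)]
```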
Digging around the GitHub issues I found this PR, and applying the patch fixed it immediately. So this is still an issue and might be worth another look?
It's too late for me (4am); I'm just happy that it works now. Maybe with more sleep I'll be able to read through the issues and leave a better report of my findings there.
Any blockers or perceived risks remaining for this PR?
Caching doesn't actually work.
if cache:
will always be falsey when the cache is empty, so the cached branch never triggers.
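A minimal sketch of the pitfall (generic Python, not the code under review; the actual fix may differ):

```python
cache = {}          # a cache object that exists but currently has no entries

if cache:           # bug: an empty dict is falsey, same as having no cache
    print("using cache")
else:
    print("skipping cache")   # always taken until something has been cached

# If the intent is "was a cache object provided at all?", test explicitly:
if cache is not None:
    print("cache present (possibly still empty)")
```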