matrix-media-repo

Long-term goroutine leak, then sudden spike

Open turt2live opened this issue 2 years ago • 2 comments

[image: graph of goroutine count over time]

Should be at around 60-80 goroutines, per the start of that graph.

turt2live, Sep 20 '23 22:09

We see similar behavior, although the spikes have become significantly more frequent in recent weeks, even though we didn't really change anything. When a spike in goroutine count occurs, MMR also stops serving media and TCP connections just time out.

[image: graph of goroutine count showing the spikes]

davidmehren, Nov 03 '24 09:11

comments from my duplicate ticket (at this point it is becoming frequent enough that i now have a watchdog script to restart the container if it has too many open goroutines):

two days in a row now i have received reports from other matrix users that images served by my matrix-media-repo-backed servers are failing to load.

mmr instances:

mssj.me
jobmachine.org

users reporting errors are from various federated matrix servers, but i was able to reproduce the issue from a matrix.org hosted account.

current architecture is MMR alongside Synapse, both behind Traefik (which handles the appropriate route forwarding to the synapse or media endpoints as documented).

users who have access to their server logs indicate that their federated media calls (from their servers, not the servers listed above) are getting 500 errors after timing out:

2025-04-17 10:54:30,429 - synapse.http.matrixfederationclient - 810 - INFO - GET-106003 - {GET-O-11550} [mssj.me] Request failed: GET matrix-federation://mssj.me/_matrix/federation/v1/media/download/8c9e774b0db7990241b12f0282a01a92285603341912159654478086144?timeout_ms=20000: ResponseNeverReceived:[CancelledError()]
2025-04-17 10:54:30,514 - synapse.http.matrixfederationclient - 810 - INFO - GET-106004 - {GET-O-11551} [jobmachine.org] Request failed: GET matrix-federation://jobmachine.org/_matrix/federation/v1/media/download/fc2d406ba2e465b7342cf01bf349f7e0495344ca1912153864425963520?timeout_ms=20000: ResponseNeverReceived:[CancelledError()]
2025-04-17 10:54:30,542 - synapse.http.matrixfederationclient - 810 - INFO - GET-106005 - {GET-O-11552} [jobmachine.org] Request failed: GET matrix-federation://jobmachine.org/_matrix/federation/v1/media/download/cadba554175de80c1679c0374f279482db0a24e81912147991620222976?timeout_ms=20000: ResponseNeverReceived:[CancelledError()]

on my side, i see no attempts to federate for these images in the MMR logs. the only clues i have are in the Traefik logs, which show 499 responses (the connection was closed by the remote side before a response was sent). these seem to correlate with the 500 errors the other users are seeing on their end.

in both cases, restarting MMR has returned federation to a functional state.

all other functionality of MMR seems to remain intact: users on both of my servers can see images from all other users, including external federated servers.

another MMR user has reported that this may be caused by a growing number of goroutines, and has shared a simple restart script that checks the /metrics endpoint for more than 100 goroutines.
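for anyone who wants to try the same workaround, here is a minimal sketch of that kind of watchdog. it is not the script the other user shared. it assumes the metrics listener is enabled and reachable at the URL below, that the goroutine count is exported as the standard Prometheus Go gauge `go_goroutines`, and that MMR runs in a Docker container with the (hypothetical) name shown. the 100-goroutine threshold is the one mentioned in this thread.

```python
import subprocess
import urllib.request

# Assumptions: adjust these to match your own deployment.
METRICS_URL = "http://localhost:9000/metrics"  # assumed metrics listener address
CONTAINER_NAME = "matrix-media-repo"           # hypothetical container name
GOROUTINE_LIMIT = 100                          # threshold mentioned in this thread


def parse_goroutines(metrics_text):
    """Return the go_goroutines value from Prometheus text output, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "go_goroutines":
            return float(parts[1])
    return None


def needs_restart(metrics_text, limit=GOROUTINE_LIMIT):
    """True only when the gauge is present and strictly above the limit."""
    count = parse_goroutines(metrics_text)
    return count is not None and count > limit


def main():
    """Fetch the metrics once and restart the container if the limit is exceeded."""
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    if needs_restart(text):
        subprocess.run(["docker", "restart", CONTAINER_NAME], check=True)
```

calling `main()` from a cron job or systemd timer every minute or so would be one way to run it. this only masks the leak; it does nothing to explain where the goroutines are coming from.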

beyond this, i have no other leads at the time of this writing.

williamkray, Apr 25 '25 22:04