aries-cloudagent-python Service degredation after too many 400 errors on peer endpoint

Hi all, reporting a defect on acapy 0.11.0 and main.

Summary

After upgrading from 0.10.5 to 0.11.0, our system health-checks reported acapy had stopped responding to requests on both the peer and admin endpoints.

After some investigation, it was discovered that a certain bad request on the peer endpoint would cause the behaviour.

Repro steps

Setup:

Acapy 0.11.0 running locally
Configured with askar wallet
Configured with postgres storage

The issue happens if the peer endpoint is hit with enough requests that result in a HTTP 400 error. For example, hit the service with while :; do $(curl -F foo=bar http://localhost:8200/); done and within a few seconds, the request start to take longer and longer to respond until they all start timing out after 10 seconds. Acapy will report the 10 second timeout in its logs.

The service does not appear to recover after entering this state. Note that /status/ready and /status/live endpoints do report that everything is OK regardless.

Quick fix

I could not find anything obvious that would cause this issue when diffing between the 0.10.5 and 0.11.0 tags. I did however find that the dependency update from aries-askar from 0.2.5 to 0.3.0 "caused" the issue.

I found that 0.2.9 is safe from the issue, and for the meantime have gone with that solution in our own deployment.

So there may be an issue with askar 0.3.0 or it may be acapy's usage of 0.3.0 that causes the behaviour. I decided to log the issue on this project because the bug outwardly affects acapy in a pretty serious way.

Thanks

Feb 12 '24 01:02 dc-anon

@WadeBarnes @andrewwhitehead — any thoughts on this issue?

Feb 13 '24 00:02 swcurran

Closing as stale -- reopen if needed.

Aug 14 '24 22:08 swcurran