Service degredation after too many 400 errors on peer endpoint
Hi all, reporting a defect on acapy 0.11.0 and main.
Summary
After upgrading from 0.10.5 to 0.11.0, our system health-checks reported acapy had stopped responding to requests on both the peer and admin endpoints.
After some investigation, it was discovered that a certain bad request on the peer endpoint would cause the behaviour.
Repro steps
Setup:
- Acapy 0.11.0 running locally
- Configured with askar wallet
- Configured with postgres storage
The issue happens if the peer endpoint is hit with enough requests that result in a HTTP 400 error. For example, hit the service with
while :; do $(curl -F foo=bar http://localhost:8200/); done
and within a few seconds, the request start to take longer and longer to respond until they all start timing out after 10 seconds. Acapy will report the 10 second timeout in its logs.
The service does not appear to recover after entering this state. Note that /status/ready and /status/live endpoints do report that everything is OK regardless.
Quick fix
I could not find anything obvious that would cause this issue when diffing between the 0.10.5 and 0.11.0 tags. I did however find that the dependency update from aries-askar from 0.2.5 to 0.3.0 "caused" the issue.
I found that 0.2.9 is safe from the issue, and for the meantime have gone with that solution in our own deployment.
So there may be an issue with askar 0.3.0 or it may be acapy's usage of 0.3.0 that causes the behaviour. I decided to log the issue on this project because the bug outwardly affects acapy in a pretty serious way.
Thanks
@WadeBarnes @andrewwhitehead — any thoughts on this issue?
Closing as stale -- reopen if needed.