Possible problem with requests timing out
It seems l2met is taking a long time to accept some requests, causing them to time out and trigger Heroku 503 errors. The logs show the 30 seconds or so leading up to the errors; is there any other information that would be useful?
The instance is running cfa2fc0cebd2586d9e2b9ee94905eed6e8cf5703 on Heroku with a two-line change to measure the size of deadline misses.
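For reference, the change is roughly this shape (a minimal sketch; the handler name, the deadline value, and the measure-line format are placeholders, not l2met's actual code):

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// recvDeadline mirrors the -recv-deadline setting; 4s is purely illustrative.
var recvDeadline = 4 * time.Second

// withDeadlineMetric wraps a handler and logs how far past the deadline a
// request ran, i.e. the size of the deadline miss.
func withDeadlineMetric(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		if elapsed := time.Since(start); elapsed > recvDeadline {
			log.Printf("measure#receiver.deadline.miss=%s", elapsed-recvDeadline)
		}
	}
}

// handleLogs is a hypothetical stand-in for the log-drain receiver.
func handleLogs(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/logs", withDeadlineMetric(handleLogs))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```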
@BRMatt Interesting. Can you show me your Procfile and any relevant environment variables?
Procfile:
web: ./l2met -receiver=true -outlet=true -port=$PORT -outlet-ttl=10s -recv-deadline=4
I added the -recv-deadline flag this morning after those errors were reported. The env vars are pretty standard: METCHAN_URL, APP_NAME and SECRETS.
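I'm not certain how -recv-deadline is enforced internally; one way a flag like this could surface as dropped requests is if it feeds the HTTP server's read timeout, along these lines (the flag parsing and wiring here are assumptions, not l2met's actual code):

```go
package main

import (
	"flag"
	"log"
	"net/http"
	"time"
)

func main() {
	// Hypothetical flags mirroring the Procfile above.
	port := flag.String("port", "8080", "listen port")
	recvDeadline := flag.Int("recv-deadline", 4, "seconds allowed to read a request")
	flag.Parse()

	srv := &http.Server{
		Addr: ":" + *port,
		// If the full request isn't read within ReadTimeout, the connection is
		// closed without a response, which the Heroku router reports as an error.
		ReadTimeout: time.Duration(*recvDeadline) * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}
```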
I've updated the gist with another onslaught of 5XX errors. It looks like the app received a large number of log payloads in a short amount of time and was unable to cope with new connections?
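My working guess at the mechanism (not l2met's code; the buffer size and names are invented): if the receiver hands each payload to a bounded buffer and the outlet falls behind during a burst, the handlers block, requests sit past the router's 30-second limit, and it starts returning 503s:

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// inbox sits between the HTTP handler and the outlet. Size is illustrative.
var inbox = make(chan []byte, 1024)

func receive(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	inbox <- body // blocks when the buffer is full, stalling this request
	w.WriteHeader(http.StatusOK)
}

func main() {
	go func() {
		for payload := range inbox {
			_ = payload // hand off to the outlet (omitted)
		}
	}()
	http.HandleFunc("/logs", receive)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```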
@BRMatt How strange. Can you add Heroku runtime metrics to this app? I would be curious to see the metal metrics on this dyno during these turbulent times.
From the logs, it looks like you are doing fewer than 100 HTTP requests per second. I have benchmarked l2met at much higher throughput.
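If it helps to compare numbers, something like this will reproduce roughly that request rate against a local receiver (a rough load-generation sketch; the endpoint, payload, and content type are placeholders, not the exact benchmark setup):

```go
package main

import (
	"bytes"
	"flag"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	url := flag.String("url", "http://localhost:8080/logs", "receiver endpoint")
	rps := flag.Int("rps", 100, "target requests per second")
	dur := flag.Duration("dur", 30*time.Second, "test duration")
	flag.Parse()

	payload := []byte("measure#test.latency=1ms\n")
	var sent, failed int64
	var wg sync.WaitGroup

	ticker := time.NewTicker(time.Second / time.Duration(*rps))
	defer ticker.Stop()
	deadline := time.Now().Add(*dur)

	for time.Now().Before(deadline) {
		<-ticker.C
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Post(*url, "application/logplex-1", bytes.NewReader(payload))
			atomic.AddInt64(&sent, 1)
			if err != nil || resp.StatusCode != http.StatusOK {
				atomic.AddInt64(&failed, 1)
			}
			if resp != nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	fmt.Printf("sent=%d failed=%d\n", sent, failed)
}
```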
Sure thing, here are some more logs covering roughly the two minutes prior to some request timeouts. By the looks of things the dyno is not under any stress at all.
Do you have to perform any actions to bring the system back to a healthy state? How do you recover? Also, how are you noticing these problems?
Do you have to perform any actions to bring the system back to a healthy state? How do you recover?
We don't do anything; the errors are sporadic. Some seem to occur near a dyno restart (in which case we get H13 "Connection closed without response" errors), though most appear to happen at random.
Also, how are you noticing these problems?
The logs are piped through Papertrail, which emails me when l2met returns status codes other than 200.