dd-agent
[dogstatsd] Can't post payload if system is out of ephemeral ports
I had a process running on one of my servers (Windows 2012 R2 Standard x64) which leaked sockets. Eventually the system ran out of ephemeral ports and dogstatsd started throwing this error:
Traceback (most recent call last):
File "dogstatsd.pyc", line 259, in submit_http
File "requests\api.pyc", line 108, in post
File "requests\api.pyc", line 50, in request
File "requests\sessions.pyc", line 464, in request
File "requests\sessions.pyc", line 576, in send
File "requests\adapters.pyc", line 415, in send
ConnectionError: ('Connection aborted.', error(10048, 'Only one usage of each socket address (protocol/network address/port) is normally permitted'))
and shortly after that:
2016-04-18 08:12:23 Coordinated Universal Time | ERROR | dogstatsd(dogstatsd.pyc:269) | Unable to post payload.
Traceback (most recent call last):
File "dogstatsd.pyc", line 259, in submit_http
File "requests\api.pyc", line 108, in post
File "requests\api.pyc", line 50, in request
File "requests\sessions.pyc", line 464, in request
File "requests\sessions.pyc", line 576, in send
File "requests\adapters.pyc", line 415, in send
ConnectionError: ('Connection aborted.', error(10055, 'An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full'))
which caused the agent to stop sending metrics to Datadog, so all my monitors were triggered. Obviously this tells me there's a massive failure, but because there are no metrics I'm pretty much blind when trying to figure out what is actually broken.
I expect a monitoring solution to be robust enough to keep working even in extreme situations like this one.
One potential solution would be to bind() before connect(). I doubt the Python requests library supports this though...
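To make the idea concrete, here is a minimal standard-library sketch of bind() before connect(). It assumes a source port has been set aside outside the ephemeral range; the helper name `connect_from_port` and its parameters are hypothetical, and none of this is taken from dogstatsd itself.

```python
import socket


def connect_from_port(dest, source_port, source_ip="", timeout=5.0):
    """Open a TCP connection to dest with the local end bound to source_port.

    dest is a (host, port) tuple. source_port is assumed to be a port that
    has been reserved for this purpose (hypothetical; pick one outside the
    system's ephemeral range).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.settimeout(timeout)
        # Binding explicitly means connect() does not have to ask the OS
        # for an ephemeral port, which is the allocation that fails once
        # the ephemeral range is exhausted.
        sock.bind((source_ip, source_port))
        sock.connect(dest)
    except OSError:
        sock.close()
        raise
    return sock
```

The standard library's `socket.create_connection(dest, source_address=(ip, port))` performs the same bind-then-connect sequence. One caveat: a fixed source port allows only one live connection to a given destination at a time, and a recently closed connection may keep the port busy for a while, so this is a fallback rather than a general fix.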
Hi @mausch, thanks for taking the time to report this.
You're right, setting up bind before connect would require using a lower-level tool than requests. If you expect some of these metrics to always report data, one solution is to use "Notify if data is missing for more than X minutes". You can find this setting under "Set alert conditions" when creating a monitor: https://app.datadoghq.com/monitors#create/metric
Let me know if you have any other questions.
Thanks @hkaj, that is the workaround I'm using. But as I said, that only tells me that "there's something wrong" and takes away all the monitors and metrics I've spent so much time working on.
Looks like you can do bind before connect with requests after all. I asked on Twitter and got this gist from @agramajo: https://gist.github.com/agramajo/6145047aa49af419c8d2600797cfe752
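For anyone who can't open the gist, one common way to get requests to bind before connecting is a transport adapter that passes `source_address` down to urllib3's connection pool. This is a sketch and not necessarily what the gist does; the class name `SourcePortAdapter` and the port 50000 in the usage note are hypothetical.

```python
import requests
from requests.adapters import HTTPAdapter


class SourcePortAdapter(HTTPAdapter):
    """Transport adapter whose connections bind to a fixed local address."""

    def __init__(self, source_address, **kwargs):
        # Must be set before HTTPAdapter.__init__, which calls
        # init_poolmanager() right away.
        self.source_address = source_address
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        # urllib3 forwards source_address to each new connection, which
        # binds the socket to it before connecting.
        pool_kwargs["source_address"] = self.source_address
        super().init_poolmanager(connections, maxsize, block=block, **pool_kwargs)
```

Usage would look something like `session = requests.Session()` followed by `session.mount("https://", SourcePortAdapter(("0.0.0.0", 50000)))`, then posting through `session` instead of `requests.post`. Requests sent through a proxy use a separate pool manager and are not covered by this sketch.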
@hkaj Does the above look reasonable? Should we file a card to review/track?
yeah that makes sense. Adding it to the backlog.