[dogstatsd] Can't post payload if system is out of ephemeral ports

Open mausch opened this issue 8 years ago • 6 comments

I had a process running on one of my servers (Windows 2012 R2 Standard x64) which leaked sockets. Eventually the system ran out of ephemeral ports and dogstatsd started throwing this error:

Traceback (most recent call last):
  File "dogstatsd.pyc", line 259, in submit_http
  File "requests\api.pyc", line 108, in post
  File "requests\api.pyc", line 50, in request
  File "requests\sessions.pyc", line 464, in request
  File "requests\sessions.pyc", line 576, in send
  File "requests\adapters.pyc", line 415, in send
ConnectionError: ('Connection aborted.', error(10048, 'Only one usage of each socket address (protocol/network address/port) is normally permitted'))

and shortly after that:

2016-04-18 08:12:23 Coordinated Universal Time | ERROR | dogstatsd(dogstatsd.pyc:269) | Unable to post payload.
Traceback (most recent call last):
  File "dogstatsd.pyc", line 259, in submit_http
  File "requests\api.pyc", line 108, in post
  File "requests\api.pyc", line 50, in request
  File "requests\sessions.pyc", line 464, in request
  File "requests\sessions.pyc", line 576, in send
  File "requests\adapters.pyc", line 415, in send
ConnectionError: ('Connection aborted.', error(10055, 'An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full'))

which caused the agent to stop sending metrics to Datadog, so all my monitors were triggered. Obviously this tells me there's a massive failure, but because there are no metrics I'm pretty much blind when trying to figure out what is actually broken.

I expect a monitoring solution to be robust enough to keep working even in extreme situations like this one.

One potential solution would be to bind() before connect(). I doubt the Python requests library supports this though...
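
To illustrate the bind()-before-connect() idea, here is a minimal sketch using only the Python standard library. It is not how dogstatsd submits payloads; the function name `post_from_fixed_port` and the idea of reserving a source port outside the ephemeral range are assumptions made for illustration. `http.client.HTTPConnection` accepts a `source_address` tuple and binds the socket to it before connecting, so the connection does not need the OS to allocate an ephemeral port.

```python
import http.client


def post_from_fixed_port(host, port, path, body, source_port, timeout=5.0):
    """POST `body` to host:port from a fixed local port (hypothetical helper).

    `source_port` is assumed to be a port reserved for this purpose,
    outside the system's ephemeral range.
    """
    # source_address makes http.client bind() the socket to the given
    # local address before it calls connect().
    conn = http.client.HTTPConnection(
        host, port, timeout=timeout, source_address=("", source_port)
    )
    try:
        conn.request(
            "POST", path, body=body,
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status
    finally:
        conn.close()
```

One caveat with this approach: reusing the same source port towards the same destination can fail while an earlier connection is still in TIME_WAIT, so a real implementation would probably need a small pool of reserved ports or a retry.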

mausch avatar Apr 18 '16 09:04 mausch

Hi @mausch, thanks for taking the time to report this. You're right, binding before connecting would require a lower-level tool than requests. If you expect some of these metrics to always be reporting data, one option is to enable "Notify if data is missing for more than X minutes". You can find this setting under "Set alert conditions" when creating a monitor: https://app.datadoghq.com/monitors#create/metric

Let me know if you have any other question.

hkaj avatar Apr 18 '16 13:04 hkaj

Thanks @hkaj , that is the workaround I'm using. But as I said, that only tells me that "there's something wrong" and takes away all the monitors and metrics I've spent so much time working on.

mausch avatar Apr 18 '16 13:04 mausch

Looks like you can bind before connect with requests after all. I asked on Twitter and got this gist from @agramajo: https://gist.github.com/agramajo/6145047aa49af419c8d2600797cfe752
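
For reference, one common way to do this with requests is a transport adapter that passes `source_address` down to urllib3. This is a minimal sketch and not necessarily what the gist does; the names `SourceAddressAdapter` and `make_session`, and the port number in the usage note, are assumptions for illustration.

```python
import requests
from requests.adapters import HTTPAdapter


class SourceAddressAdapter(HTTPAdapter):
    """Transport adapter that binds outgoing sockets to a fixed local address."""

    def __init__(self, source_address, **kwargs):
        # Must be set before super().__init__(), which calls init_poolmanager().
        self.source_address = source_address
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        # urllib3 forwards source_address to the connection, which binds
        # the socket to it before connecting.
        pool_kwargs["source_address"] = self.source_address
        super().init_poolmanager(connections, maxsize, block=block, **pool_kwargs)


def make_session(source_port):
    """Build a Session whose HTTP(S) connections originate from source_port."""
    session = requests.Session()
    adapter = SourceAddressAdapter(("0.0.0.0", source_port))
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

Usage would look like `make_session(50123).post(url, data=payload)`, where 50123 stands in for a port reserved outside the ephemeral range.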

mausch avatar Apr 18 '16 14:04 mausch

@hkaj Does the above look reasonable? Should we file a card to review/track?

irabinovitch avatar Jun 01 '16 08:06 irabinovitch

yeah that makes sense. Adding it to the backlog.

hkaj avatar Jun 01 '16 14:06 hkaj