Allow monitoring `connection refused` on individual destinations

Open: tailhook opened this issue 8 years ago • 1 comment

Currently we're getting this behavior:

2819  connect(13, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
2819  epoll_ctl(3, EPOLL_CTL_ADD, 13, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=2102, u64=2102}}) = 0
2819  epoll_wait(3, {}, 1024, 0)        = 0
2819  epoll_wait(3, {{EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=2102, u64=2102}}}, 1024, 176) = 1
2819  write(5, "\1", 1)                 = 1
2819  epoll_wait(3, {{EPOLLIN, {u32=4294967295, u64=18446744073709551615}}}, 1024, 0) = 1
2819  read(4, "\1", 128)                = 1
2819  read(4, 0x7ffc81ffd8d0, 128)      = -1 EAGAIN (Resource temporarily unavailable)
2819  getsockopt(13, SOL_SOCKET, SO_ERROR, [111], [4]) = 0
2819  close(13) 

This is fine, except that we can't see what specifically is wrong (here `SO_ERROR` is 111, i.e. connection refused) without running strace.
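To illustrate where the failure reason becomes available, here is a minimal Python sketch of the same sequence as the trace: a non-blocking connect, a wait for the socket to become writable, then a read of `SO_ERROR`. The names `probe` and `describe_so_error` are hypothetical and are not part of swindon; the label format is only a guess at what a log line or metric tag might carry.

```python
import errno
import os
import select
import socket


def describe_so_error(code):
    """Turn a raw SO_ERROR value into a readable label (hypothetical helper)."""
    if code == 0:
        return "ok"
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


def probe(host, port, timeout=1.0):
    """Non-blocking connect to host:port; return 0 or the errno of the failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((host, port))
        if rc not in (0, errno.EINPROGRESS):
            return rc  # connect failed immediately
        # Both success and failure make the socket writable.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return errno.ETIMEDOUT
        # Same call as the getsockopt(..., SO_ERROR, ...) line in the trace.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    finally:
        sock.close()


if __name__ == "__main__":
    # Port 12345 on localhost is the destination from the trace.
    print("127.0.0.1:12345 ->", describe_so_error(probe("127.0.0.1", 12345)))
```

Keeping the result keyed by destination address is what would let a status page, metric, or log line point at the specific backend that refuses connections.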

I'm not sure what the best fix is: exposing this in !Status, showing it in metrics or logs, or some combination of these.

Update: Also, queue-size-for-503 defaults to 100k, which is far too big both for our use case and as a default value.

tailhook, Jul 24 '17 13:07

Okay, the first two things to do are:

  1. Expose a metric for the size of the backend queue
  2. Cancel dropped (timed-out) connections, to keep the queue minimal
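
The two steps above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `BackendQueue`, its methods, and the single fixed per-request timeout are all hypothetical and do not reflect swindon's actual implementation.

```python
import time
from collections import deque


class BackendQueue:
    """Hypothetical queue of pending requests for one destination."""

    def __init__(self, max_size, timeout, clock=time.monotonic):
        self.max_size = max_size   # analogous to queue-size-for-503
        self.timeout = timeout     # seconds a request may wait
        self._clock = clock
        self._items = deque()      # (deadline, request), oldest first

    def size(self):
        """Current queue length: the value to export as a metric."""
        return len(self._items)

    def purge_expired(self):
        """Remove requests whose deadline has passed and return them."""
        now = self._clock()
        dropped = []
        # The timeout is constant, so deadlines are in FIFO order and
        # only the front of the queue needs checking.
        while self._items and self._items[0][0] <= now:
            dropped.append(self._items.popleft()[1])
        return dropped

    def push(self, request):
        """Enqueue a request; False means the queue is full (answer 503)."""
        self.purge_expired()
        if len(self._items) >= self.max_size:
            return False
        self._items.append((self._clock() + self.timeout, request))
        return True

    def pop(self):
        """Return the oldest live request, or None if there is none."""
        self.purge_expired()
        return self._items.popleft()[1] if self._items else None
```

Purging on every `push` and `pop` keeps `size()` close to the number of requests that can still be served, so the metric does not count clients that have already given up.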

tailhook, Jul 24 '17 13:07