tile38 icon indicating copy to clipboard operation
tile38 copied to clipboard

Connection Refused / Hooks Fails / Intermittent

Open hmarcelodn opened this issue 5 years ago • 7 comments

We are seing intermittent Connection Refused issues of Tile38 running in Kubernetes with python Redis client. Both client/server implements keep_alive and a retry logic. I analized the network with Wireshark to identify anomalies, but everything is fine. On the other hand, HOOKS sometimes, cannot connect either to the APIs (Timeouts) but Tile38 logic seems to retry and it works. It seems to be the only component failing so far. HOOKS fails more often than Redis connections to Tile38. We are keeping a unique connection opened to Tile38. Does someone experiment this before? It happends a few times a day, bit it happens.

hmarcelodn avatar Sep 06 '18 20:09 hmarcelodn

Maybe you are hitting some kind of system limit like the number of opened "files" (nofile) which also includes the opened sockets.

I would examine the system logs.

And look for a recent "linux network tuning" guide for your distribution.

temporaryaccount avatar Sep 07 '18 07:09 temporaryaccount

I don't see any logging information from Tile38. Now it keeps refusing my connectios for more than 12 seconds then starts working again for a few 30/40 minutes. This failed in avg 50 times a day. I am going to run some recurrent task to see if the service goes down in some moment since hooks also fails a lot of times per day.

hmarcelodn avatar Sep 10 '18 11:09 hmarcelodn

It's possible to get a additional logging from Tile38 by using the -vv flag at start up.

tidwall avatar Sep 10 '18 14:09 tidwall

Yes it does not says too much so far than HOOKS failing intermittent:

2018/09/10 03:30:03 [DEBU] Endpoint connect/send error: 1001: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: Post http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 03:30:03 [DEBU] opened connection: 100.96.4.122:46072
2018/09/10 03:30:03 [DEBU] Endpoint send ok: 1001: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: <nil>
2018/09/10 03:30:04 [DEBU] Endpoint send ok: 1002: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: <nil>
2018/09/10 03:31:09 [DEBU] closed connection: 100.96.4.122:60632
2018/09/10 03:31:22 [DEBU] closed connection: 100.96.4.122:60630
2018/09/10 03:31:22 [DEBU] closed connection: 100.96.4.122:60634
2018/09/10 03:31:22 [DEBU] closed connection: 100.96.4.122:60636
2018/09/10 03:33:18 [DEBU] queued hook: 1003
2018/09/10 03:33:18 [DEBU] Endpoint send ok: 1003: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: <nil>
2018/09/10 03:42:05 [DEBU] queued hook: 1004
2018/09/10 03:42:10 [DEBU] Endpoint connect/send error: 1004: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: Post http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 03:42:10 [DEBU] Endpoint send ok: 1004: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: <nil>
2018/09/10 03:42:11 [DEBU] opened connection: 100.96.4.122:48500
2018/09/10 03:42:33 [DEBU] closed connection: 100.96.4.122:46036
2018/09/10 03:47:05 [DEBU] closed connection: 100.96.4.122:46072
2018/09/10 03:47:10 [DEBU] opened connection: 100.96.4.122:49454
2018/09/10 05:04:33 [DEBU] queued hook: 1005
2018/09/10 05:04:34 [DEBU] closed connection: 100.96.4.122:49454
2018/09/10 05:04:36 [DEBU] opened connection: 100.96.4.122:36120
2018/09/10 05:04:38 [DEBU] Endpoint connect/send error: 1005: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: Post http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 05:04:39 [DEBU] Endpoint send ok: 1005: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: <nil>
2018/09/10 05:04:46 [DEBU] closed connection: 100.96.4.122:48500
2018/09/10 07:30:08 [DEBU] opened connection: 100.96.4.122:35836
2018/09/10 07:30:08 [DEBU] queued hook: 1006
2018/09/10 07:30:08 [DEBU] queued hook: 1007
2018/09/10 07:30:13 [DEBU] Endpoint connect/send error: 1006: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: Post http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 07:30:13 [DEBU] Endpoint send ok: 1006: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: <nil>
2018/09/10 07:30:13 [DEBU] Endpoint send ok: 1007: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: <nil>
2018/09/10 07:33:08 [DEBU] queued hook: 1008
2018/09/10 07:33:13 [DEBU] Endpoint connect/send error: 1008: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: Post http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 07:33:13 [DEBU] Endpoint send ok: 1008: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: <nil>
2018/09/10 11:00:33 [DEBU] opened connection: 100.96.4.122:48024
2018/09/10 11:00:33 [DEBU] opened connection: 100.96.4.122:48026
2018/09/10 11:01:10 [DEBU] closed connection: 100.96.4.122:36120
2018/09/10 11:01:29 [DEBU] closed connection: 100.96.4.122:35836
2018/09/10 11:08:28 [DEBU] queued hook: 1009
2018/09/10 11:08:29 [DEBU] closed connection: 100.96.4.122:48024
2018/09/10 11:08:31 [DEBU] opened connection: 100.96.4.122:49554
2018/09/10 11:08:31 [DEBU] queued hook: 1010
2018/09/10 11:08:31 [DEBU] queued hook: 1011
2018/09/10 11:08:33 [DEBU] Endpoint connect/send error: 1009: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: Post http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 11:08:33 [DEBU] queued hook: 1012
2018/09/10 11:08:33 [DEBU] queued hook: 1013
2018/09/10 11:08:34 [DEBU] Endpoint send ok: 1010: http://track-server/api/v2/fence/tile38_webhook/ze3HUSkzTSy9m8:6bde8852c9887fe19e6424f133c026d240d900cddc9cb8b3a2cc14ca53b2ec62: <nil>
2018/09/10 11:08:34 [DEBU] queued hook: 1014
2018/09/10 11:08:34 [DEBU] Endpoint send ok: 1013: http://track-server/api/v2/fence/tile38_webhook/ze3HUSkzTSy9m8:6bde8852c9887fe19e6424f133c026d240d900cddc9cb8b3a2cc14ca53b2ec62: <nil>
2018/09/10 11:08:34 [DEBU] Endpoint send ok: 1009: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: <nil>
2018/09/10 11:08:34 [DEBU] Endpoint send ok: 1011: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: <nil>
2018/09/10 11:08:34 [DEBU] Endpoint send ok: 1012: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: <nil>
2018/09/10 11:08:35 [DEBU] Endpoint send ok: 1014: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: <nil>
2018/09/10 11:09:57 [DEBU] closed connection: 100.96.4.122:48026
2018/09/10 11:11:28 [DEBU] queued hook: 1015
2018/09/10 11:11:28 [DEBU] queued hook: 1016
2018/09/10 11:11:33 [DEBU] Endpoint connect/send error: 1015: http://track-server/api/v2/fence/tile38_webhook/G6JdAiMcYW1ES6:d452f70adbcf4cb726091de40bc162615316dd2c44df0a8b5e78e1ebe686e1c4: Post http://track-server/api/v2/fence/tile38_webhook/G6JdAiMcYW1ES6:d452f70adbcf4cb726091de40bc162615316dd2c44df0a8b5e78e1ebe686e1c4: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 11:11:33 [DEBU] Endpoint connect/send error: 1016: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: Post http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 11:11:34 [DEBU] Endpoint send ok: 1015: http://track-server/api/v2/fence/tile38_webhook/G6JdAiMcYW1ES6:d452f70adbcf4cb726091de40bc162615316dd2c44df0a8b5e78e1ebe686e1c4: <nil>
2018/09/10 11:11:34 [DEBU] Endpoint send ok: 1016: http://track-server/api/v2/fence/tile38_webhook/Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0: <nil>
2018/09/10 11:53:12 [DEBU] opened connection: 100.96.4.122:58548
2018/09/10 11:53:12 [DEBU] opened connection: 100.96.4.122:58550
2018/09/10 11:53:12 [DEBU] opened connection: 100.96.4.122:58552
2018/09/10 11:54:03 [DEBU] closed connection: 100.96.4.122:49554
2018/09/10 12:10:57 [DEBU] opened connection: 100.96.4.114:58206
2018/09/10 12:10:58 [DEBU] opened connection: 100.96.4.114:58216
2018/09/10 13:48:18 [DEBU] queued hook: 1017
2018/09/10 13:48:23 [DEBU] Endpoint connect/send error: 1017: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: Post http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2018/09/10 13:48:23 [DEBU] Endpoint send ok: 1017: http://track-server/api/v2/fence/tile38_webhook/Kd1Tq5j5gx6PWw:a6b72d3c6caa1a715a6bd69db24500c97063e18db60d61756f4296b1ff4407ac: <nil>
2018/09/10 15:49:01 [DEBU] closed connection: 100.96.4.122:58548

No just the hooks, but I am not able to connect sometimes via Redis.

hmarcelodn avatar Sep 10 '18 15:09 hmarcelodn

All the Client.Timeout exceeded while awaiting header are a clue.

I notice that each endpoint has a unique URL with a suffix like:

Bw49uP1BBaAbGL:ae96733cab8a87bb82b979104bd9b97d56af0f61bbb8faf581c5264cb5d519c0

About how many distinct hooks are you working with?

tidwall avatar Sep 12 '18 17:09 tidwall

@tidwall We are about ~14 fence hooks in the environment where I took the logs.

hmarcelodn avatar Sep 13 '18 14:09 hmarcelodn

From the log info I can tell that Tile38 is queuing the hook message locally, at which point it requests to immediately send the message to the endpoint.

2018/09/10 07:33:08 [DEBU] queued hook: 1008

But the request times out after the 5 second. (Messages have a 5 second deadline)

2018/09/10 07:33:13 [DEBU] Endpoint connect/send error: 1008: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: Post http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Then Tile38 reconnects to the endpoint and delivers the message.

2018/09/10 07:33:13 [DEBU] Endpoint send ok: 1008: http://track-server/api/v2/fence/tile38_webhook/XQu0rLR6dEtnSJ:9758ff74492492566123276e217344412fda8c54f6db7b1361c41b62134a3491: <nil>

Under the hood Tile38 uses one standard HTTP Go client per hook. The Tile38 client will keep the connection open for as long as the server will allow. So it's possible that many messages can be sent on a single connection.

In your case I suspect that, for each hook connection, the endpoint server is correctly receiving the first message request, and then keeps the connection alive. It's ready to accept more messages, but for some reason the next message is not being handled within the 5 seconds. So Tile38 kills the connection and tries again. Which then succeeds.

I would first double check that the endpoint server can correctly handle multiple requests on a single connection. Then I would make sure that there isn't anything else that's in-between Tile38 and the endpoint server that might be causing the request to hang.

tidwall avatar Sep 13 '18 15:09 tidwall