ngx_dynamic_upstream module does not pick up new upstream addresses after running for a long time

To be honest, this module works great and meets all of our expectations, except that we are currently facing an issue: after running for a long time (maybe a week), new upstream peers are no longer detected by DNS resolution. I guess my issue is related to the first issue in https://github.com/ZigzagAK/ngx_dynamic_upstream/issues/7#issuecomment-797814677.

We use this module in an environment where peers can be dynamically added to or removed from an upstream depending on the real workload (we are using an auto scaling group and Docker Swarm). After a long running time, I checked the peer list and all peers were stale (marked as unhealthy by the dynamic_healthcheck module). It seems that this module cannot clean up the old stale peer list and update it with newly added peers when the shared memory is full (correct me if I'm wrong). After nginx -s reload, everything is back to normal.

Secondly, there is a potential risk that a stale IP address gets allocated to another service.

Many thanks,

Nov 02 '21 11:11 truong-hua

@truong-hua You may use the https://github.com/ZigzagAK/ngx_sysinfo module to check the used space in shared zones. Cleanup of old states is implemented in https://github.com/ZigzagAK/ngx_dynamic_healthcheck/blob/master/src/ngx_dynamic_healthcheck_state.c#L256. Please check the used space and the error.log files the next time the problem happens. You may also try the 2.X.X branch of the ngx_dynamic_healthcheck module. This branch uses a single shared zone for the states of all upstreams. The size of this shared zone may be defined with https://github.com/ZigzagAK/ngx_dynamic_healthcheck/blob/2.X.X/src/ngx_http_dynamic_healthcheck.cpp#L44.

Nov 02 '21 12:11 ZigzagAK

But the 2.X.X branch may not be stable. Please try it in your test zones.

Nov 02 '21 12:11 ZigzagAK

Thanks @ZigzagAK, I will give you more information next time. I just checked the old log and there were just a lot of errors like this: 2021/11/02 06:50:10 [error] 3708#0: [http] dashboard-service: tasks.dashboard:3000 addr=10.0.0.5:3000, fd=4 connect error (111: Connection refused)

I searched and there are no memory-related errors in the log. But the peer list returned by the healthcheck status command is stale: all peer IP addresses there are old and no longer work, which causes a lot of healthcheck errors like the one above. After a reload, the peer list is updated to the right one and keeps working properly when I add a peer to or remove a peer from DNS.
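One way to confirm that the peer list is stale is to compare it with what DNS currently returns. The following is only a minimal sketch: it assumes the peer addresses have already been extracted from the healthcheck status output into ip:port strings (that parsing is not shown), and the function names are hypothetical, not part of either module.

```python
import socket


def resolve_ipv4(host, port):
    """Return the set of 'ip:port' strings DNS currently gives for host."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return {f"{sockaddr[0]}:{sockaddr[1]}" for *_, sockaddr in infos}


def diff_peers(reported, resolved):
    """Split peers into stale (reported only) and missing (in DNS only)."""
    reported, resolved = set(reported), set(resolved)
    stale = sorted(reported - resolved)
    missing = sorted(resolved - reported)
    return stale, missing
```

In practice, `resolved` would come from something like `resolve_ipv4("tasks.dashboard", 3000)`, run from the same network as nginx. A non-empty stale list together with a non-empty missing list matches the behaviour described above.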

Will this module be able to clean up stale IPs in memory by itself when the DNS is updated?

Nov 02 '21 12:11 truong-hua

ngx_dynamic_healthcheck writes a 'no memory' error to error.log if no space is available in the shared zone. If you did not find 'no memory', then there was no problem with it. ngx_dynamic_upstream writes a 'no shared memory' record to error.log.
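Searching a large error.log for these two strings can be scripted. A minimal sketch, assuming only that the quoted strings appear somewhere in the log line (the exact line format is not known here, and the function names are hypothetical):

```python
# Substrings quoted in this thread, mapped to the module said to emit them.
MARKERS = {
    "no shared memory": "ngx_dynamic_upstream",
    "no memory": "ngx_dynamic_healthcheck",
}


def find_memory_errors(lines):
    """Return (module, line) pairs for lines that contain a known marker."""
    hits = []
    for line in lines:
        for marker, module in MARKERS.items():
            if marker in line:
                hits.append((module, line.rstrip("\n")))
                break  # one match per line is enough
    return hits


def scan_file(path):
    """Scan a log file line by line without loading it all into memory."""
    with open(path, errors="replace") as log:
        return find_memory_errors(log)
```

An empty result from `scan_file` on the full log since the last start or reload would suggest that shared memory exhaustion is not the cause.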

But the peer list response from healthcheck status command is stale

Is the list of peers in this response the same as the one from https://github.com/ZigzagAK/ngx_dynamic_upstream#list ?

Nov 02 '21 20:11 ZigzagAK

Yes, the content must be the same, because healthcheck status traverses the upstream peers and looks up the healthcheck status of each one.

https://github.com/ZigzagAK/ngx_dynamic_healthcheck/blob/master/src/ngx_http_dynamic_healthcheck.cpp#L1225

Nov 02 '21 20:11 ZigzagAK

The problem with zero resolved peers can be fixed only if nginx stops interpreting the result of getaddrinfo as 'host not found' (https://github.com/nginx/nginx/blob/master/src/core/ngx_inet.c#L1137) in all situations. The ngx_dynamic_upstream module uses ngx_parse_url from nginx core. nginx hides all possible errors and maps them all to 'host not found', but the real reason may be an unavailable DNS server, a temporary DNS server error, or many others. In this situation, if ngx_dynamic_upstream dropped all peers from the upstream, it would make all backends unavailable for requests.
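To illustrate the distinction that gets lost: getaddrinfo itself reports different error codes, for example EAI_AGAIN for a temporary failure and EAI_NONAME for a name that does not resolve. The sketch below is in Python and is not how nginx or this module is implemented; the function names and the "keep the current peers on any failure" policy are assumptions made for illustration.

```python
import socket


def classify_gai_error(exc):
    """Map a socket.gaierror to a coarse category instead of one message."""
    if exc.errno == socket.EAI_AGAIN:
        return "temporary"   # resolver asks the caller to retry later
    if exc.errno == socket.EAI_NONAME:
        return "not_found"   # the name does not resolve
    return "other"


def refresh_peers(host, port, current, resolver=socket.getaddrinfo):
    """Return (peers, status); keep `current` whenever resolution fails."""
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return set(current), classify_gai_error(exc)
    # The fifth field of each entry is sockaddr; its first item is the IP.
    resolved = {info[4][0] for info in infos}
    if not resolved:
        return set(current), "empty"
    return resolved, "ok"
```

With the status available, a caller could decide to drop peers only on a definite "not_found" and keep them on "temporary", which is the choice that cannot be made when every failure looks like 'host not found'.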

If your problem isn't related to this 'feature' of nginx, then it may be a bug in ngx_dynamic_upstream.

To help me understand what is happening, you may send me an archive of the error.log file covering the period since nginx was started (or reloaded). You may send it directly to my email.

Nov 02 '21 21:11 ZigzagAK

I sent the related error log to your email, @ZigzagAK.

Nov 03 '21 09:11 truong-hua

I can't find any records related to the ngx_dynamic_upstream module, only ones from ngx_dynamic_healthcheck. When a worker process starts, the ngx_dynamic_upstream module writes one of the following to error.log:

dynamic upstream: using nginx thread pool (https://github.com/ZigzagAK/ngx_dynamic_upstream/blob/master/src/ngx_http_dynamic_upstream_module.cpp#L1266)
dynamic upstream: using background threads (https://github.com/ZigzagAK/ngx_dynamic_upstream/blob/master/src/ngx_http_dynamic_upstream_module.cpp#L1294)

I think your nginx wasn't built with this module.

Nov 03 '21 15:11 ZigzagAK

Or maybe you are using a very old version of this module.

Nov 03 '21 15:11 ZigzagAK