Module does not pick up new upstream addresses after running for a long time
To be honest, this module works great and meets all of our expectations, except that we are currently facing one issue: after running for a long time (maybe a week), new upstream peers are no longer detected through DNS resolution. I guess that my issue is related to the first issue of https://github.com/ZigzagAK/ngx_dynamic_upstream/issues/7#issuecomment-797814677.
We are using this module in an environment where peers of an upstream can be dynamically added or removed depending on the real workload (we are using an auto scaling group and Docker Swarm). After a long running time, I checked the peer list and all peers were stale (marked as unhealthy by the dynamic_healthcheck module). It seems that this module cannot clean up the old stale peer list and update it with newly added peers when the shared memory is full (correct me if I'm wrong). After nginx -s reload, everything is back to normal.
Secondly, there is a potential risk that a stale IP address could be reallocated to another service.
Many thanks,
@truong-hua You can use the https://github.com/ZigzagAK/ngx_sysinfo module to check the used space in shared zones. Cleanup of old states is implemented in https://github.com/ZigzagAK/ngx_dynamic_healthcheck/blob/master/src/ngx_dynamic_healthcheck_state.c#L256. Please check the used space and the error.log files the next time the problem happens. You can also try the 2.X.X branch of the ngx_dynamic_healthcheck module. This branch uses a single shared zone for the states of all upstreams. The size of this shared zone can be defined with https://github.com/ZigzagAK/ngx_dynamic_healthcheck/blob/2.X.X/src/ngx_http_dynamic_healthcheck.cpp#L44.
But the 2.X.X branch may not be stable. Please try it in your test zones first.
Thanks @ZigzagAK, I will give you more information next time. I just checked the old log and there were just a lot of errors like this:
2021/11/02 06:50:10 [error] 3708#0: [http] dashboard-service: tasks.dashboard:3000 addr=10.0.0.5:3000, fd=4 connect error (111: Connection refused)
I searched and there is no memory-related error in the log. But the peer list response from the healthcheck status command is stale: all peer IP addresses there are old and no longer work, which causes a lot of healthcheck errors like the one above. After a reload, the peer list is updated to the right one and keeps working properly if I add or remove another peer to/from DNS.
Will this module be able to clean up stale IPs in memory by itself when DNS is updated?
ngx_dynamic_healthcheck writes a 'no memory' error to error.log if no space is available in the shared zone. If you did not find 'no memory', then that was not the problem. ngx_dynamic_upstream, for its part, writes a 'no shared memory' record to error.log.
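If it helps, here is a minimal sketch for scanning a log for those two markers. Only the two substrings come from this thread; the function name and the assumption that the markers appear verbatim somewhere in a log line are mine.

```python
from collections import Counter
from typing import Iterable

# Substrings quoted in this thread. The rest of the log line format
# is not assumed; we only look for these substrings.
MARKERS = ("no memory", "no shared memory")


def count_memory_errors(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each shared-memory marker."""
    counts: Counter = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

It can be fed an open file object, for example `count_memory_errors(open("error.log", errors="replace"))`, where the path is whatever your nginx is configured to log to.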
But the peer list response from healthcheck status command is stale
Is the list of peers in this response the same as the one from https://github.com/ZigzagAK/ngx_dynamic_upstream#list ?
Yes, the content must be the same, because the healthcheck status traverses the upstream peers and looks up the healthcheck status of each one.
https://github.com/ZigzagAK/ngx_dynamic_healthcheck/blob/master/src/ngx_http_dynamic_healthcheck.cpp#L1225
The problem with zero resolved peers can be fixed only if nginx stops interpreting every getaddrinfo failure as 'host not found' (https://github.com/nginx/nginx/blob/master/src/core/ngx_inet.c#L1137). The ngx_dynamic_upstream module uses ngx_parse_url from the nginx core. nginx hides all possible errors and maps all of them to 'host not found', but the real reason may be an unavailable DNS server, a temporary DNS server error, or many others. In this situation, if ngx_dynamic_upstream dropped all peers from the upstream, it would make all backends unavailable for requests.
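To illustrate the distinction that gets lost, here is a rough Python sketch (not how nginx or ngx_dynamic_upstream is implemented). It assumes a hypothetical refresh_peers helper that keeps the old peer set on any resolver failure other than a definitive "name does not exist" (EAI_NONAME). Whether EAI_NONAME is reliable enough to justify dropping peers on a given platform is also an assumption.

```python
import socket
from typing import Callable, Set

# Only this code is treated as "the name really does not exist".
# Everything else (for example EAI_AGAIN) is treated as transient.
DEFINITIVE_ERRORS = {socket.EAI_NONAME}


def resolve(host: str, port: int) -> Set[str]:
    """Resolve host to a set of IP addresses with the system resolver."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def refresh_peers(
    current: Set[str],
    host: str,
    port: int,
    resolver: Callable[[str, int], Set[str]] = resolve,
) -> Set[str]:
    """Return the new peer set, keeping the old one on transient errors."""
    try:
        return resolver(host, port)
    except socket.gaierror as exc:
        if exc.errno in DEFINITIVE_ERRORS:
            # The name is gone: dropping the peers is a deliberate choice.
            return set()
        # DNS server unavailable or temporary failure: keep serving
        # the peers we already know about.
        return current
```

The resolver argument exists only so the behaviour can be exercised without real DNS, for example by passing a function that raises `socket.gaierror(socket.EAI_AGAIN, "simulated")`.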
If your problem isn't related to this 'feature' of nginx, then it may be a bug in ngx_dynamic_upstream.
To help me understand what is happening, you can send me an archive of the error.log file covering everything since nginx was started (or reloaded). You can send it directly to my email.
I sent the related error log to your email, @ZigzagAK.
I can't find any records related to the ngx_dynamic_upstream module, only ones from ngx_dynamic_healthcheck. When a worker process starts, the ngx_dynamic_upstream module writes one of the following to error.log:
- dynamic upstream: using nginx thread pool (https://github.com/ZigzagAK/ngx_dynamic_upstream/blob/master/src/ngx_http_dynamic_upstream_module.cpp#L1266)
- dynamic upstream: using background threads (https://github.com/ZigzagAK/ngx_dynamic_upstream/blob/master/src/ngx_http_dynamic_upstream_module.cpp#L1294)
I think your nginx wasn't built with this module, or maybe you are using a very old version of it.
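A quick way to check a log for either startup message could look like the sketch below. The two message strings are the ones quoted above; the helper name is hypothetical, and it assumes the messages appear verbatim inside a log line.

```python
from typing import Iterable, Optional

# Startup messages quoted in this thread.
STARTUP_MESSAGES = (
    "dynamic upstream: using nginx thread pool",
    "dynamic upstream: using background threads",
)


def find_startup_message(lines: Iterable[str]) -> Optional[str]:
    """Return the first startup message found in the log, or None."""
    for line in lines:
        for message in STARTUP_MESSAGES:
            if message in line:
                return message
    return None
```

A result of None for a log that starts at nginx start (or reload) would be consistent with the module not being built in, though it could also mean the log level filtered the message out.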