upstream servers with backup status remain active targets

Open lisp opened this issue 1 year ago • 0 comments

Environment

Include the result of the following commands:

nginx -V

nginx version: nginx/1.23.3
built by gcc 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1) 
built with OpenSSL 1.1.1f  31 Mar 2020
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fdebug-prefix-map=/build/nginx-lUTckl/nginx-1.23.3=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -Wdate-time -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now -fPIC' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --modules-path=/usr/lib/nginx/modules --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-compat --with-pcre-jit --with-pcre --with-http_ssl_module --with-http_image_filter_module=dynamic --modules-path=/etc/nginx/modules --with-http_v2_module --with-stream=dynamic --with-http_addition_module --with-http_mp4_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_dav_module --with-http_slice_module --with-threads --with-http_gunzip_module --with-http_gzip_static_module --with-http_sub_module --with-http_xslt_module=dynamic --with-stream_ssl_module --with-mail=dynamic --with-mail_ssl_module --add-dynamic-module=../ngx_http_auth_pam_module --add-dynamic-module=../nginx-dav-ext-module --add-dynamic-module=../headers-more-nginx-module

uname -a

Linux nl13.dydra.com 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Description

i have an upstream clause with two servers. one is the primary and the other is the backup. this is one element of the larger configuration file, which appears below.

upstream Service { 
server 127.0.0.1:8106; 
server 127.0.0.1:8100 backup; 
}

periodically, the primary server marks itself "down", reloads nginx, and restarts. the successor process removes the "down" status, reloads nginx, and continues. i observe that traffic sometimes continues to go to the backup server even after the primary service has restarted.
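
for anyone trying to reproduce this, the cycle described above could be sketched roughly as follows. this is an illustration only, not the actual production procedure: the file path in CONF, the function names, and the use of GNU sed are all assumptions. only the "down" server parameter and the `nginx -t` / `nginx -s reload` commands are standard nginx.

```shell
#!/bin/sh
# Hypothetical helpers. The CONF path is an assumption, not taken from the report.
CONF="${CONF:-/etc/nginx/conf.d/upstreams.conf}"
PRIMARY="127.0.0.1:8106"

# Append the "down" parameter to the primary's server line.
# Lines that already carry a parameter, or are commented out, do not match.
mark_down() {
    sed -i "s/^\([[:space:]]*server[[:space:]]\{1,\}$PRIMARY\);/\1 down;/" "$CONF"
}

# Remove the "down" parameter again.
clear_down() {
    sed -i "s/^\([[:space:]]*server[[:space:]]\{1,\}$PRIMARY\)[[:space:]]\{1,\}down;/\1;/" "$CONF"
}

# Validate the configuration, then ask the master process to reload it.
reload_nginx() {
    nginx -t && nginx -s reload
}
```

under these assumptions the outgoing process would run `mark_down && reload_nginx` before exiting, and its successor would run `clear_down && reload_nginx` once it is ready to serve.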

an excerpt of the syslog showing the expected behaviour of the server processes looks like this: 1676218 took over from 1430500, while 686680 was the backup. the respective pid appears in "[...]".

2024-09-14T11:04:55.632496+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 6913A03E-7278-11EF-AFDA-DDBA947010DB 
2024-09-14T11:05:02.580638+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 6D37D5B8-7278-11EF-AFDA-DDBA947010DB 
2024-09-14T11:05:09.546325+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 715EB60C-7278-11EF-AFDA-DDBA947010DB 
2024-09-14T11:05:16.581053+02:00 [warning] spocq[686680]: [T] graph-store-response: service quality Service 759001FE-7278-11EF-AFDA-DDBA947010DB 
2024-09-14T11:05:24.585299+02:00 [warning] spocq[1676218]: [T] graph-store-response: service quality Service 79BD606E-7278-11EF-AFDA-DDBA947010DB 
2024-09-14T11:05:31.578072+02:00 [warning] spocq[1676218]: [T] graph-store-response: service quality Service 7E7B4602-7278-11EF-AFDA-DDBA947010DB 
2024-09-14T11:05:38.436851+02:00 [warning] spocq[1676218]: [T] graph-store-response: service quality Service 8294A440-7278-11EF-AFDA-DDBA947010DB

sometimes, however, the requests continue to appear at the backup server even when the successor primary server is active. in a case where the 1430500 server had started at 2024-09-13T23:34:01, entries for the backup server, 686680, continue to appear hours later, which indicates that nginx continues to target it with requests.

2024-09-14T00:00:07.185312+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 899C7068-721B-11EF-AFD9-DDBA947010DB 
2024-09-14T00:00:20.950918+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 91E9FD26-721B-11EF-AFD9-DDBA947010DB 
2024-09-14T01:03:58.710002+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 75307B84-7224-11EF-AFD9-DDBA947010DB 
2024-09-14T01:06:00.078388+02:00 [warning] spocq[686680]: [T] graph-store-response: service quality Service BDD12802-7224-11EF-AFD9-DDBA947010DB 
2024-09-14T02:00:02.177556+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 49FD7806-722C-11EF-AFD9-DDBA947010DB 
2024-09-14T04:02:02.273035+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 55484D92-723D-11EF-AFD9-DDBA947010DB 
... 
2024-09-14T04:42:02.125885+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service EBBF2CC8-7242-11EF-AFD9-DDBA947010DB 
2024-09-14T08:00:37.724415+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service AA09B2B4-725E-11EF-AFD9-DDBA947010DB 
2024-09-14T08:02:39.136554+02:00 [warning] spocq[686680]: [T] graph-store-response: service quality Service F26A7854-725E-11EF-AFD9-DDBA947010DB 
2024-09-14T08:40:07.768077+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 2EB166CE-7264-11EF-AFD9-DDBA947010DB 
2024-09-14T08:41:20.670565+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 5A2578F4-7264-11EF-AFD9-DDBA947010DB 
2024-09-14T08:41:39.515107+02:00 [warning] spocq[1430500]: [T] graph-store-response: service quality Service 65642760-7264-11EF-AFD9-DDBA947010DB
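
to quantify how often the backup is targeted, the entries can be tallied per pid. the following is a minimal sketch that assumes the log line format shown in the excerpts above. the function name `count_by_pid` is made up for illustration.

```shell
#!/bin/sh
# Count "service quality Service" entries per spocq pid.
# Reads syslog lines on stdin and prints "<pid> <count>", sorted by pid.
count_by_pid() {
    grep 'service quality Service' |
        sed -n 's/.*spocq\[\([0-9]\{1,\}\)\].*/\1/p' |
        sort -n | uniq -c | awk '{ print $2, $1 }'
}
```

usage would be something like `count_by_pid < /var/log/syslog`, where the log path is an assumption. a non-zero count for the backup's pid over a period in which the primary was running would show the behaviour described here.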

[ ] The bug is reproducible with the latest version of nginx: it recurs, but infrequently, with 1.23.3.
[ ] The nginx configuration is minimized to the smallest possible to reproduce the issue and doesn't contain third-party modules: the behaviour is present in a production system.

nginx configuration

the pertinent configuration file is an upstream configuration:

upstream spocq {
  least_conn;
 server 127.0.0.1:8101;
 server 127.0.0.1:8102;
 server 127.0.0.1:8103;
 server 127.0.0.1:8104;
 server 127.0.0.1:8105;
 # 20240726 isolate 8106
 # server 127.0.0.1:8106;
}

upstream Administration {
 server 127.0.0.1:8100;
 server 127.0.0.1:8104 backup;
 server 127.0.0.1:8105 backup;
}
upstream Queued {
 # least_conn; disabled, as the distribution was uneven
 server 127.0.0.1:8101;
 server 127.0.0.1:8102;
 server 127.0.0.1:8103;
 server 127.0.0.1:8104 backup;
}
upstream SPARQL {
 # least_conn; also disabled
 server 127.0.0.1:8104;
 server 127.0.0.1:8105;
 server 127.0.0.1:8103 backup;
}
upstream Scheduled {
 # least_conn;
 server 127.0.0.1:8104;
 server 127.0.0.1:8105;
 server 127.0.0.1:8103 backup;
}
upstream Service {
 server 127.0.0.1:8106;
 server 127.0.0.1:8105 backup;
}

nginx debug log

the behaviour is apparent in syslog content

Oct 08 '24 11:10 lisp