marathon-lb reload bug

Open Sisphyus opened this issue 7 years ago • 0 comments

Last week we updated a core service in our production environment (built with DC/OS) and accidentally made a mistake when changing the health check configuration. External requests returned 503 the whole time, until we corrected the health check configuration and restarted the service. Throughout this, the old instance was shown as healthy on the Marathon page, so we think something went wrong when marathon-lb reloaded.

Why does the old healthy instance stop serving traffic after we deploy a bad health check? As far as we know, nothing changes on the old healthy instance when a new unhealthy instance is launched in the same application.

Test and Verification (marathon-lb version 1.12.1)

Launch a new nginx test application (listening on port 80) with the health check on port 80.
Change the health check port to 81. Marathon launches a new instance whose state never becomes healthy; at this point the nginx backend in haproxy.cfg has two servers.
Test external access.

haproxy.cfg

before reload

backend nginx-lbl-test_10278
  balance roundrobin
  mode http
  option forwardfor
  http-request set-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 80

after reload

backend nginx-lbl-test_10278
  balance roundrobin
  mode http
  option forwardfor
  http-request set-header X-Forwarded-Port %[dst_port]
  http-request add-header X-Forwarded-Proto https if { ssl_fc }
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 81
  server 10_168_0_82_9_0_5_12_80 9.0.5.12:80 check inter 5s fall 4 port 81
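
To make the difference between the two configs easier to spot, here is a minimal Python sketch that lists, for each backend, the traffic port and the health check port of every server line. It is not part of marathon-lb; the function name `check_ports` and the `AFTER_RELOAD` sample are hypothetical and exist only for illustration. It assumes server lines use the same `address:port ... port N` layout shown above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 81
SERVER_RE = re.compile(
    r"^\s*server\s+(?P<name>\S+)\s+(?P<addr>\S+):(?P<port>\d+)\s+(?P<opts>.*)$"
)
# The "port N" option on a server line sets the port used for health checks.
CHECK_PORT_RE = re.compile(r"(?:^|\s)port\s+(\d+)")


def check_ports(cfg_text):
    """Return {backend: [(address, traffic_port, check_port), ...]}."""
    result = defaultdict(list)
    backend = None
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("backend "):
            backend = stripped.split()[1]
            continue
        match = SERVER_RE.match(line)
        if match and backend:
            traffic_port = int(match.group("port"))
            port_opt = CHECK_PORT_RE.search(match.group("opts"))
            # Without an explicit "port" option, HAProxy checks the traffic port.
            check_port = int(port_opt.group(1)) if port_opt else traffic_port
            result[backend].append((match.group("addr"), traffic_port, check_port))
    return dict(result)


# Hypothetical sample: the "after reload" backend from this report.
AFTER_RELOAD = """\
backend nginx-lbl-test_10278
  balance roundrobin
  mode http
  server 10_168_0_82_9_0_5_7_80 9.0.5.7:80 check inter 5s fall 4 port 81
  server 10_168_0_82_9_0_5_12_80 9.0.5.12:80 check inter 5s fall 4 port 81
"""

if __name__ == "__main__":
    for name, servers in check_ports(AFTER_RELOAD).items():
        for addr, traffic_port, check_port in servers:
            print(name, addr, "traffic:", traffic_port, "check:", check_port)
```

Run against the "after reload" config, both servers, including the old instance 9.0.5.7, report check port 81. Since nginx only listens on port 80, HAProxy would presumably mark both servers down, which matches the 503 responses we saw.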

So why was the old instance's health check configuration updated as well?

This is a serious problem when updating applications in a production environment: HAProxy failover stops working as soon as a bad health check is deployed, even though the old healthy instance is still alive.

Oct 15 '18 03:10 Sisphyus