capi-release
capi-release copied to clipboard
cc.api_post_start_healthcheck_timeout_in_seconds doesn't seem to matter
Issue
The cc.api_post_start_healthcheck_timeout_in_seconds
field doesn't seem to have any effect.
Context
The cc.api_post_start_healthcheck_timeout_in_seconds
config field seems to imply that the post-start will fail if CCNG fails report healthy within that time frame:
https://github.com/cloudfoundry/capi-release/blob/6f0f64cd6738fa260294c00315a58658b052f121/jobs/cloud_controller_ng/spec#L349-L351
However, we pushed a changed to CCNG that slept for 10 hours before starting the thin server (in the runner), and this timeout did not get triggered. Instead, 20 minutes passed until both ccng_monit_http_healthcheck
and nginx_cc
failed.
Task 277 | 22:03:30 | L starting jobs: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) Task 277 | 22:03:57 | Updating instance scheduler: scheduler/7d611e5a-1ada-41a4-b811-49b5dbcb2b2f (0) (canary) (00:02:09)
Task 277 | 22:23:32 | Updating instance api: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) (00:21:44)
L Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
Task 277 | 22:23:32 | Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
So it seems this check in the post start script:
https://github.com/cloudfoundry/capi-release/blob/6f0f64cd6738fa260294c00315a58658b052f121/jobs/cloud_controller_ng/templates/post-start.sh.erb#L71
doesn't seem to matter anymore.
However, trying this experiment on older versions of CAPI, we observer that the deploy fails in the post-start script after the configured time.
We believe this regression was introduced in this PR: https://github.com/cloudfoundry/capi-release/pull/195/files, which reconfigured the monit dependencies between the processes.
What we are not clear on is if this is a regression that reintroduced the issue which prompted the introduction of that post-start check in the first place: https://github.com/cloudfoundry/capi-release/issues/125.
Steps to Reproduce
Add a long sleep to this line: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib%2Fcloud_controller%2Frunner.rb#L88 and deploy
Expected result
The post-start script to fail after the configured cc.api_post_start_healthcheck_timeout_in_seconds
value
Current result
Deploy fails after 20 minutes.
Possible Fix
Not sure, is this something we need to address?
For me the check in the post-start
script seems to have been a workaround for the wrong dependency chain. Now the update lifecycle fails at step 4 (monit start
) rendering the workaround (i.e. check in step 5) being obsolete. So from my point of view the post-start
script could be adjusted and the config property removed.