
cc.api_post_start_healthcheck_timeout_in_seconds doesn't seem to matter

Open sethboyles opened this issue 2 years ago • 1 comments

Issue

The cc.api_post_start_healthcheck_timeout_in_seconds field doesn't seem to have any effect.

Context

The cc.api_post_start_healthcheck_timeout_in_seconds config field seems to imply that post-start will fail if CCNG fails to report healthy within that time frame:

https://github.com/cloudfoundry/capi-release/blob/6f0f64cd6738fa260294c00315a58658b052f121/jobs/cloud_controller_ng/spec#L349-L351

However, we pushed a change to CCNG that slept for 10 hours before starting the thin server (in the runner), and this timeout was not triggered. Instead, 20 minutes passed until both ccng_monit_http_healthcheck and nginx_cc failed.

Task 277 | 22:03:30 | L starting jobs: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary)
Task 277 | 22:03:57 | Updating instance scheduler: scheduler/7d611e5a-1ada-41a4-b811-49b5dbcb2b2f (0) (canary) (00:02:09)
Task 277 | 22:23:32 | Updating instance api: api/abf448d3-74e1-4086-80c2-130814373e14 (0) (canary) (00:21:44)
                    L Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
Task 277 | 22:23:32 | Error: 'api/abf448d3-74e1-4086-80c2-130814373e14 (0)' is not running after update. Review logs for failed jobs: ccng_monit_http_healthcheck, nginx_cc
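As a quick check of the "20 minutes" figure, the gap between the "starting jobs" line and the final error line in the log above can be computed directly. This is only arithmetic on the two timestamps copied from the log; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the task log above (same day, so no date is needed).
jobs_started = datetime.strptime("22:03:30", "%H:%M:%S")
update_failed = datetime.strptime("22:23:32", "%H:%M:%S")

# Time between "starting jobs" and the "not running after update" error.
elapsed = update_failed - jobs_started
elapsed_seconds = int(elapsed.total_seconds())

print(f"{elapsed_seconds // 60} min {elapsed_seconds % 60} s")  # 20 min 2 s
```

So the failure arrives almost exactly 20 minutes after the jobs were started, regardless of the configured healthcheck timeout.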

So this check in the post-start script:

https://github.com/cloudfoundry/capi-release/blob/6f0f64cd6738fa260294c00315a58658b052f121/jobs/cloud_controller_ng/templates/post-start.sh.erb#L71

doesn't seem to matter anymore.
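For readers unfamiliar with this kind of check, here is a minimal sketch of the behavior the property name implies: poll a health probe until it succeeds or a deadline passes. It is not the contents of post-start.sh.erb (which is a shell template); the function name `wait_until_healthy`, its parameters, and the injectable `clock` and `sleep` hooks are all hypothetical and exist only for illustration.

```python
import time
from typing import Callable


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    timeout_seconds: float,
    interval_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_healthy until it returns True or timeout_seconds elapse.

    Returns True if the probe succeeded before the deadline, False otherwise.
    clock and sleep are injectable so the loop can be exercised without
    actually waiting (an assumption made for this sketch).
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            # The caller would turn this into a failed post-start.
            return False
        sleep(interval_seconds)
```

Under this model, a process that sleeps for 10 hours before serving requests would make the loop return False once the configured timeout elapses, which is the expected result described below. The observed 20-minute failure suggests the deploy stops before a loop like this ever reaches its deadline.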

However, when we tried this experiment on older versions of CAPI, we observed that the deploy fails in the post-start script after the configured time.

We believe this regression was introduced in this PR: https://github.com/cloudfoundry/capi-release/pull/195/files, which reconfigured the monit dependencies between the processes.

What we are not clear on is whether this regression reintroduced the issue that prompted the introduction of that post-start check in the first place: https://github.com/cloudfoundry/capi-release/issues/125.

Steps to Reproduce

Add a long sleep to this line: https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib%2Fcloud_controller%2Frunner.rb#L88 and deploy

Expected result

The post-start script to fail after the configured cc.api_post_start_healthcheck_timeout_in_seconds value

Current result

Deploy fails after 20 minutes.

Possible Fix

Not sure. Is this something we need to address?

sethboyles avatar Mar 08 '22 23:03 sethboyles

To me, the check in the post-start script seems to have been a workaround for the wrong dependency chain. Now the update lifecycle fails at step 4 (monit start), rendering the workaround (i.e. the check in step 5) obsolete. So from my point of view, the post-start script could be adjusted and the config property removed.

philippthun avatar Mar 21 '22 12:03 philippthun