credhub-release

monit timeout needs to be updated to align with configurable timeout added in 2.6.0 for wait_for_uaa script

Open fbehrens51 opened this issue 5 years ago • 4 comments

What version of the credhub server you are using? 2.6.0

What version of the credhub cli you are using? 2.7.0

If you were attempting to accomplish a task, what was it you were attempting to do? Trying to deploy credhub using the credhub-colocated.yml ops file that is part of the concourse-bosh-deployment project (currently working from a fork, locally).

bosh deploy -n -d concourse \
     ~/workspace/concourse-bosh-deployment/cluster/concourse.yml \
  -l vars-file.yml \
  -l ~/workspace/concourse-bosh-deployment/versions.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/basic-auth.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/enable-lets-encrypt.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/github-auth.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/privileged-http.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/privileged-https.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/scale.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/web-network-extension.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/worker-ephemeral-disk.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/uaa.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/secure-internal-postgres.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/secure-internal-postgres-uaa.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/credhub-colocated.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/secure-internal-postgres-credhub.yml \
  -o ~/workspace/concourse-bosh-deployment/cluster/operations/credhub-custom-uaa-wait.yml

where credhub-custom-uaa-wait.yml is:

- type: replace
  path: /instance_groups/name=web/jobs/name=credhub?/properties/credhub/authentication/uaa/wait_for_start_max_timeout?
  value: ((wait_for_start_max_timeout))

- type: replace
  path: /instance_groups/name=web/jobs/name=credhub?/properties/credhub/authentication/uaa/wait_for_start_connect_timeout?
  value: ((wait_for_start_connect_timeout))

and my latest versions.yml is:

# this file is partially maintained by CI; the concourse and garden-runc
# versions and sha1s are automatically bumped, while the rest are preserved
# as-is.
#
# this should make getting started easy while being easy enough to maintain
# manually. feel free to PR sane defaults along with newly supported
# infrastructures and such!
---
concourse_version: '6.2.0'
concourse_sha1: '3c59cac5d5faae5f058fafaa1b501c34b084adba'
bpm_version: '1.1.8'
bpm_sha1: 'c956394fce7e74f741e4ae8c256b480904ad5942'
postgres_version: '41'
postgres_sha1: '4488d08ff54117a9d904f6e2f27c82c1cf4c910e'
windows_utilities_version: '0.11.0'
windows_utilities_sha1: 'efc10ac0f4acae23637ce2c6f864d20df2e3a781'
bbr_sdk_version: '1.15.0'
bbr_sdk_sha1: 'b2d8584dd2ed964c4849cb6d7b536e6cea3e6e8d'
uaa_version: '74.20.0'
uaa_sha1: '0909c912ff4541f4388a0534e5b3b8e3688dc60f'
credhub_version: '2.6.0'
credhub_sha1: 'c45af16ed393bb3cf061b8749e3ee4cae74ce995'

What did you expect to happen? For the credhub job on the web instance to start/deploy successfully

What was the actual behavior? In v2.5.7 (and 2.5.11), the credhub job fails most of the time on a deploy, or on a bosh recreate of the web instance, because uaa takes too long to start. In those versions, the timeout in the wait_for_uaa script is hard-coded to 5 seconds. I saw that the timeouts in wait_for_uaa were parameterized in 2.6.0, so I switched to that version and set the wait_for_start_max_timeout property to 120 (the current default is 300). The wait_for_uaa script now succeeds, but the credhub job still fails: the monit start timeout defaults to 30 seconds, so monit gives up while the first start is still waiting, tries to start credhub again, and the subsequent attempts fail with port 8844 already in use.
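To illustrate the mismatch, here is a minimal sketch of a polling loop with a maximum overall timeout. It is not the actual wait_for_uaa script from the release; the function name `wait_for` and the health URL in the comment are assumptions made up for illustration. The point is that such a loop can legitimately block for the full maximum timeout (120 seconds in my case), which is much longer than the 30 seconds monit allows for the start program.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real wait_for_uaa script.
# wait_for MAX_TIMEOUT_SECONDS COMMAND [ARGS...]
# Retries COMMAND once per second until it succeeds or the deadline passes.
wait_for() {
  wf_max_timeout="$1"
  shift
  wf_deadline=$(( $(date +%s) + wf_max_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$wf_deadline" ]; then
      return 1   # dependency never came up within the max timeout
    fi
    sleep 1
  done
  return 0       # dependency is up
}

# Example call (the URL is a placeholder, not a real endpoint):
#   wait_for 120 curl --silent --fail --connect-timeout 5 \
#     --output /dev/null "https://uaa.example.internal:8443/healthz"
```

If the start program runs a loop like this before launching the server, the monit start timeout has to be at least as long as the maximum wait, otherwise monit will attempt another start while the first one is still in progress.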

Locally, I hacked together a customized version which appears to fix the issue. I used the wait_for_start_max_timeout value to set a timeout on the start call in the monit file, though this could be a new, dedicated property instead.

<% if p('bpm.enabled') %>
check process credhub
  with pidfile /var/vcap/sys/run/bpm/credhub/credhub.pid
  start program "/var/vcap/jobs/bpm/bin/bpm start credhub" with timeout <%= p("credhub.authentication.uaa.wait_for_start_max_timeout") %> seconds
  stop program "/var/vcap/jobs/bpm/bin/bpm stop credhub"
  group vcap
<% else %>
check process credhub
  with pidfile /var/vcap/sys/run/credhub/pid
  start program "/var/vcap/jobs/credhub/bin/ctl start" with timeout <%= p("credhub.authentication.uaa.wait_for_start_max_timeout") %> seconds
  stop program "/var/vcap/jobs/credhub/bin/ctl stop"
  group vcap
<% end %>

Please confirm where necessary:

  • [ ] I have included a log output
  • [ ] My log includes an error message
  • [ ] I have included steps for reproduction

If you are a PCF customer with an Operation Manager (PCF Ops Manager) please direct your questions to support (https://support.pivotal.io/)

fbehrens51 • Jun 11 '20 19:06

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/173296784

The labels on this github issue will be updated when the story is started.

cf-gitbot • Jun 11 '20 19:06

Is there any resolution for this issue? I'm on release version 2.9.0 and the same timeout problem still occurs.

sonmacharius • Nov 09 '20 21:11

The fix had CI failures and was reverted, so I'm reopening the issue.

swalchemist • Nov 11 '22 00:11

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

cf-gitbot • Nov 11 '22 00:11