Add draining for cc_uploader
We propose adding proper draining behavior to the cc_uploader process. This would ensure that ongoing droplet uploads after staging are allowed to complete gracefully when a Cloud Controller (CC) API VM is drained or restarted (e.g., during a BOSH recreate or deployment update). We observed cases where a droplet upload after staging failed because the cc_uploader process was terminated when the CC API VM was updated. We believe implementing a drain script or mechanism for cc_uploader would allow the process to wait for uploads to complete before shutting down.
There was a first approach: https://github.com/cloudfoundry/cloud_controller_ng/pull/4296 https://github.com/cloudfoundry/cc-uploader/pull/195 https://github.com/cloudfoundry/capi-release/pull/529. However, we had to revert capi-release PR 529 because it broke CF deployments where cc_uploader is not on the same VM as the cloud_controller_ng job.
We would like to discuss what implementation options we have.
First thoughts by @Samze were written down in the revert PR:
- Investigate a way to generically drain uploads from any source in nginx rather than specifically in cc_uploader. The current timeout is 10 seconds.
- Find a way to sync draining between jobs without requiring co-location.
- Keep the co-location requirement for draining but behind a capi-property.
My thoughts:
From a system-design perspective, I don't think we want to couple the CC Uploader's lifecycle to Cloud Controller. It should be independent and be treated like any other CC API client. Also, the more flexible we can be with job placement on instance groups, the better: that gives us room to re-arrange things in the future (e.g. if certain jobs have more similar scaling needs or for deploy-order reasons).
The jobs that are currently force-colocated with CC (e.g. nginx, local_workers) have a dependency on using the same filesystem as CC API. Looking at the cf-deployment history, it appears CC Uploader was originally co-located with CC API to reduce the number of instance groups, not for any architectural reason.
Furthermore, having CC API kill CC Uploader doesn't solve the problem for uploads generally. For example, external clients could be uploading packages or droplets to CC when it is draining. I'm a bit surprised that we don't already have something in the nginx drain to account for uploads-in-progress.
Also, I don't think that implementation solves the problem for larger deployments, since any CC Uploader can upload to any CC API (there is no same-host affinity). I think it's still possible that a later-rolling CC Uploader uploading to an earlier-rolling CC API could have its upload cancelled.
Finally, the BOSH maintainers are hoping/planning to replace Monit some day, so it would be helpful to avoid adding new Monit usage.
So, as far as a solution, I believe the underlying problem is:
If an api VM is stopped and starts draining, give upload jobs the chance to finish their in-progress uploads and then shut down gracefully, allowing the processes to terminate cleanly.
It seems preferable to solve the upload-in-progress drain problem generically in CC API (nginx) vs having CC API reach out to kill CC Uploader.
It looks like there is currently a 10 second graceful shutdown for nginx (assuming that plays nicely with the nginx upload module, which I would assume until proven otherwise): https://github.com/cloudfoundry/cloud_controller_ng/blob/5c4dac049bd28979284aeab1efa48c7075676131/lib/cloud_controller/drain.rb#L9
The proposed implementation was going to wait 900 seconds (15 minutes) for CC Uploader to drain: https://github.com/cloudfoundry/cloud_controller_ng/blob/5c4dac049bd28979284aeab1efa48c7075676131/lib/cloud_controller/drain.rb#L12
So, one option could be to move the 15 minute drain timeout to nginx, instead of cc uploader. If that long of a drain slows down deploys too much (since it applies to all network connections and could be a DoS vector), maybe there is a sweet spot between 10 seconds and 15 minutes that reduces the number of failed builds, but doesn't slow down the drain too much.
cc-uploader translates a synchronous droplet upload request by Diego (POST /v1/droplet/:guid?cc-droplet-upload-uri=:uri&timeout=900) into an asynchronous droplet upload process as implemented by the CC (POST /internal/v4/droplets/:guid/upload + polling of the returned job id via GET /v3/jobs/:guid).
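For illustration, here is a minimal Go sketch of that translation. All handler/helper names, the job URL, and the response shapes are assumptions for the sake of the example; the real cc-uploader code is structured differently:

```go
// Hypothetical sketch of the synchronous-to-asynchronous translation described
// above; names, URLs, and response shapes are assumptions, not cc-uploader's code.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// handleDropletUpload accepts Diego's synchronous upload request, forwards the
// droplet bits to the CC upload URI from the query string, then polls the
// resulting job before answering Diego.
func handleDropletUpload(w http.ResponseWriter, r *http.Request) {
	ccUploadURI := r.URL.Query().Get("cc-droplet-upload-uri")

	// Forward the droplet bits to the CC's asynchronous upload endpoint.
	req, err := http.NewRequestWithContext(r.Context(), http.MethodPost, ccUploadURI, r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Assumed response shape: the CC returns the guid of the async upload job.
	var job struct {
		GUID string `json:"guid"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&job); err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}

	// Only report success to Diego once the CC job has completed.
	if err := pollJob(r.Context(), "https://cc.internal.example/v3/jobs/"+job.GUID); err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// pollJob polls the CC job resource until it is COMPLETE, FAILED, or the
// request context is cancelled.
func pollJob(ctx context.Context, jobURL string) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(jobURL)
			if err != nil {
				return err
			}
			var body struct {
				State string `json:"state"`
			}
			err = json.NewDecoder(resp.Body).Decode(&body)
			resp.Body.Close()
			if err != nil {
				return err
			}
			switch body.State {
			case "COMPLETE":
				return nil
			case "FAILED":
				return fmt.Errorf("upload job %s failed", jobURL)
			}
		}
	}
}
```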
cc-uploader should implement draining that ensures a running droplet upload can finish before shutdown (including polling of the upload job and sending 200 to Diego). 15 min seems to be a good draining timeout (it should be configurable and match the timeout configured in Diego for the droplet upload). The cc-uploader draining should be completely independent of CC draining, as pointed out by @greg. When using cf-deployment, cc-uploader happens to be deployed on the CC VM, but cc-uploader talks to any available CC. In other words: cc-uploader draining should cover only its own processing, while CC draining takes care that individual CC requests finish and that running local worker jobs finish.
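A minimal sketch of what such cc-uploader-side draining could look like, assuming a hypothetical in-flight tracker wrapped around the upload handler; the timeout would come from a job property and should match Diego's droplet upload timeout:

```go
// Hypothetical sketch of cc-uploader-side draining, independent of CC draining.
// The tracker and its names are illustrative, not the actual cc-uploader code.
package main

import (
	"net/http"
	"sync"
	"time"
)

// uploadTracker counts in-flight droplet uploads, including the job-polling
// phase, so that a shutdown can wait for them.
type uploadTracker struct {
	wg sync.WaitGroup
}

// Track wraps a handler so every request is counted until it fully completes,
// i.e. the CC job finished and 200 was sent back to Diego.
func (t *uploadTracker) Track(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		t.wg.Add(1)
		defer t.wg.Done()
		next(w, r)
	}
}

// Drain blocks until all in-flight uploads finish or the timeout elapses.
// The timeout should be configurable and match Diego's droplet upload timeout
// (900s in the request shown above).
func (t *uploadTracker) Drain(timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		t.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}
```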
Nginx and local worker have draining in place:
- nginx - 30s configurable with cc.nginx_drain_timeout + hardcoded 10s
- local worker - 5min configurable with cc.jobs.local.local_worker_grace_period_seconds
The default nginx draining might be too short for handling droplet uploads gracefully, but this can be fine-tuned.
New try: https://github.com/cloudfoundry/cc-uploader/pull/245 https://github.com/cloudfoundry/capi-release/pull/562 :)
Any thoughts on Alex's comment on the https://github.com/cloudfoundry/cc-uploader/pull/245 PR regarding ifrit?
ifrit runs counter to idiomatic Go, and this seems to deepen the dependency on ifrit. The community needs to decide whether we want to add more ifrit or move further towards idiomatic Go (channels, contexts, and goroutines) instead of subprocesses and OS signals.
I'm changing the 'normal' HTTP servers to ifrit-managed ones; ifrit was already introduced before. But if you would like to avoid ifrit here, we can still change the implementation.
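For comparison, here is a sketch of the idiomatic-Go lifecycle alluded to above: stdlib signal handling plus http.Server.Shutdown instead of ifrit-managed processes. It reuses the illustrative handleDropletUpload from the earlier sketch; the port and timeout are assumptions:

```go
// Hypothetical sketch of the idiomatic-Go alternative: stdlib signal handling
// and http.Server.Shutdown instead of ifrit-managed processes.
package main

import (
	"context"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	const drainTimeout = 15 * time.Minute // illustrative; would be configurable

	mux := http.NewServeMux()
	mux.HandleFunc("/v1/droplet/", handleDropletUpload) // handler from the sketch above
	srv := &http.Server{Addr: ":9091", Handler: mux}

	// ctx is cancelled on SIGTERM/SIGINT, e.g. when the job is stopped during drain.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	go func() { _ = srv.ListenAndServe() }()
	<-ctx.Done()

	// Shutdown stops accepting new uploads and blocks until in-flight requests
	// (including the job polling inside the handler) finish or the timeout elapses.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), drainTimeout)
	defer cancel()
	_ = srv.Shutdown(shutdownCtx)
}
```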