nomad
Prefetch image before killing the running allocation on task update
Nomad version

Any (in particular, 0.8.7)
Issue
When a task is being updated (in particular, when the underlying Docker image is being updated), Nomad kills the currently running allocation (according to the update policy) and only then starts the new allocation; only at that point is the new image pulled. This can result in a service interruption when the service must run as a singleton, or when the update is configured incorrectly (i.e., the image fails to pull within the update interval): the old allocation is already killed, but the new one is still pulling the image.
Would it make sense to have an additional parameter in the `update` stanza that would make Nomad prefetch the new image and only then start the rollover? Or, taking it one step further, would it make sense to make Nomad kill the existing allocation only once the new one is reporting "healthy" service checks?
This would certainly help when trying to update system jobs, especially when the image is rather large.
Canary deployments could help with this: the canary starts before the present allocation is killed.
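For service jobs, this canary behavior is already expressible in the `update` stanza. A minimal sketch (job name and values are illustrative, not from this thread):

```hcl
job "web" {
  type = "service"

  update {
    canary       = 1        # start the replacement alloc alongside the old one
    max_parallel = 1
    health_check = "checks" # wait for service checks to pass
    auto_promote = true     # replace the old alloc only after the canary is healthy
  }

  # group/task definitions omitted for brevity
}
```

With this configuration the image pull happens in the canary allocation while the old allocation keeps serving traffic; the limitation discussed in this issue is that system jobs don't support canaries.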
I don't think system jobs support canaries, although according to the documentation that might be supported at a later date.
Some of our system tasks also have static port mapping.
Enterprise support customer here:
This seems like something that would be quite useful for us. We currently deploy these - unfortunately - very massive Windows images that take quite a while to download. This causes fun issues where sometimes the registry times out - Windows image support isn't super great on certain registries, especially for multi-gigabyte images - or the container may fail to start for whatever reason, causing outages if that container cannot run more than 1 allocation at a time for whatever reason.
Ideally, we pull the image prior to a deploy, so that the only thing left is to start the container once the previous one is stopped. This decreases the downtime between deploys, and also ensures that we don't hit pull issues when the cache isn't fresh for some of those larger images.
This issue hits us as well.
In our setup, there is a slow, laggy network that we can't control, so a Docker image pull can take 2-3 hours, and `docker pull` can also hang or fail with 'EOF', leaving us in the middle of a 2-3h deployment with a DEAD replica. As a result, we always get HOURS of downtime for each replica on each deploy, because Nomad kills the alloc first and only then starts the slow docker pull.
Also, Nomad cleans up unused Docker images so quickly that when you redeploy an alloc that previously failed or gave up pulling, it starts pulling the image again from scratch, because Nomad has already cleaned up all of that image's layers. This creates an endless loop.
This is strange after the k8s experience, because k8s always pulls the required images before touching any running container, thus reducing the downtime of each container replica. It is especially strange when Nomad kills some of your critical allocations, then tries to pull the image... and the pull hangs or fails due to, e.g., a bad network or a typo in the image tag.
Even if you don't run the app as a singleton and you have a faster network or smaller images - why should a replica of your app be offline for minutes instead of seconds during each deploy while Nomad pulls the image?
Sorry for the silence from HashiCorp on this! To be clear, we think it's a great idea.
For the most straightforward case, where an image is reused when a task is being updated on a node, `image_delay` should be sufficient to avoid re-pulling. If it's not working as intended, please file a new issue with repro steps and/or logs!
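For reference, `image_delay` lives in the Docker driver's plugin block in the Nomad client configuration; a sketch with an illustrative value:

```hcl
# Nomad client (agent) configuration
plugin "docker" {
  config {
    gc {
      image       = true
      image_delay = "1h"  # keep unused images cached for an hour before GC
    }
  }
}
```

A longer delay trades disk space for faster task updates, since the old layers stay in the local Docker cache across redeploys.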
Sadly, Nomad still does not have a solution for the more general case of leaving existing workloads in place until their replacements are ready to run (in particular: until their images are pulled). This would require new points of coordination in Nomad:
- A `NextAllocWatcher` to block shutdown signals until the replacement alloc has been set up (the inverse of `PrevAllocWatcher`).
- A `DriverPlugin.Prestart` hook for drivers to perform prestart tasks like image pulling. `NextAllocWatcher` on the old alloc would block shutdown until `Prestart` on the new alloc completes. (The new alloc may even proceed to block on `PrevAllocWatcher` for the old alloc to shut down! That should Just Work as implemented today.)
- Alternatively, add a new artifact management capability to give operators control over when images and other artifacts are pulled and gc'd.
Hack: sysbatch image pulls
Nomad v1.2.0 implemented system batch (`sysbatch`) jobs, which, like system jobs, run on all nodes by default but, like batch jobs, are expected to exit and not be restarted if they exit without an error.
You can use a sysbatch job that requires the same image as your service, but specifies a different entrypoint to exit immediately instead of running your target service. Once the sysbatch job completes, you can deploy your service and the image will already be pulled!
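A minimal sketch of such a prefetch job (the job name and image are placeholders, and `command = "/bin/true"` assumes the image ships a binary that exits immediately; substitute whatever no-op your image provides):

```hcl
job "prefetch-myapp" {
  type = "sysbatch"  # runs once on every eligible node

  group "pull" {
    task "pull" {
      driver = "docker"

      config {
        # Same image your service job uses, so the layers land in the cache.
        image   = "registry.example.com/myapp:v2"
        # Override the entrypoint so the task exits immediately after the pull.
        command = "/bin/true"
      }
    }
  }
}
```

Run this job, wait for it to complete on all nodes, then deploy the service that references the same image tag.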
The major caveat here is that, by default, the image will only be cached for 3 minutes after the sysbatch job exits. This can be tuned by configuring `image_delay` on a per-node basis, but it still requires you to configure your nodes with knowledge of your workload's image pulling. Very awkward and error prone.
I want a real solution for this, but hopefully this workaround helps some in the meantime.
Any updates on this?
Hi @valafon, we do not have any further update unfortunately. Once we do, a member of the Nomad team will provide an update via this issue.