[BUG] Scale down sorts by obsolete status but stops/removes containers in the reverse order
Description
This was initially logged in #11460, but only part of it was fixed. The fix applied in #11473 does prevent scale down from recreating a container incorrectly; however, it didn't fix the incorrect container ordering used to decide which containers to remove when scaling down.
If I scale up to 2 instances and then scale down to 1, the newest container is the one that's removed instead of the oldest. This is a problem if you scale up with --no-recreate to deploy a new version of an image and then scale down to get rid of the old one: the newest container is removed, so you're left with the old version of the image running.
The old sort logic before #11473:
sort.Slice(containers, func(i, j int) bool {
	return containers[i].Created < containers[j].Created
})
The new sort logic after #11473:
sort.Slice(containers, func(i, j int) bool {
	// select obsolete containers first, so they get removed as we scale down
	if obsolete, _ := mustRecreate(service, containers[i], recreate); obsolete {
		// i is obsolete, so must be first in the list
		return true
	}
	if obsolete, _ := mustRecreate(service, containers[j], recreate); obsolete {
		// j is obsolete, so must be first in the list
		return false
	}
	// For up-to-date containers, sort by container number to preserve low-values in container numbers
	ni, erri := strconv.Atoi(containers[i].Labels[api.ContainerNumberLabel])
	nj, errj := strconv.Atoi(containers[j].Labels[api.ContainerNumberLabel])
	if erri == nil && errj == nil {
		return ni < nj
	}
	// If we don't get a container number (?) just sort by creation date
	return containers[i].Created < containers[j].Created
})
The loop that follows:
for i, container := range containers {
	if i >= expected {
		// Scale Down
		container := container
		traceOpts := append(tracing.ServiceOptions(service), tracing.ContainerOptions(container)...)
		eg.Go(tracing.SpanWrapFuncForErrGroup(ctx, "service/scale/down", traceOpts, func(ctx context.Context) error {
			return c.service.stopAndRemoveContainer(ctx, container, timeout, false)
		}))
		continue
	}
	...
So more logic was added to check which containers are obsolete, falling back to creation date, which is great, but the order in which containers are removed is still wrong. Obsolete/old containers are placed first in the list, yet the scale-down loop treats the containers at the front of the list as the ones to keep and removes the ones further down, so I believe the up-to-date containers are the ones that end up removed.
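A minimal sketch of what I'd expect the comparator to look like, based only on the snippets above (the actual fix may well differ): sort up-to-date containers first so the obsolete ones land in the index >= expected range and get removed by the loop:
// Sketch only: invert the ordering so up-to-date containers come first
// and obsolete containers sort to the end of the slice.
sort.Slice(containers, func(i, j int) bool {
	obsoleteI, _ := mustRecreate(service, containers[i], recreate)
	obsoleteJ, _ := mustRecreate(service, containers[j], recreate)
	if obsoleteI != obsoleteJ {
		// keep up-to-date containers at the front of the list
		return !obsoleteI
	}
	// For containers with the same obsolete status, keep low container
	// numbers first, falling back to creation date as before.
	ni, erri := strconv.Atoi(containers[i].Labels[api.ContainerNumberLabel])
	nj, errj := strconv.Atoi(containers[j].Labels[api.ContainerNumberLabel])
	if erri == nil && errj == nil {
		return ni < nj
	}
	return containers[i].Created < containers[j].Created
})
With this ordering, the loop above removes the obsolete containers first while keeping the expected number of up-to-date ones running.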
Steps To Reproduce
- Create a compose file with a single service and no scale options set, then run compose up to launch the container.
- Scale up with docker compose up -d --scale (service-name)=2 --no-recreate, which will launch another container for the same service.
- Once this is confirmed to be running, scale back down to 1 instance using docker compose up -d --scale (service-name)=1 --no-recreate.
- The newer container is removed, which shouldn't be the case; the older container should be removed.
Even if the image used for container 2 is newer, container 2 is still removed first, leaving container 1 running the older image.
Compose Version
Docker Compose version v2.27.0
Docker Environment
Client: Docker Engine - Community
 Version: 26.1.1
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.14.0
   Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
   Version: v2.27.0
   Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 27
 Server Version: 26.1.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-1019-azure
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.742GiB
 Name: dravubnt02
 ID: e55ebd04-5ca1-46e0-9c6f-2036e43445ae
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Anything else?
No response
The newer container is removed, which shouldn't be the case, the older container should be removed.
Nope, there's no reason the older one should be removed as long as both match the service configuration. Having a container running for a longer time doesn't mean it's obsolete, nor that a "fresh new one" would perform better (in many cases it's actually the opposite, with JIT compilers / runtime optimisation).
Actually, removing the newer one keeps the replica numbers small. Otherwise, after a few scale up/down cycles you would end up with 2- or 3-digit numbers in container names.
Is there a specific use-case that would require a distinct logic?
Interesting and fair! From the initial report (and also why I subscribed) removing the older container would be perfect for rolling updates. That's at least personally the only reason I'd consider running Swarm on a single machine.
Maybe a CLI option would suffice if the existing behavior is still valuable, or even creating container names from missing indices instead of the next one up, although again I personally wouldn't mind at all.
Forgive my ignorance if I've missed any details though, and thank you for the awesome tool you have provided!
Is there a specific use-case that would require a distinct logic ?
Rolling updates are the main reason. I should clarify the purpose of the ticket, as it may not be clear that I'm referring to the newer version of an image being disposed of instead of the previous version. I absolutely agree it shouldn't just be used to replace a container running the same image; it's better to keep the older container if it's still valid, and we can force a recreation for that if needed.
The previous changes to the logic mean it no longer just goes by creation date but checks whether any of the running containers are obsolete (because they're running an older version of the image, etc.), which is great, but the logic puts the obsolete container first.
Later in the code it then uses a for loop, and once the index reaches the expected container count it starts removing them, meaning the obsolete container remains and the valid one gets removed.
If I have two containers that use the same image tag (latest):
- Container1 - Image tag: latest - Image created 01/11/24
- Container2 - Image tag: latest - Image created 03/11/24 (a newer version of the image, so Container1 is now obsolete)
It sorts them into the order [Container1, Container2] (obsolete first, non-obsolete second), then runs the for loop: the first is within our desired scale of 1, so the second is removed (see the sketch below). I think the sorting just needs to be reversed so that valid containers come first and obsolete ones last.
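To make this concrete, here's a small self-contained sketch (toy types and hard-coded values for illustration only, not the real compose code) that mimics the obsolete-first sort and the i >= expected loop; it keeps the obsolete container and removes the up-to-date one:
package main

import (
	"fmt"
	"sort"
)

// fakeContainer is a stand-in for the real container type, for illustration only.
type fakeContainer struct {
	name     string
	obsolete bool // pretend the recreate check flagged this container as out of date
}

func main() {
	containers := []fakeContainer{
		{name: "Container2", obsolete: false}, // runs the newer image
		{name: "Container1", obsolete: true},  // runs the older image
	}

	// Mimics the current comparator: obsolete containers sort first.
	sort.Slice(containers, func(i, j int) bool {
		if containers[i].obsolete {
			return true
		}
		if containers[j].obsolete {
			return false
		}
		return containers[i].name < containers[j].name
	})

	expected := 1 // desired scale after scaling down

	// Mimics the scale-down loop: everything at index >= expected is removed.
	for i, c := range containers {
		if i >= expected {
			fmt.Printf("removing %s (obsolete=%v)\n", c.name, c.obsolete)
			continue
		}
		fmt.Printf("keeping %s (obsolete=%v)\n", c.name, c.obsolete)
	}
	// Prints:
	// keeping Container1 (obsolete=true)
	// removing Container2 (obsolete=false)
}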
A rolling upgrade means the image has been updated, and as such the container should be detected as not up-to-date and will be re-created.
If I use --no-recreate in the command when scaling down, I'm saying "don't recreate a container", so don't recreate the existing one that's now obsolete. I scale from 1 to 2 instances while also using --no-recreate, so that I have one out-of-date instance and one up-to-date instance.
When scaling down, the out-of-date instance is now the obsolete one.
As you're basically bypassing compose logic intentionally, don't be surprised the result is not consistent with your expectations. What you're looking for is for compose to have an actual rolling upgrade policy to update obsolete containers. Please file a feature request for this purpose.
I'm confused then. Your own comment on #11460 suggested it should be checking for obsolete containers and not just going by the created date. That check has now been added, but the wrong container is being removed when scaling down, and it's now not a bug?
Let me double check.
You're right. I created https://github.com/docker/compose/pull/12272 to cover this scenario. Please note the replica number (in container names) will grow each and every time you use this "rolling upgrade" hack.
Thanks! Appreciate the fix