[BUG] Scale down sorts by obsolete status but stops/removes containers in the reverse order
Description
This was initially logged in #11460, but only part of it was fixed. The fix applied in #11473 does prevent scale down from recreating a container incorrectly; however, it didn't fix the incorrect container ordering used to decide which containers to remove when scaling down.
If I scale up to 2 instances and then scale down to 1, the newest container is the one that's removed instead of the oldest. This is a problem if you scale up with --no-recreate to deploy a new version of an image and then scale down to get rid of the old one: the newest container is removed, so you're left with the old version of the image running.
The old sort logic before #11473:
sort.Slice(containers, func(i, j int) bool {
	return containers[i].Created < containers[j].Created
})
The new sort logic after #11473:
sort.Slice(containers, func(i, j int) bool {
	// select obsolete containers first, so they get removed as we scale down
	if obsolete, _ := mustRecreate(service, containers[i], recreate); obsolete {
		// i is obsolete, so must be first in the list
		return true
	}
	if obsolete, _ := mustRecreate(service, containers[j], recreate); obsolete {
		// j is obsolete, so must be first in the list
		return false
	}
	// For up-to-date containers, sort by container number to preserve low-values in container numbers
	ni, erri := strconv.Atoi(containers[i].Labels[api.ContainerNumberLabel])
	nj, errj := strconv.Atoi(containers[j].Labels[api.ContainerNumberLabel])
	if erri == nil && errj == nil {
		return ni < nj
	}
	// If we don't get a container number (?) just sort by creation date
	return containers[i].Created < containers[j].Created
})
The loop that follows:
for i, container := range containers {
	if i >= expected {
		// Scale Down
		container := container
		traceOpts := append(tracing.ServiceOptions(service), tracing.ContainerOptions(container)...)
		eg.Go(tracing.SpanWrapFuncForErrGroup(ctx, "service/scale/down", traceOpts, func(ctx context.Context) error {
			return c.service.stopAndRemoveContainer(ctx, container, timeout, false)
		}))
		continue
	}
	...
So more logic was added to check which containers are obsolete, falling back to creation date, which is great, but the order in which containers are removed is still wrong. Obsolete/old containers are placed first in the list, yet the scale-down loop treats the containers at the front of the list as the ones to keep and removes the ones further down, so I believe the up-to-date containers are the ones that end up removed.
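A minimal sketch of what I'd expect the comparator to look like, based only on the snippets above (the actual fix may well differ): sort up-to-date containers first so the obsolete ones land in the index >= expected range and get removed by the loop:
// Sketch only: invert the ordering so up-to-date containers come first
// and obsolete containers sort to the end of the slice.
sort.Slice(containers, func(i, j int) bool {
	obsoleteI, _ := mustRecreate(service, containers[i], recreate)
	obsoleteJ, _ := mustRecreate(service, containers[j], recreate)
	if obsoleteI != obsoleteJ {
		// keep up-to-date containers at the front of the list
		return !obsoleteI
	}
	// For containers with the same obsolete status, keep low container
	// numbers first, falling back to creation date as before.
	ni, erri := strconv.Atoi(containers[i].Labels[api.ContainerNumberLabel])
	nj, errj := strconv.Atoi(containers[j].Labels[api.ContainerNumberLabel])
	if erri == nil && errj == nil {
		return ni < nj
	}
	return containers[i].Created < containers[j].Created
})
With this ordering, the loop above removes the obsolete containers first while keeping the expected number of up-to-date ones running.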
Steps To Reproduce
- Create a compose file with a single service and no scale options set, then run compose up to launch the container.
- Scale up with docker compose up -d --scale (service-name)=2 --no-recreate, which will launch another container for the same service.
- Once this is confirmed to be running, scale back down to 1 instance using docker compose up -d --scale (service-name)=1 --no-recreate.
- The newer container is removed, which shouldn't be the case; the older container should be removed.
Even if the image used for container 2 is newer, container 2 is still removed first, leaving container 1 running the older image.
Compose Version
Docker Compose version v2.27.0
Docker Environment
Client: Docker Engine - Community
 Version: 26.1.1
 Context: default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
   Version: v0.14.0
   Path: /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
   Version: v2.27.0
   Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 27
 Server Version: 26.1.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-1019-azure
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.742GiB
 Name: dravubnt02
 ID: e55ebd04-5ca1-46e0-9c6f-2036e43445ae
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Anything else?
No response
The newer container is removed, which shouldn't be the case, the older container should be removed.
Nope, there's no reason the older one should be removed as long as both match the service configuration. Having a container running for a longer time doesn't mean it's obsolete, nor that a "fresh new one" would perform better (in many cases it's actually the opposite, with JIT compilers / runtime optimisation).
Actually, removing the newer one keeps the replica numbers small. Otherwise, after a few scale up/down cycles you would end up with 2- or 3-digit numbers in container names.
Is there a specific use-case that would require a distinct logic?
Interesting and fair! From the initial report (and also why I subscribed) removing the older container would be perfect for rolling updates. That's at least personally the only reason I'd consider running Swarm on a single machine.
Maybe a CLI option would suffice if the existing behavior is still valuable, or even creating container names from missing indices instead of the next one up, although again I personally wouldn't mind at all.
Forgive my ignorance if I've missed any details though, and thank you for the awesome tool you have provided!
Is there a specific use-case that would require a distinct logic ?
Rolling updates are the main reason. I should clarify the purpose of the ticket, as it may not be clear that I'm referring to the newer version of an image being disposed of instead of the previous version. I absolutely agree it shouldn't just be used to replace a container running the same image; it's better to keep the older container if it's still valid, and we can force a recreation for that if needed.
The previous changes to the logic mean it no longer just goes by creation date but checks whether any of the running containers are obsolete (because they're running an older version of the image, etc.), which is great, but the logic puts the obsolete container first.
Later in the code it then uses a for loop, and once the index reaches the expected container count it starts removing them, meaning the obsolete container remains and the valid one gets removed.
If I have two containers that use the same image tag (latest):
- Container1 - Image tag: latest - Image created 01/11/24
- Container2 - Image tag: latest - Image created 03/11/24 (a newer version of the image, so Container1 is now obsolete)
It sorts them into the order [Container1, Container2] (obsolete first, non-obsolete second), then runs the for loop: the first is within our desired scale of 1, so the second is removed (see the sketch below). I think the sorting just needs to be reversed so that valid containers come first and obsolete ones last.
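To make this concrete, here's a small self-contained sketch (toy types and hard-coded values for illustration only, not the real compose code) that mimics the obsolete-first sort and the i >= expected loop; it keeps the obsolete container and removes the up-to-date one:
package main

import (
	"fmt"
	"sort"
)

// fakeContainer is a stand-in for the real container type, for illustration only.
type fakeContainer struct {
	name     string
	obsolete bool // pretend the recreate check flagged this container as out of date
}

func main() {
	containers := []fakeContainer{
		{name: "Container2", obsolete: false}, // runs the newer image
		{name: "Container1", obsolete: true},  // runs the older image
	}

	// Mimics the current comparator: obsolete containers sort first.
	sort.Slice(containers, func(i, j int) bool {
		if containers[i].obsolete {
			return true
		}
		if containers[j].obsolete {
			return false
		}
		return containers[i].name < containers[j].name
	})

	expected := 1 // desired scale after scaling down

	// Mimics the scale-down loop: everything at index >= expected is removed.
	for i, c := range containers {
		if i >= expected {
			fmt.Printf("removing %s (obsolete=%v)\n", c.name, c.obsolete)
			continue
		}
		fmt.Printf("keeping %s (obsolete=%v)\n", c.name, c.obsolete)
	}
	// Prints:
	// keeping Container1 (obsolete=true)
	// removing Container2 (obsolete=false)
}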
A rolling upgrade means the image has been updated, and as such the container should be detected as not up-to-date and will be re-created.
If I use --no-recreate in the command when scaling down, I'm saying "don't recreate a container", so don't recreate the existing one that's now obsolete. I scale from 1 to 2 instances while also using --no-recreate, so that I have one out-of-date instance and one up-to-date instance.
When scaling down, the out-of-date instance is now the obsolete one.
As you're basically bypassing compose logic intentionally, don't be surprised the result is not consistent with your expectations. What you're looking for is for compose to have an actual rolling upgrade policy to update obsolete containers. Please file a feature request for this purpose.
I'm confused then. Your own comment on #11460 suggested it should be checking for obsolete containers and not just going by the created date. That check has now been added, but the wrong container is being removed when scaling down, and it's now not a bug?
Let me double check.
You're right. I created https://github.com/docker/compose/pull/12272 to cover this scenario. Please note the replica number (in container names) will grow each and every time you use this "rolling upgrade" hack.
Thanks! Appreciate the fix