
Upgraded ecs agent causes Error loading previously saved state from BoltDB

Open · pzcfoo opened this issue on Mar 20, 2024 · 2 comments

Summary

Upgraded the ECS agent on an external instance. The ecs service keeps restarting; the agent fails after this error is logged:

Error loading previously saved state: failed to load previous data from BoltDB: failed to load task engine state: did not find the task of container

Description

Upgraded the ECS agent, but the service keeps restarting. See the log snippets below.
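A minimal way to see the restart loop and collect those logs on the instance (a sketch assuming the systemd unit name and log paths of a standard amazon-ecs-init install; your paths may differ):

sudo systemctl status ecs                       # shows the restart loop and the last exit status
sudo journalctl -u ecs --since "1 hour ago"     # ecs-init / agent startup output
sudo tail -n 50 /var/log/ecs/ecs-agent.log*     # agent's own log files, where the BoltDB error appears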

Environment Details

Ubuntu 22.04.2 LTS

ecs agent version

Package: amazon-ecs-init
Version: 1.82.0-1
Status: install ok installed
Priority: optional
Section: misc
Maintainer: ecs-agent-dev <[email protected]>
Installed-Size: 103 MB
Depends: libc6 (>= 2.3.4), systemd, docker-ce (>= 17.12.0) | docker-engine (>= 1.6.0) | docker-ee | docker.io
Homepage: https://aws.amazon.com/ecs
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: Starts the Amazon ECS Agent
 amazon-ecs-init may be run to register an EC2 instance as an Amazon ECS
 Container Instance.

docker info

 Client: Docker Engine - Community
 Version:    24.0.5
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.20.2
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 14
  Running: 11
  Paused: 0
  Stopped: 3
 Images: 15
 Server Version: 24.0.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: v1.1.8-0-g82f18fe
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.15.0-86-generic
 Operating System: Ubuntu 22.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 6
 Total Memory: 15.61GiB
 Name: nmlcaap135
 ID: 48edb6c9-be8d-4bf2-b7ee-fb3b6be57bac
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

df -h

Filesystem                              Size  Used Avail Use% Mounted on
tmpfs                                   1.6G  168M  1.4G  11% /run
/dev/mapper/ubuntu--vg-ubuntu--lv        98G  4.8G   89G   6% /
tmpfs                                   7.9G     0  7.9G   0% /dev/shm
tmpfs                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                   4.0M     0  4.0M   0% /sys/fs/cgroup
/dev/sda2                               2.0G  251M  1.6G  14% /boot
/dev/mapper/ubuntu--vg-ubuntu--lv--var   20G  8.1G   11G  44% /var
tmpfs                                   1.6G  4.0K  1.6G   1% /run/user/1892892083

Supporting Log Snippets

level=info time=2024-03-20T00:07:09Z msg="Agent version associated with task model in boltdb 1.75.0 is bigger or equal to threshold 1.0.0. Skipping transformation."

level=critical time=2024-03-20T00:07:09Z msg="Error loading previously saved state: failed to load previous data from BoltDB: failed to load task engine state: did not find the task of container XXXX: arn:aws:ecs:REGION:1111111111:task/XXXX/06548beea8f34300a560e8aa2e660cb" module=agent.go

pzcfoo · Mar 20, 2024

Hi @pzcfoo,

Is this issue still occurring, or were you able to get the agent to start? It is caused by a small edge case that corrupts task and container information while the agent is terminating. We are tracking this issue internally. As a temporary mitigation, you could stop the tasks on the instance before upgrading the agent. If the agent still does not start, I would suggest setting up the external instance with ECS from scratch, if that is feasible.
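A rough sketch of that mitigation, assuming a typical Ubuntu external instance with an apt-based install; the cluster name and container-instance ARN below are placeholders, not details from this issue:

# From a machine with AWS CLI access: stop the tasks on this container instance first
CLUSTER="my-cluster"                          # assumption: your cluster name
INSTANCE="container-instance-id-or-arn"       # assumption: this host's container instance
for TASK in $(aws ecs list-tasks --cluster "$CLUSTER" --container-instance "$INSTANCE" \
              --query 'taskArns[]' --output text); do
  aws ecs stop-task --cluster "$CLUSTER" --task "$TASK" >/dev/null
done

# Then, on the instance: upgrade the agent package and restart the service
sudo systemctl stop ecs
sudo apt-get update && sudo apt-get install --only-upgrade amazon-ecs-init
sudo systemctl start ecs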

Thank you

hozkaya2000 · Apr 13, 2024

Hi @hozkaya2000, reinstalling the ECS agent (including deleting all related files) and registering the instance with the cluster again fixed the issue. On other hosts, stopping all tasks before upgrading prevented the problem. Thanks
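For reference, a hedged sketch of that clean-reinstall path on Ubuntu; the state directory and the re-registration step are assumptions based on a standard amazon-ecs-init install (the actual registration command for an external instance comes from the ECS console's external-instance registration flow):

sudo systemctl stop ecs
sudo apt-get purge -y amazon-ecs-init

# Remove the persisted agent state; the saved BoltDB state lives under /var/lib/ecs/data (assumption)
sudo rm -rf /var/lib/ecs/data

# Reinstall the package, then register the instance with the cluster again before starting the service
sudo apt-get install -y amazon-ecs-init
sudo systemctl start ecs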

pzcfoo · Apr 16, 2024

We released a permanent fix for this issue in https://github.com/aws/amazon-ecs-agent/releases/tag/v1.82.3. Please reopen the issue if you see it again. :)

Thank you!

amogh09 · May 6, 2024

Fixed in https://github.com/aws/amazon-ecs-agent/pull/3987

amogh09 · May 6, 2024