Upgraded ecs agent causes Error loading previously saved state from BoltDB
Summary
Upgraded the ECS agent on an external instance. The ecs service now keeps restarting. The agent fails after this error is logged:
Error loading previously saved state: failed to load previous data from BoltDB: failed to load task engine state: did not find the task of container
Description
Upgraded the ECS agent, but the service keeps restarting. See the Supporting Log Snippets section.
Environment Details
Ubuntu 22.04.2 LTS
ecs agent version
Package: amazon-ecs-init
Version: 1.82.0-1
Status: install ok installed
Priority: optional
Section: misc
Maintainer: ecs-agent-dev <[email protected]>
Installed-Size: 103 MB
Depends: libc6 (>= 2.3.4), systemd, docker-ce (>= 17.12.0) | docker-engine (>= 1.6.0) | docker-ee | docker.io
Homepage: https://aws.amazon.com/ecs
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: Starts the Amazon ECS Agent
amazon-ecs-init may be run to register an EC2 instance as an Amazon ECS
Container Instance.
docker info
Client: Docker Engine - Community
Version: 24.0.5
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.11.2
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.20.2
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 14
Running: 11
Paused: 0
Stopped: 3
Images: 15
Server Version: 24.0.5
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
runc version: v1.1.8-0-g82f18fe
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
Kernel Version: 5.15.0-86-generic
Operating System: Ubuntu 22.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 6
Total Memory: 15.61GiB
Name: nmlcaap135
ID: 48edb6c9-be8d-4bf2-b7ee-fb3b6be57bac
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 1.6G 168M 1.4G 11% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 98G 4.8G 89G 6% /
tmpfs 7.9G 0 7.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 4.0M 0 4.0M 0% /sys/fs/cgroup
/dev/sda2 2.0G 251M 1.6G 14% /boot
/dev/mapper/ubuntu--vg-ubuntu--lv--var 20G 8.1G 11G 44% /var
tmpfs 1.6G 4.0K 1.6G 1% /run/user/1892892083
Supporting Log Snippets
level=info time=2024-03-20T00:07:09Z msg="Agent version associated with task model in boltdb 1.75.0 is bigger or equal to threshold 1.0.0. Skipping transformation."
level=critical time=2024-03-20T00:07:09Z msg="Error loading previously saved state: failed to load previous data from BoltDB: failed to load task engine state: did not find the task of container XXXX: arn:aws:ecs:REGION:1111111111:task/XXXX/06548beea8f34300a560e8aa2e660cb" module=agent.go
Hi @pzcfoo,
Is this issue still occurring, or were you able to get the agent to start? It is caused by a small edge case that corrupts task and container information when the agent is terminating. We are tracking this issue internally. As a temporary mitigation, you could try stopping the tasks on the instance before upgrading the agent. If the agent still does not start, I would suggest setting up the external instance with ECS from scratch, if that is feasible.
Thank you
Hi @hozkaya2000, reinstalling the ECS agent (including deleting all related files) and registering with the cluster again fixed the issue. On other hosts, stopping all tasks before upgrading prevented the problem. Thanks
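For anyone who wants to script the "stop all tasks before upgrading" mitigation, here is a minimal sketch, assuming boto3 is available and credentials are configured. The function name, the default `reason` text, and the identifiers in the usage comment are hypothetical. `list_tasks` and `stop_task` are the standard ECS client operations. This is not the agent team's tooling, just one way the step could be automated.

```python
def stop_tasks_on_instance(ecs, cluster, container_instance,
                           reason="Stopping before ECS agent upgrade"):
    """Request a stop for every running task on one container instance.

    `ecs` is a boto3 ECS client (or any object exposing the same two methods).
    Returns the list of task ARNs for which a stop was requested.
    """
    task_arns = []
    kwargs = {"cluster": cluster, "containerInstance": container_instance}
    # list_tasks is paginated; follow nextToken until every page is read.
    while True:
        page = ecs.list_tasks(**kwargs)
        task_arns.extend(page.get("taskArns", []))
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    # Stop tasks only after listing is complete, so the listing is not
    # affected by tasks changing state halfway through.
    for arn in task_arns:
        ecs.stop_task(cluster=cluster, task=arn, reason=reason)
    return task_arns


# Hypothetical usage (cluster and instance identifiers are placeholders):
#   import boto3
#   ecs = boto3.client("ecs")
#   stop_tasks_on_instance(ecs, "my-cluster", "my-container-instance-id")
```

Note that tasks owned by a service may be started again by the service scheduler, so it is worth confirming that nothing is running on the instance right before the upgrade.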
We released a permanent fix for this issue in https://github.com/aws/amazon-ecs-agent/releases/tag/v1.82.3. Please reopen the issue if you see it again. :)
Thank you!
Fixed in https://github.com/aws/amazon-ecs-agent/pull/3987
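A quick way to check whether a host already has the fixed release is to compare its version string against 1.82.3. The sketch below assumes the `amazon-ecs-init` package version (for example `1.82.0-1` from the environment details above) tracks the agent version, which is an assumption on my part. The helper names are hypothetical.

```python
FIXED_IN = (1, 82, 3)  # release named in this thread as containing the fix


def upstream_version(version):
    """Turn '1.82.0-1' or 'v1.82.3' into a tuple such as (1, 82, 0)."""
    # Drop a leading 'v' and the packaging revision after the first hyphen.
    core = version.strip().lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def has_fix(version):
    """Return True if the version is at or above the fixed release."""
    return upstream_version(version) >= FIXED_IN
```

With the version reported in this issue, `has_fix("1.82.0-1")` evaluates to `False`, so that host would still be exposed to the problem during an upgrade.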