amazon-linux-2023
[Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory)
Describe the bug
When creating an EC2 instance that uses userData to install Docker, the command yum install docker -y fails roughly 9 times out of 10 with the following error:
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The dependency name in that path varies; it is not always the same package.
Out of 30 instances created, only about 4 installed Docker successfully, so occasionally, for whatever reason, the error does not occur.
If I add a sleep of 10s inside the bash script before the yum install docker -y command, the installation works without any problem.
If I put the yum install docker -y command in the bash script twice, the installation works.
If I SSH into the created EC2 instance and manually run sudo yum install docker -y, the installation works.
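A minimal sketch of the sleep workaround (assuming the same userData structure as in the repro below; the 10-second delay is only the value that happened to work in testing, not a guaranteed safe margin):
#!/bin/bash
set -x
# Empirical workaround: delay before the first package transaction so that
# whatever is holding the package manager busy at boot can finish first.
sleep 10
yum install -y docker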
To Reproduce
ImageId: ami-0f61de2873e29e866
InstanceType: "t2.micro"
userData script:
#!/bin/bash
set -x
yum update -y
yum upgrade -y
yum install -y docker
systemctl start docker
systemctl enable docker
usermod -aG docker ec2-user
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-$(uname -s)-$(uname -m) -O /usr/bin/docker-compose
chmod +x /usr/bin/docker-compose
Logs (cat /var/log/cloud-init-output.log):
Cloud-init v. 22.2.2 running 'modules:config' at Mon, 10 Jul 2023 14:05:45 +0000. Up 11.41 seconds.
Cloud-init v. 22.2.2 running 'modules:final' at Mon, 10 Jul 2023 14:05:46 +0000. Up 12.60 seconds.
+ yum update -y
Amazon Linux 2023 repository 21 MB/s | 15 MB 00:00
Amazon Linux 2023 Kernel Livepatch repository 272 kB/s | 158 kB 00:00
Last metadata expiration check: 0:00:01 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum upgrade -y
Last metadata expiration check: 0:00:04 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum install -y docker
Last metadata expiration check: 0:00:06 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
================================================================================
Package Arch Version Repository Size
================================================================================
Installing:
docker x86_64 20.10.23-1.amzn2023.0.1 amazonlinux 42 M
Installing dependencies:
containerd x86_64 1.6.19-1.amzn2023.0.1 amazonlinux 31 M
iptables-libs x86_64 1.8.8-3.amzn2023.0.2 amazonlinux 401 k
iptables-nft x86_64 1.8.8-3.amzn2023.0.2 amazonlinux 183 k
libcgroup x86_64 3.0-1.amzn2023.0.1 amazonlinux 75 k
libnetfilter_conntrack x86_64 1.0.8-2.amzn2023.0.2 amazonlinux 58 k
libnfnetlink x86_64 1.0.1-19.amzn2023.0.2 amazonlinux 30 k
libnftnl x86_64 1.2.2-2.amzn2023.0.2 amazonlinux 84 k
pigz x86_64 2.5-1.amzn2023.0.3 amazonlinux 83 k
runc x86_64 1.1.7-1.amzn2023.0.1 amazonlinux 3.0 M
Transaction Summary
================================================================================
Install 10 Packages
Total download size: 77 M
Installed size: 300 M
Downloading Packages:
(1/10): libnetfilter_conntrack-1.0.8-2.amzn2023 336 kB/s | 58 kB 00:00
(2/10): libnftnl-1.2.2-2.amzn2023.0.2.x86_64.rp 440 kB/s | 84 kB 00:00
(3/10): iptables-nft-1.8.8-3.amzn2023.0.2.x86_6 3.8 MB/s | 183 kB 00:00
(4/10): libcgroup-3.0-1.amzn2023.0.1.x86_64.rpm 909 kB/s | 75 kB 00:00
(5/10): iptables-libs-1.8.8-3.amzn2023.0.2.x86_ 4.6 MB/s | 401 kB 00:00
(6/10): libnfnetlink-1.0.1-19.amzn2023.0.2.x86_ 616 kB/s | 30 kB 00:00
(7/10): pigz-2.5-1.amzn2023.0.3.x86_64.rpm 1.2 MB/s | 83 kB 00:00
(8/10): runc-1.1.7-1.amzn2023.0.1.x86_64.rpm 11 MB/s | 3.0 MB 00:00
(9/10): docker-20.10.23-1.amzn2023.0.1.x86_64.r 29 MB/s | 42 MB 00:01
(10/10): containerd-1.6.19-1.amzn2023.0.1.x86_6 18 MB/s | 31 MB 00:01
--------------------------------------------------------------------------------
Total 32 MB/s | 77 MB 00:02
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'yum clean packages'.
+ systemctl start docker
Failed to start docker.service: Unit docker.service not found.
+ systemctl enable docker
Failed to enable unit: Unit file docker.service does not exist.
+ usermod -aG docker ec2-user
usermod: group 'docker' does not exist
++ uname -s
++ uname -m
+ wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64 -O /usr/bin/docker-compose
--2023-07-10 14:06:08-- https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream [following]
--2023-07-10 14:06:09-- https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44953600 (43M) [application/octet-stream]
Saving to: ‘/usr/bin/docker-compose’
0K .......... .......... .......... .......... .......... 0% 15.4M 3s
50K .......... .......... .......... .......... .......... 0% 13.9M 3s
100K .......... .......... .......... .......... .......... 0% 13.6M 3s
[...]
43850K .......... .......... .......... .......... ..........100% 83.0M 0s
43900K 100% 0.00 =1.1s
2023-07-10 14:06:10 (39.2 MB/s) - ‘/usr/bin/docker-compose’ saved [44953600/44953600]
+ chmod +x /usr/bin/docker-compose
Cloud-init v. 22.2.2 finished at Mon, 10 Jul 2023 14:06:11 +0000. Datasource DataSourceEc2. Up 37.29 seconds
Well that's interesting!
Thanks for the report, certainly something we're going to have to dive into and understand what's going on.
Out of interest, have you tried any larger instance sizes than t2.micro? Any chance there's something in dmesg about running out of memory or anything? (no need to burn a bunch of your EC2 $ on this if you haven't tested; we can certainly test at scale with the steps you have provided)
Hello Stewart, thanks for the reply. I have just created 5 t3.medium instances, and on all of them Docker installed successfully, without any problem and without having to use the sleep command :thumbsup:
cc: @stewartsmith
Just to contribute, we had occasional failures with a related but different error message, and we found that a sleep was required and sufficient to fix the problem. This is on a t3.large.
The error message we got if we tried the dnf install too early was
"can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"
Here's the relevant part of our userdata script, with our workaround:
# Script written for Amazon Linux 2023
printf "\n*** $(date +'%Y-%m-%d %H:%M:%S') INITIAL SETUP STARTING\n"
uname -a
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Checking for and applying OS updates\n"
system_release=$(rpm -q system-release --qf "%{VERSION}")
printf "\n*** Current release ${system_release}\n"
dnf check-release-update
dnf check-update --releasever=${system_release}
dnf update -y --releasever=${system_release}
# sleep, in case of this error:
# "can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"
sleep 60
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Installing additional packages\n"
dnf -y install {{AMI_PACKAGES}}
It looks like the userData script is executed as soon as the instance is running, before it has finished initialising, rather than waiting for it to reach 'OK' status. Someone also reported a similar problem in this StackOverflow post.
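For what it's worth, if you orchestrate instance creation externally, the standard AWS CLI waiter can at least tell you when the instance has passed its status checks; note it does not delay userData, which cloud-init runs much earlier (the instance ID below is a placeholder):
# Wait for the instance to pass both status checks.
# This does NOT postpone userData execution; cloud-init runs userData
# well before the instance reaches 'OK' status.
aws ec2 wait instance-status-ok --instance-ids i-0123456789abcdef0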
I had a similar issue and this is what I did to fix it
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-84ef13e8f4afd0b4/packages/libsepol-devel-3.4-3.amzn2023.0.3.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Fix
sudo dnf upgrade --refresh rpm glibc
sudo rm /var/lib/rpm/.rpm.lock
dnf -y update
dnf install <MY PACKAGES>
@bbenson29 Had any issues with that? A bit hesitant to delete a lock. Would it be better to add a sleep 60 to ensure cloud-init is done?
Same problem here. Why is this still not fixed?
This issue even happens with simple cloud init config where you directly use the packages directive without custom scripts.
AWS Support recommended the following. I agree it seems like there is an issue and we shouldn't need this (we didn't in v1 or v2), but it's been working for me.
while true; do
  dnf install --assumeyes docker && break
done
while true; do
  dnf update --assumeyes && break
done
I've been trying to get to the bottom of this issue over the last 48 hours and I think it is caused by the SSM agent updater running at the same time as the user data scripts.
See: https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent-automatic-updates.html#ssm-agent-automatic-updates-console
I've disabled this in SSM Fleet Manager and the issue has gone away.
If you do disable the automatic SSM agent updates, it's important that you know the consequences of this and implement the updates in another way!
Edit: I'm specifically referring to the RPM lock issue btw...
The malfunction is apparently triggered by the Auto update SSM agent function. The SSM automation document AWS-UpdateSSMAgent is then executed on the EC2 instance, which eventually leads to this error when other YUM/DNF commands are processed at the same time:
RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable) Error: Could not run transaction.
A quick and dirty workaround is to temporarily disable the Auto update SSM agent feature (try it at your own risk; it is not generally recommended).
Systems Manager > Fleet Manager > Settings > Auto update SSM agent -> Disable
After that, the error will no longer occur. As long as AWS does not have a bugfix here, the workaround of @supergibbs makes the most sense, but leads to multiple entries under yum history.
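If you'd rather not disable auto-update fleet-wide, a hedged alternative is to have userData wait until the agent updater appears to have finished. The process-name pattern below is an assumption, not a documented name; verify it against ps output on your AMI before relying on it:
# Wait up to ~2 minutes while anything matching the SSM agent updater
# seems to be running. The 'amazon-ssm-agent.*updat' pattern is a guess.
timeout=120
while pgrep -f 'amazon-ssm-agent.*updat' >/dev/null 2>&1 && [ "$timeout" -gt 0 ]; do
  sleep 2
  timeout=$((timeout - 2))
done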
We've passed this along to the SSM team and they're tracking it in an internal ticket. I'll try and keep an eye on it.
Any update here? Having a real blast with 2023-powered EB engines.
@stewartsmith Would there be any update?
Surely the low number of thumbs-up here is more a reflection of how many people aren't aware of the issue, or have simply been forced to work around it.
To add yet another workaround: I ended up using the minimal AMI, installed aws-cfn-bootstrap via UserData, and then installed both docker and amazon-ssm-agent as part of cloud-init, making sure to start the amazon-ssm-agent last.
In our project we were starting SSM early on the minimal AMI. We added this snippet to our user data (whitespace added for readability):
systemctl enable amazon-ssm-agent.service --now
# Give service a moment to start to create its first logs
sleep 2
# Watch the logs for the completion of self-update
stdbuf -i0 -o0 -e0 tail -n +0 -f /var/log/amazon/ssm/amazon-ssm-agent.log \
| awk -e '
/"awsupdateSsmAgent":/,/("status":)/ {
if (/"(Success|Skipped)"/) {
print "awsupdateSsmAgent: " $0;
exit
}
}
'
stdbuf -i0 -o0 -e0 tail -f ... gives an unbuffered tail; the awk script is a multi-line string match that exits on the first occurrence of status:Success or status:Skipped.
We were using just a plain sleep 10; however, we had some random timing issues crop up when 10 seconds wasn't enough. Rather than increasing the delay to be safe, we changed to watching the logs for a key event.
Edit: recent update turns this into a "waiting for other to finish" deadlock.
I'm having the same problem installing the CloudWatch agent. I download the rpm file, then my dnf install runs while dnf clean all runs in the background, which breaks the dnf install command, and my machines never finish building. With a sleep before my dnf install command the machines build, but I'd like to avoid that: one key benefit of using an ASG is that a new machine is built quickly, not after waiting 60 seconds for no good reason.
Just adding some steps to easily replicate the issue.
Add the userdata as a cloud-config:
#cloud-config
repo_update: true
repo_upgrade: true
package_reboot_if_required: true
packages:
- docker
- postgresql15
- python3-boto3
This randomly gives the error below:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Could not run transaction.
Recently, this has been happening more frequently.
We are using
Linux ip-10-42-102-106.ap-southeast-2.compute.internal 6.1.79-99.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 27 18:02:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
cloud-init version is 22.2.2
Here is perhaps another hotfix, which works well for me:
- stop ssm agent in bootcmd
- do your stuff
- at the end of runcmd start ssm again
#cloud-config
package_update: true
package_upgrade: true
bootcmd:
- systemctl stop amazon-ssm-agent
runcmd:
- ... your stuff here
- systemctl start amazon-ssm-agent # at the end restart
Any update on this issue? I put in a hack with a sleep 30 command, and it worked before. Now this hack doesn't work.
We've also successfully been using the workaround posted above but something further seems to have changed in the last couple of weeks and we've had to apply it to more scripts. Fortunately it does still seem to work although our simple setup scripts are getting messier!
@BlackDark's workaround could cause issues when rebooting the instance, as bootcmd runs on every boot, but runcmd only runs when starting the instance for the first time, meaning the SSM Agent won't start back up. This workaround worked for us, and restarts the SSM agent even when rebooting:
# Workaround for this issue: https://github.com/amazonlinux/amazon-linux-2023/issues/397
# Stops the SSM Agent from updating during the cloud-init process while we're trying to install packages, then restarts once cloud-init is complete.
bootcmd:
  - systemctl stop amazon-ssm-agent
write_files:
  - path: /var/lib/cloud/scripts/per-boot/startSsmAgent.sh
    permissions: "0755" # Allow owner to read/write/execute, others to read/execute
    content: |
      #!/bin/sh
      systemctl start amazon-ssm-agent
I've had a similar issue on Ubuntu for a long time; the only way for me was to wait for the rpm lock to be released and then do the rest on my own:
#cloud-config
package_update: false
package_upgrade: false
runcmd:
- while fuser /var/lib/rpm/.rpm.lock > /dev/null 2>&1 ; do sleep 1 ; done
- dnf install -y docker
- systemctl enable docker.service
- systemctl start docker.service
- usermod -a -G docker ec2-user
@avoidik Thank you for posting however that solution doesn't work for me on a t3a.small.
@mg-alanjones since it's just a workaround, you're free to experiment with it: for instance, try increasing the sleep timeout, waiting on some other mutex, or perhaps just wait for the Amazon folks to fix this problem. Cheers
This workaround works, but seems a bit aggressive. A kinder, gentler approach:
until dnf install -y docker; do
  echo "dnf docker install failed, sleeping and retrying..."
  sleep 10s
done
Also: come on AWS, how long is this going to continue to be an issue?
We have also experienced this issue on a bunch of R7_G_8_XLARGE instances, so I don't think instance size has anything to do with it. @vjarnot's solution looks the neatest for now.
Presumably we must wrap every dnf command like this for now.
@vjarnot's solution works. We use a dnf install wrapper bash script with incremental back-off for any package installation until this is fixed, see below:
#!/bin/bash
attempt=0
max_attempts=10
until dnf install -y "$@"; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "Failed to install $* after $max_attempts attempts. Exiting."
    exit 1
  fi
  echo "dnf install of $* failed (attempt $attempt/$max_attempts), retrying in $attempt seconds..."
  sleep "$attempt"
done
echo "$* installed successfully."
Invocation example: ./<dnf_install_wrapper_script_name> <package_name> [<package_name_2>, ...]
The newest version of SSM Agent releasing over the next couple of weeks should wait for cloud-init for up to 10 minutes. https://github.com/aws/amazon-ssm-agent/releases/tag/3.3.987.0
> The newest version of SSM Agent releasing over the next couple of weeks should wait for cloud-init for up to 10 minutes. https://github.com/aws/amazon-ssm-agent/releases/tag/3.3.987.0
This is not a solution. It just punts the problem down the line: anything that performs package management using rpm or dnf later (after cloud-init) is still at risk of interference from SSM Agent's bad interactions with the RPM database.
SSM Agent must learn to co-operate with package managers (I believe there are also problems with its interaction with apt/dpkg on Ubuntu).
To add a little clarity here: the ssm-agent update suggested above relies on /usr/bin/cloud-init status --wait
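For context, that command blocks until cloud-init reports completion; a quick sketch of what the agent is effectively doing (run from an SSH session, not from inside userData, since cloud-init itself executes userData):
# Block until cloud-init finishes, then print its final status.
# Exits non-zero if cloud-init completed with errors.
cloud-init status --wait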
> This is not a solution.
> SSM Agent must learn to co-operate with package managers
On a more positive note, amazon-ssm-agent should now be using yum for upgrades (since June 27 2024, 3.3.551.0), which I would expect to resolve this fully even without 'wait for cloud-init' (yum/dnf should play nicely with other parallel invocations).
That version does not appear to be fully released, however, with the latest available on Amazon Linux 2023 being 3.3.380.0 currently.
Waiting in anticipation 😄