
[Bug] - SSM Auto Update conflicts with cloud-init package install (e.g. userData Docker installation fails on EC2. [Errno 2] No such file or directory)

Open ambrosdavid opened this issue 2 years ago • 31 comments

Describe the bug
When creating an EC2 instance that uses userData to set up Docker, the command yum install docker -y fails about 9 times out of 10 with the following error: [Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'. The dependency named in that path varies from run to run.

Out of 30 instances created, only about 4 had Docker installed successfully, so occasionally the failure does not occur for some reason.

If I add a 10-second sleep in the bash script before the yum install docker -y command, the installation works without any problem. If I put the yum install docker -y command in the bash script twice, the installation works. If I SSH into the created EC2 instance and run sudo yum install docker -y manually, the installation works.
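
For reference, a sketch of the sleep workaround applied to the beginning of the reproduction script below (10 seconds is simply the value that happened to work in my tests):

#!/bin/bash
set -x
yum update -y
yum upgrade -y
# workaround: without this pause, yum install docker fails roughly 9 times out of 10
sleep 10
yum install -y docker
# ... rest of the script unchanged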

To Reproduce
ImageId: ami-0f61de2873e29e866
InstanceType: t2.micro

userData script:

#!/bin/bash
set -x
yum update -y
yum upgrade -y
yum install -y docker
systemctl start docker
systemctl enable docker
usermod -aG docker ec2-user
wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-$(uname -s)-$(uname -m) -O /usr/bin/docker-compose
chmod +x /usr/bin/docker-compose

Logs
cat /var/log/cloud-init-output.log >>>>>>>>

Cloud-init v. 22.2.2 running 'modules:config' at Mon, 10 Jul 2023 14:05:45 +0000. Up 11.41 seconds.
Cloud-init v. 22.2.2 running 'modules:final' at Mon, 10 Jul 2023 14:05:46 +0000. Up 12.60 seconds.
+ yum update -y
Amazon Linux 2023 repository                     21 MB/s |  15 MB     00:00    
Amazon Linux 2023 Kernel Livepatch repository   272 kB/s | 158 kB     00:00    
Last metadata expiration check: 0:00:01 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum upgrade -y
Last metadata expiration check: 0:00:04 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
Nothing to do.
Complete!
+ yum install -y docker
Last metadata expiration check: 0:00:06 ago on Mon Jul 10 14:05:58 2023.
Dependencies resolved.
================================================================================
 Package                 Arch    Version                     Repository    Size
================================================================================
Installing:
 docker                  x86_64  20.10.23-1.amzn2023.0.1     amazonlinux   42 M
Installing dependencies:
 containerd              x86_64  1.6.19-1.amzn2023.0.1       amazonlinux   31 M
 iptables-libs           x86_64  1.8.8-3.amzn2023.0.2        amazonlinux  401 k
 iptables-nft            x86_64  1.8.8-3.amzn2023.0.2        amazonlinux  183 k
 libcgroup               x86_64  3.0-1.amzn2023.0.1          amazonlinux   75 k
 libnetfilter_conntrack  x86_64  1.0.8-2.amzn2023.0.2        amazonlinux   58 k
 libnfnetlink            x86_64  1.0.1-19.amzn2023.0.2       amazonlinux   30 k
 libnftnl                x86_64  1.2.2-2.amzn2023.0.2        amazonlinux   84 k
 pigz                    x86_64  2.5-1.amzn2023.0.3          amazonlinux   83 k
 runc                    x86_64  1.1.7-1.amzn2023.0.1        amazonlinux  3.0 M

Transaction Summary
================================================================================
Install  10 Packages

Total download size: 77 M
Installed size: 300 M
Downloading Packages:
(1/10): libnetfilter_conntrack-1.0.8-2.amzn2023 336 kB/s |  58 kB     00:00    
(2/10): libnftnl-1.2.2-2.amzn2023.0.2.x86_64.rp 440 kB/s |  84 kB     00:00    
(3/10): iptables-nft-1.8.8-3.amzn2023.0.2.x86_6 3.8 MB/s | 183 kB     00:00    
(4/10): libcgroup-3.0-1.amzn2023.0.1.x86_64.rpm 909 kB/s |  75 kB     00:00    
(5/10): iptables-libs-1.8.8-3.amzn2023.0.2.x86_ 4.6 MB/s | 401 kB     00:00    
(6/10): libnfnetlink-1.0.1-19.amzn2023.0.2.x86_ 616 kB/s |  30 kB     00:00    
(7/10): pigz-2.5-1.amzn2023.0.3.x86_64.rpm      1.2 MB/s |  83 kB     00:00    
(8/10): runc-1.1.7-1.amzn2023.0.1.x86_64.rpm     11 MB/s | 3.0 MB     00:00    
(9/10): docker-20.10.23-1.amzn2023.0.1.x86_64.r  29 MB/s |  42 MB     00:01    
(10/10): containerd-1.6.19-1.amzn2023.0.1.x86_6  18 MB/s |  31 MB     00:01    
--------------------------------------------------------------------------------
Total                                            32 MB/s |  77 MB     00:02     
[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-db3877fdc20f892f/packages/libnetfilter_conntrack-1.0.8-2.amzn2023.0.2.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'yum clean packages'.
+ systemctl start docker
Failed to start docker.service: Unit docker.service not found.
+ systemctl enable docker
Failed to enable unit: Unit file docker.service does not exist.
+ usermod -aG docker ec2-user
usermod: group 'docker' does not exist
++ uname -s
++ uname -m
+ wget https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64 -O /usr/bin/docker-compose
--2023-07-10 14:06:08--  https://github.com/docker/compose/releases/download/v2.15.1/docker-compose-Linux-x86_64
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream [following]
--2023-07-10 14:06:09--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/15045751/55771899-fdc1-4531-974a-0b71aea19e15?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20230710%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20230710T140609Z&X-Amz-Expires=300&X-Amz-Signature=d3b008815385781f37ea4890bd1fc8dc49f04ae4cd27fcc6f1f5d44ebe1fbfc4&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=15045751&response-content-disposition=attachment%3B%20filename%3Ddocker-compose-linux-x86_64&response-content-type=application%2Foctet-stream
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44953600 (43M) [application/octet-stream]
Saving to: ‘/usr/bin/docker-compose’

     0K .......... .......... .......... .......... ..........  0% 15.4M 3s
    50K .......... .......... .......... .......... ..........  0% 13.9M 3s
   100K .......... .......... .......... .......... ..........  0% 13.6M 3s
  [...]
 43850K .......... .......... .......... .......... ..........100% 83.0M 0s
 43900K                                                       100% 0.00 =1.1s

2023-07-10 14:06:10 (39.2 MB/s) - ‘/usr/bin/docker-compose’ saved [44953600/44953600]

+ chmod +x /usr/bin/docker-compose
Cloud-init v. 22.2.2 finished at Mon, 10 Jul 2023 14:06:11 +0000. Datasource DataSourceEc2.  Up 37.29 seconds

ambrosdavid avatar Jul 10 '23 14:07 ambrosdavid

Well that's interesting!

Thanks for the report, certainly something we're going to have to dive into and understand what's going on.

Out of interest, have you tried any larger instance sizes than t2.micro? Any chance there's something in dmesg about running out of memory or anything? (No need to burn a bunch of your EC2 $ on this if you haven't tested; we can certainly test at scale with the steps you have provided.)
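
For anyone checking on the OOM question, something like this should surface out-of-memory events in the kernel log (a sketch to run on an affected instance):

# look for OOM-killer activity around the time of the failed install
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'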

stewartsmith avatar Jul 10 '23 16:07 stewartsmith

Hello Stewart, thanks for the reply. I just tried creating 5 t3.medium instances, and on all of them Docker installed successfully, without any problem and without having to use the sleep command :thumbsup:

ambrosdavid avatar Jul 12 '23 16:07 ambrosdavid

cc: @stewartsmith Just to contribute, we had occasional failures with a related but different error message, and we found that a sleep was required and sufficient to fix the problem. This is on a t3.large.

The error message we got if we tried the dnf install too early was "can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"

Here's the relevant part of our userdata script, with our workaround:

# Script written for Amazon Linux 2023

printf "\n*** $(date +'%Y-%m-%d %H:%M:%S') INITIAL SETUP STARTING\n"
uname -a

printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Checking for and applying OS updates\n"
system_release=$(rpm -q system-release --qf "%{VERSION}")
printf "\n*** Current release ${system_release}\n"
dnf check-release-update
dnf check-update --releasever=${system_release}
dnf update -y --releasever=${system_release}

# sleep, in case of this error:
# "can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)"
sleep 60
printf "\n\n*** $(date +'%Y-%m-%d %H:%M:%S') Installing additional packages\n"
dnf -y install {{AMI_PACKAGES}}

mwebber avatar Jul 28 '23 15:07 mwebber

It looks like the userData script is executed as soon as the instance is running, before it has finished initialising, instead of waiting for it to reach 'OK' status. Someone also reported a similar problem in this StackOverflow post.

ambrosdavid avatar Jul 28 '23 15:07 ambrosdavid

I had a similar issue and this is what I did to fix it

[Errno 2] No such file or directory: '/var/cache/dnf/amazonlinux-84ef13e8f4afd0b4/packages/libsepol-devel-3.4-3.amzn2023.0.3.x86_64.rpm'
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.

Fix

sudo dnf upgrade --refresh rpm glibc
sudo rm /var/lib/rpm/.rpm.lock
dnf -y update
dnf install  <MY PACKAGES>

bbenson29 avatar Aug 21 '23 01:08 bbenson29

@bbenson29 Had any issues with that? A bit hesitant to delete a lock. Would it be better to add a sleep 60 to ensure cloud-init is done?

supergibbs avatar Sep 29 '23 18:09 supergibbs

Same problem here. Why is this still not fixed? The issue even happens with a simple cloud-init config that uses the packages directive directly, without custom scripts.

BlackDark avatar Oct 12 '23 14:10 BlackDark

AWS Support recommended the following. I agree there seems to be an issue and we shouldn't need this (we didn't in v1 or v2), but it's been working for me.

while true; do
dnf install --assumeyes docker && break
done

while true; do
dnf update --assumeyes && break
done

supergibbs avatar Oct 12 '23 18:10 supergibbs

I've been trying to get to the bottom of this issue over the last 48 hours and I think it is caused by the SSM agent updater running at the same time as the user data scripts.

See: https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent-automatic-updates.html#ssm-agent-automatic-updates-console

I've disabled this in SSM Fleet Manager and the issue has gone away.

If you do disable the automatic SSM agent updates, it's important that you know the consequences of this and implement the updates in another way!
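
If you do go down that road, one way to keep the agent updated on your own schedule is a State Manager association running the AWS-UpdateSSMAgent document (the same document the auto-update feature uses). A rough sketch, where the target tag and schedule are placeholders:

aws ssm create-association \
    --name "AWS-UpdateSSMAgent" \
    --targets "Key=tag:environment,Values=Production" \
    --schedule-expression "rate(14 days)"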

Edit: I'm specifically referring to the RPM lock issue btw...

rbpltr avatar Oct 18 '23 15:10 rbpltr

The malfunction is apparently triggered by the Auto update SSM agent function. The SSM automation document AWS-UpdateSSMAgent is then executed on the EC2 instance, which eventually leads to this error when other YUM/DNF commands are processed at the same time:

RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable) Error: Could not run transaction.

A quick and dirty workaround is to temporarily disable the Auto update SSM agent feature (try it at your own risk; it is not generally recommended).

Systems Manager > Fleet Manager > Settings > Auto update SSM agent -> Disable

After that, the error will no longer occur. Until AWS ships a bugfix here, the workaround from @supergibbs makes the most sense, although it leads to multiple entries in the yum history.

AB-DBMC avatar Oct 18 '23 16:10 AB-DBMC

We've passed this along to the SSM team and they're tracking it in an internal ticket. I'll try and keep an eye on it.

stewartsmith avatar Oct 20 '23 17:10 stewartsmith

Any update here? Having a real blast with 2023-powered EB engines.

BenCoffeed avatar Dec 11 '23 17:12 BenCoffeed

@stewartsmith Would there be any update?

Surely the low number of thumbs up here is more a reflection of people who aren't aware of the issue or have already been forced to work around it.

To add yet another workaround: I ended up using the minimal AMI, installed aws-cfn-bootstrap via UserData, and then installed both docker and amazon-ssm-agent as part of cloud-init, making sure to start amazon-ssm-agent last.

hastarin avatar Jan 18 '24 06:01 hastarin

In our project we were starting SSM early on the minimal AMI. We added this snippet to our user data (whitespace added for readability):

systemctl enable amazon-ssm-agent.service --now
# Give service a moment to start to create its first logs
sleep 2 
# Watch the logs for the completion of self-update
stdbuf -i0 -o0 -e0 tail -n +0 -f /var/log/amazon/ssm/amazon-ssm-agent.log \
    | awk -e '
        /"awsupdateSsmAgent":/,/("status":)/ {
            if (/"(Success|Skipped)"/) {
                print "awsupdateSsmAgent: " $0;
                exit
            }
        }
    '
  • stdbuf -i0 -o0 -e0 tail -f ... - unbuffered tail
  • awk ... - a multi-line string match, with an exit on the first match of status: Success or status: Skipped.

We had been using just a plain sleep 10, but some random timing issues cropped up where 10 seconds was sometimes not enough. Rather than increasing the delay to be safe, we switched to watching the logs for a key event.

Edit: a recent update turns this into a "waiting for the other to finish" deadlock.

rrehbein avatar Jan 19 '24 13:01 rrehbein

I'm having the same problem installing the CloudWatch agent. I download the RPM file, then my dnf install runs while dnf clean all runs in the background, which breaks the dnf install command and my machines never finish building. With a sleep before my dnf install command the machines build, but I'd rather not have that: one key benefit of using an ASG is that a new machine is built quickly, not after waiting 60 seconds for no good reason.
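
A hedged sketch of retrying instead of sleeping in that situation (the local RPM filename is a placeholder for whatever gets downloaded):

# retry the CloudWatch agent install until the RPM/dnf lock is free
until dnf install -y ./amazon-cloudwatch-agent.rpm; do
    echo "dnf appears busy (likely the SSM agent self-update), retrying in 5s..."
    sleep 5
done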

andreverheij avatar Jan 24 '24 00:01 andreverheij

Just adding some steps to easily replicate the issue.

Add the following userdata as a cloud-config:

#cloud-config

repo_update: true
repo_upgrade: true
package_reboot_if_required: true

packages:
  - docker
  - postgresql15
  - python3-boto3

This randomly gives the error below:

Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
RPM: error: can't create transaction lock on /var/lib/rpm/.rpm.lock (Resource temporarily unavailable)
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
Error: Could not run transaction.

Recently, the failures have become more frequent.

We are using

Linux ip-10-42-102-106.ap-southeast-2.compute.internal 6.1.79-99.164.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 27 18:02:23 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

cloud-init version is 22.2.2

singlewind avatar Mar 08 '24 00:03 singlewind

Another possible hotfix, which works well for me:

  • stop the SSM agent in bootcmd
  • do your stuff
  • at the end of runcmd, start the SSM agent again

package_update: true
package_upgrade: true

bootcmd:
  - systemctl stop amazon-ssm-agent

runcmd:
  - ... your stuff here
  - systemctl start amazon-ssm-agent # at the end restart

BlackDark avatar Mar 08 '24 10:03 BlackDark

Any update on this issue? I put in a hack with a 'sleep 30' command, and it worked before. Now this hack doesn't work.

xu-lei-richard avatar Mar 28 '24 12:03 xu-lei-richard

We've also successfully been using the workaround posted above but something further seems to have changed in the last couple of weeks and we've had to apply it to more scripts. Fortunately it does still seem to work although our simple setup scripts are getting messier!

rowanbeentje avatar Apr 09 '24 08:04 rowanbeentje

@BlackDark's workaround could cause issues when rebooting the instance, as bootcmd runs on every boot, but runcmd only runs when starting the instance for the first time, meaning the SSM Agent won't start back up. This workaround worked for us, and restarts the SSM agent even when rebooting:

# Workaround for this issue: https://github.com/amazonlinux/amazon-linux-2023/issues/397
# Stops the SSM Agent from updating during the cloud-init process while we're trying to install packages, then restarts once cloud-init is complete.
bootcmd:
  - systemctl stop amazon-ssm-agent
write_files:
  - path: /var/lib/cloud/scripts/per-boot/startSsmAgent.sh
    permissions: "0755" # Allow owner to read/write/execute, others to read/execute
    content: |
      #!/bin/sh
      systemctl start amazon-ssm-agent

Booligoosh avatar Apr 24 '24 04:04 Booligoosh

I've had a similar issue on ubuntu for a long time; the only way for me was to wait for the rpm lock to be released and then do the rest on my own:

#cloud-config

package_update: false
package_upgrade: false

runcmd:
  - while fuser /var/lib/rpm/.rpm.lock > /dev/null 2>&1 ; do sleep 1 ; done
  - dnf install -y docker
  - systemctl enable docker.service
  - systemctl start docker.service
  - usermod -a -G docker ec2-user

avoidik avatar May 05 '24 19:05 avoidik

@avoidik Thank you for posting; however, that solution doesn't work for me on a t3a.small.

mg-alanjones avatar May 10 '24 01:05 mg-alanjones

@mg-alanjones Since it's just a workaround, you're free to experiment with it: for instance, try increasing the sleep timeout, waiting on some other mutex, or perhaps just wait for the Amazon folks to fix this problem. Cheers.

avoidik avatar May 10 '24 17:05 avoidik

This workaround works, but seems a bit aggressive. A kinder, gentler approach:

until dnf install -y docker; do
    echo "dnf docker install failed, sleeping and retrying..."
    sleep 10s
done

Also: come on AWS, how long is this going to continue to be an issue?

vjarnot avatar Sep 09 '24 22:09 vjarnot

We have also experienced this issue on a bunch of R7_G_8_XLARGE instances, so I don't think instance size has anything to do with it. @vjarnot's solution looks the neatest for now.

Presumably we must wrap every dnf command like this for now.

sfcoy avatar Sep 12 '24 01:09 sfcoy

@vjarnot's solution works. We use a dnf install wrapper bash script with an increasing back-off for any package installation until this is fixed; see below:

#!/bin/bash
attempt=0
max_attempts=10
until dnf install -y "$@"; do
    attempt=$((attempt + 1))
    if [ $attempt -ge $max_attempts ]; then
        echo "Failed to install $@ after $max_attempts attempts. Exiting."
        exit 1
    fi
    echo "dnf install $@ failed (attempt $attempt/$max_attempts), retrying in $attempt seconds..."
    sleep $attempt
done
echo "$@ installed successfully."

Invocation example: ./<dnf_install_wrapper_script_name> <package_name> [<package_name_2>, ...]

fhuebscher avatar Oct 01 '24 09:10 fhuebscher

The newest version of SSM Agent releasing over the next couple of weeks should wait for cloud-init for up to 10 minutes. https://github.com/aws/amazon-ssm-agent/releases/tag/3.3.987.0

gianniLesl avatar Oct 01 '24 20:10 gianniLesl

The newest version of SSM Agent releasing over the next couple of weeks should wait for cloud-init for up to 10 minutes. https://github.com/aws/amazon-ssm-agent/releases/tag/3.3.987.0

This is not a solution. It just punts the problem down the line, and anything that performs package management using rpm or dnf later (after cloud-init) is still at risk of interference from SSM Agent's bad interactions with the RPM database.

SSM Agent must learn to co-operate with package managers (I believe there are also problems with its interaction with apt/dpkg on Ubuntu).

nosnilmot avatar Oct 02 '24 07:10 nosnilmot

To add a little clarity here: the ssm-agent update suggested above relies on /usr/bin/cloud-init status --wait.
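
That same command can also be used directly: a script that runs outside cloud-init (for example via SSM Run Command or an SSH session shortly after boot) could block on it before touching dnf/rpm. A minimal sketch:

# wait until cloud-init has finished before running package operations
cloud-init status --wait
dnf install -y <MY PACKAGES>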

avoidik avatar Oct 02 '24 09:10 avoidik

This is not a solution.

SSM Agent must learn to co-operate with package managers

On a more positive note, amazon-ssm-agent should now be using yum for upgrades (since June 27 2024, 3.3.551.0), which I would expect to resolve this fully even without the 'wait for cloud-init' change (yum/dnf should play nicely with other parallel invocations).

That version does not appear to be fully released, however, with the latest available on Amazon Linux 2023 being 3.3.380.0 currently.
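
For anyone tracking the rollout, a quick way to compare what the AL2023 repos currently offer with what an instance actually has installed (a sketch):

# what the repositories currently offer
dnf info amazon-ssm-agent
# what is installed on this instance
rpm -q amazon-ssm-agent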

Waiting in anticipation 😄

nosnilmot avatar Oct 03 '24 15:10 nosnilmot