yum gets deadlocked/hung up (indefinitely) waiting for urlgrabber-ext-down

Open brianjmurrell opened this issue 4 years ago • 6 comments

While I appreciate that YUM is now deprecated, it is still the main package manager for EL7, which is where I am hitting an issue: yum hangs indefinitely until it is killed.

The process tree looks like this:

 8702 ?        S      0:05  |       \_ /usr/bin/python /usr/bin/yum -y --disablerepo=* --enablerepo=repo.dc.hpdd.intel.com_repository_*,build.hpdd.intel.com_job_daos-stack* install --exclude openmpi daos-1.1.2.1-1.5456.g02ce0510.el7.x86_64 daos-client-1.1.2.1-1.5456.g02ce0510.el7.x86_64 daos-tests-1.1.2.1-1.5456.g02ce0510.el7.x86_64 daos-server-1.1.2.1-1.5456.g02ce0510.el7.x86_64 openmpi3 hwloc ndctl fio patchutils ior-hpc-daos-0 romio-tests-cart-4-daos-0 testmpio-cart-4-daos-0 mpi4py-tests-cart-4-daos-0 hdf5-mpich2-tests-daos-0 hdf5-openmpi3-tests-daos-0 hdf5-vol-daos-mpich2-tests-daos-0 hdf5-vol-daos-openmpi3-tests-daos-0 MACSio-mpich2-daos-0 MACSio-openmpi3-daos-0 mpifileutils-mpich-daos-0
 8705 ?        S      0:00  |           \_ /usr/bin/python /usr/libexec/urlgrabber-ext-down
 8711 ?        S      0:00  |           \_ /usr/bin/python /usr/libexec/urlgrabber-ext-down
 8712 ?        S      0:00  |           \_ /usr/bin/python /usr/libexec/urlgrabber-ext-down

The state of each process is:

# /tmp/strace -f -p 8702
/tmp/strace: Process 8702 attached
wait4(8711, ^C/tmp/strace: Process 8702 detached
 <detached ...>
# /tmp/strace -f -p 8705
/tmp/strace: Process 8705 attached
read(0, ^C/tmp/strace: Process 8705 detached
 <detached ...>
# /tmp/strace -f -p 8711
/tmp/strace: Process 8711 attached
futex(0x1444c90, FUTEX_WAIT_PRIVATE, 2, NULL^C/tmp/strace: Process 8711 detached
 <detached ...>
# /tmp/strace -f -p 8712
/tmp/strace: Process 8712 attached
futex(0x2174c90, FUTEX_WAIT_PRIVATE, 2, NULL^C/tmp/strace: Process 8712 detached
 <detached ...>

To me this looks like 8702, 8711 and 8705 are deadlocked, each one waiting or blocked on another.

brianjmurrell avatar Dec 18 '20 13:12 brianjmurrell

Just as a heads-up, the read(0, indicates that process 8705 is blocked reading from standard input.

lukash avatar Jan 04 '21 10:01 lukash

@lukash Yes, I do realize that, but why? stdin is likely a pipe to the parent process, which is simply waiting on children.

brianjmurrell avatar Jan 04 '21 12:01 brianjmurrell
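The pattern discussed above (a child blocked in read(0, on a pipe while the parent only waits on its children) can be reproduced in isolation. The following is a minimal sketch, not yum's or urlgrabber's actual code: the child one-liner and the function name child_hangs_while_parent_waits are hypothetical, made up for illustration. It only shows that a child reading a stdin pipe stays blocked for as long as the parent keeps the write end open without sending anything.

```python
import subprocess
import sys

# Hypothetical stand-in for a helper process: it blocks reading one line
# from standard input, which shows up in strace as "read(0, ".
CHILD = "import sys; sys.stdin.readline()"


def child_hangs_while_parent_waits(timeout=1.0):
    """Return True if the child is still blocked on stdin after `timeout` seconds."""
    child = subprocess.Popen([sys.executable, "-c", CHILD], stdin=subprocess.PIPE)
    try:
        # The parent only waits on the child; it never writes to the pipe
        # and never closes it, so the child cannot make progress.
        child.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        # Closing the write end delivers EOF, which unblocks the child's read.
        child.stdin.close()
        child.wait()
```

Calling child_hangs_while_parent_waits() should return True, since the child cannot exit until the parent writes to or closes the pipe. This does not explain why the parent stops feeding its helpers in the reported hang; it only illustrates that the two strace states seen here are consistent with each other.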

I don't know. You haven't really provided a reproducer, so I thought you might want to investigate yourself. This seems like a rare corner case, since you're only hitting it now, long after development has stopped. For the same reason it is likely going to be low priority for us unless the impact turns out to be bigger (even with a reproducer).

lukash avatar Jan 05 '21 10:01 lukash

We're hitting the same issue with one of our ansible playbooks. It definitely does seem to be an edge case because this will run 99 times without issues, but we are seeing this issue periodically.

I'm seeing the same futex waits and reads as reported by Brian.

root      3743  3726  3715  3715  0 15:57 ?        00:00:03                 /usr/bin/python /bin/yum -d 2 -y install container-selinux docker-ce-18.09.7-3.el7
root      3744  3743  3715  3715  0 15:57 ?        00:00:00                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3745  3743  3715  3715  0 15:57 ?        00:00:00                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3746  3743  3715  3715  0 15:57 ?        00:00:00                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3747  3743  3715  3715  0 15:57 ?        00:00:01                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
[root@<HOST> <USER>]# strace -p 3747
strace: Process 3747 attached
read(0, 
^Cstrace: Process 3747 detached
 <detached ...>
[root@<HOST> <USER>]# strace -p 3746
strace: Process 3746 attached
futex(0x26fbb90, FUTEX_WAIT_PRIVATE, 2, NULL
^Cstrace: Process 3746 detached
 <detached ...>
[root@<HOST> <USER>]# strace -p 3745
strace: Process 3745 attached
futex(0x16acb70, FUTEX_WAIT_PRIVATE, 2, NULL
^Cstrace: Process 3745 detached
 <detached ...>
[root@<HOST> <USER>]# strace -p 3744
strace: Process 3744 attached
read(0, 
^Cstrace: Process 3744 detached
 <detached ...>

mikebriggs2k avatar Feb 19 '21 17:02 mikebriggs2k

Are you setting minrate/timeout?

james-antill avatar Feb 19 '21 18:02 james-antill

We hit the same problem when installing ROCm in a CentOS 7.9.2009 Docker container. It happens roughly once in every 20 runs.

master@:~> ps -eaf | grep 26373
master   20906 18586  0 12:33 pts/2    00:00:00 grep --color=auto 26373
root     26373   518  0 03:40 ?        00:00:00 /usr/bin/python /usr/bin/yum -y install rocm-openmp-sdk5.3.2
root     26388 26373  0 03:40 ?        00:00:00 /usr/bin/python /usr/libexec/urlgrabber-ext-down
root     26389 26373  0 03:40 ?        00:00:00 /usr/bin/python /usr/libexec/urlgrabber-ext-down
master@:~> sudo strace -p 26388
strace: Process 26388 attached
futex(0x2233bb0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 26388 detached
 <detached ...>
master@:~> sudo strace -p 26389
strace: Process 26389 attached
read(0, ^Cstrace: Process 26389 detached
 <detached ...>
master@:~>
master@:~> sudo strace -p 26373
strace: Process 26373 attached
wait4(18278, ^Cstrace: Process 26373 detached
 <detached ...>
master@:~>

Do we have any solution or workaround for this problem?

rponnuru avatar Nov 29 '22 20:11 rponnuru
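No fix is identified in this thread. As a stopgap for automation, one option is to run the install under a watchdog that kills the whole process group after a deadline and retries. The sketch below is an assumption-laden illustration rather than a recommended or upstream-endorsed solution: run_with_timeout, its parameters, and the timeout values are hypothetical, and killing yum is only reasonably safe if the hang is in the download phase, as the urlgrabber-ext-down traces suggest, and not in the middle of an RPM transaction.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout, retries=3):
    """Run `cmd`; if it exceeds `timeout` seconds, kill its process group and retry.

    Returns the exit code of the first attempt that finishes in time.
    Raises RuntimeError if every attempt times out.
    """
    for _ in range(retries):
        # start_new_session=True makes the child a process group leader,
        # so its helper processes can be killed together with it.
        proc = subprocess.Popen(cmd, start_new_session=True)
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # The group id equals the child's pid because it leads the new session.
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()  # reap the killed child
    raise RuntimeError("command timed out on all %d attempts" % retries)


# Example call (assumed values, adjust to your environment):
#   run_with_timeout(["yum", "-y", "install", "some-package"], timeout=1800)
```

The process group kill matters here because the stuck processes are children of yum; killing only the parent could leave the urlgrabber-ext-down helpers behind. The script itself needs Python 3, which is not the system Python on EL7.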