runc pod respawn will destroy Qemu processes (with pid 1)

Description

We are using runc in our k8s deployments, which run an OpenStack hypervisor on top. On our compute nodes, libvirt pods are responsible for creating QEMU instances through the libvirtd service. Our installation stack was mainly on CentOS 7; recently we began rolling out new clusters on Ubuntu 22.04. We found that on our recently installed clusters (runc 1.1.7), killing the libvirt pod also kills the QEMU instances.

We need to revert this change.

Steps to reproduce the issue

1. Set up a k8s OpenStack deployment with Ubuntu 22.04 on the nodes, or a StarlingX deployment with a recent runc version.
2. Create VMs on the compute nodes.
3. On the compute node, delete the libvirt pod (managed by a daemonset); it will be respawned.
4. The VMs are gone.

Describe the results you received and expected

We expect the instances to still be running in the respawned pod:

openstack-node001-libvirt-pod:/# virsh list
 Id    Name                           State
----------------------------------------------------
 1     instance-0000269e              running
 2     instance-0000269b              running
 3     instance-00002698              running
 4     instance-00002695              running
 5     instance-00002692              running

But we got the following:

openstack-node001-libvirt-pod:/# virsh list
 Id    Name                           State
----------------------------------------------------
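
The two listings above can also be compared automatically. The following is only a minimal sketch, assuming the text of `virsh list` has been captured before and after the pod is deleted; the function names `parse_virsh_list` and `lost_domains` are hypothetical helpers, not part of libvirt or runc.

```python
def parse_virsh_list(output: str) -> dict:
    """Map domain name -> state from the text printed by `virsh list`."""
    domains = {}
    for line in output.splitlines():
        # Columns are Id, Name, State; the state may contain spaces ("shut off").
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank lines and the dashed separator
        # Data rows start with a numeric Id ("-" for inactive domains);
        # this skips the header row and any shell prompt line.
        if not (parts[0].isdigit() or parts[0] == "-"):
            continue
        domains[parts[1]] = parts[2].strip()
    return domains


def lost_domains(before: str, after: str) -> list:
    """Names of domains that were running before but are not running after."""
    was = parse_virsh_list(before)
    now = parse_virsh_list(after)
    return sorted(
        name
        for name, state in was.items()
        if state == "running" and now.get(name) != "running"
    )
```

Feeding it the two outputs shown in this report would return all five instance names, since none of them is listed after the respawn.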

What version of runc are you using?

Our current version is 1.1.7. Prior to version 1.1.6 we did not have this issue.

Host OS information

PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Host kernel information

Linux openstack-node001 5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Nov 07 '23 17:11 senolcolak

we already found the problem

Could you please give a detailed explanation of how this commit impacts your case?

Nov 08 '23 05:11 lifubang

@senolcolak it looks like you found that it's commit 10cfd816317789da4393d70ead92ec7c203e1926 that breaks your use case, am I right? Can you explain in more detail why?

Cc @haircommander

Nov 10 '23 00:11 kolyshkin

@kolyshkin sorry for my late reply. I could not isolate the problem in a separate environment, and the link I shared before was wrong. The real problem is in this commit: https://github.com/opencontainers/runc/pull/3823/commits/e4ce94e291235da85ef5840100b36bbc16772fa5

The problem arises when I create a process that has to be attached to the host system (a QEMU instance): the lifetime of the process depends on the lifetime of the pod.

Basically, we need a process that runs in the host environment. Even if the pod is deleted and the cgroup is wiped, the process should continue to run.
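
To see whether a given QEMU process would be affected when the pod's cgroup is wiped, one can check which cgroup it belongs to. The following is only a rough sketch, not runc's own logic: it assumes a cgroup v2 host (the default on Ubuntu 22.04) and assumes that pod cgroups contain the string `kubepods` in their path, which depends on the kubelet configuration. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def cgroup_v2_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        fields = line.split(":", 2)
        if len(fields) != 3:
            continue  # ignore malformed lines
        hierarchy_id, controllers, path = fields
        # The unified (v2) hierarchy is reported as "0::<path>".
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def in_pod_cgroup(proc_cgroup_text: str, marker: str = "kubepods") -> bool:
    """True if the process sits under a pod cgroup.

    The "kubepods" marker is an assumption about how the kubelet
    names pod cgroups on the node.
    """
    path = cgroup_v2_path(proc_cgroup_text)
    return path is not None and marker in path


def read_proc_cgroup(pid: int) -> str:
    """Read /proc/<pid>/cgroup for a live process (Linux only)."""
    return Path(f"/proc/{pid}/cgroup").read_text()
```

On a compute node, `in_pod_cgroup(read_proc_cgroup(pid))` for a QEMU PID would indicate whether that process lives inside the pod's cgroup and therefore shares its fate when the cgroup is removed.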

Nov 27 '23 18:11 senolcolak