
vizier-pem pods fail with "Could not find CGroup base path" on k3s & Debian 11

Open FutureMatt opened this issue 2 years ago • 3 comments

When running Pixie on k3s v1.22.9+k3s1 on Debian 11 "Bullseye", cgroup metadata cannot be found, which results in no pod metadata being found or used.

It first materialises as:

E20220513 14:07:36.625142 55870 cgroup_metadata_reader.cc:47] Failed to create path resolver. Falling back to legacy path resolver. [error = Not Found : Could not find CGroup base path]
E20220513 14:07:36.625186 55870 cgroup_metadata_reader.cc:55] Failed to create legacy path resolver. This is not recoverable. [error = Not Found : Could not find CGroup base path]

followed by the pod logs filling with

W20220513 14:07:41.633569 55870 state_manager.cc:269] Failed to read PID info for pod=589745f1-53f6-4777-9914-acd052d6ab2d, cid=faca31ddd1f8e0ee34707063a9db4abf84fec044ec853e7000e445e8ed4b0f53 [msg=No valid cgroup path resolver.]
W20220513 14:07:41.633586 55870 state_manager.cc:269] Failed to read PID info for pod=90a7c5b3-094e-4c6f-ac5b-27b01f3899ff, cid=294321e457b9b24dea279884c0cda2f4ddd77a960f5049aa74eef30d4fa005eb [msg=No valid cgroup path resolver.]
W20220513 14:07:41.633596 55870 state_manager.cc:269] Failed to read PID info for pod=83844439-1e89-47e7-97b0-904212c31ea6, cid=acd8d1ac75174aa0856d15c83c6aad2fd681bde09b5aac3f4123529bf31f1c94 [msg=No valid cgroup path resolver.]
W20220513 14:07:41.633606 55870 state_manager.cc:269] Failed to read PID info for pod=cdb690dd-ff29-4319-8684-259013512641, cid=36ad29164b73fcb4f99b75f5633410e49878d3782ac30791a73d198da718bd9e [msg=No valid cgroup path resolver.]
W20220513 14:07:41.633615 55870 state_manager.cc:269] Failed to read PID info for pod=83844439-1e89-47e7-97b0-904212c31ea6, cid= [msg=No valid cgroup path resolver.]

To Reproduce: Set up a k3s cluster on Debian 11 and install Pixie as per the user guide.

Expected behavior: Pod cgroup metadata should be found.

Logs: pixie_logs_20220513151327.zip

App information (please complete the following information):

  • Pixie version: 0.7.9+Distribution.a47d77a.20220510221149.1
  • K8s cluster version: v1.22.9+k3s1
  • Node Kernel version: 5.10.0-14-amd64

Additional context: The following directory listings should help debug the logic for finding cgroup information.

root@homelab-202:~# ls -al /sys/fs/cgroup/
total 0
dr-xr-xr-x 12 root root 0 May 13 14:38 .
drwxr-xr-x  7 root root 0 May 13 14:38 ..
-r--r--r--  1 root root 0 May 13 14:38 cgroup.controllers
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.max.depth
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.max.descendants
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.procs
-r--r--r--  1 root root 0 May 13 14:38 cgroup.stat
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.subtree_control
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.threads
-rw-r--r--  1 root root 0 May 13 14:38 cpu.pressure
-r--r--r--  1 root root 0 May 13 14:38 cpuset.cpus.effective
-r--r--r--  1 root root 0 May 13 14:38 cpuset.mems.effective
-r--r--r--  1 root root 0 May 13 14:38 cpu.stat
drwxr-xr-x  2 root root 0 May 13 14:38 dev-hugepages.mount
drwxr-xr-x  2 root root 0 May 13 14:38 dev-mqueue.mount
drwxr-xr-x  2 root root 0 May 13 14:38 init.scope
-rw-r--r--  1 root root 0 May 13 14:38 io.cost.model
-rw-r--r--  1 root root 0 May 13 14:38 io.cost.qos
-rw-r--r--  1 root root 0 May 13 14:38 io.pressure
-r--r--r--  1 root root 0 May 13 14:38 io.stat
drwxr-xr-x  4 root root 0 May 13 14:38 kubepods
-r--r--r--  1 root root 0 May 13 14:38 memory.numa_stat
-rw-r--r--  1 root root 0 May 13 14:38 memory.pressure
-r--r--r--  1 root root 0 May 13 14:38 memory.stat
drwxr-xr-x  2 root root 0 May 13 14:38 sys-fs-fuse-connections.mount
drwxr-xr-x  2 root root 0 May 13 14:38 sys-kernel-config.mount
drwxr-xr-x  2 root root 0 May 13 14:38 sys-kernel-debug.mount
drwxr-xr-x  2 root root 0 May 13 14:38 sys-kernel-tracing.mount
drwxr-xr-x 21 root root 0 May 13 14:53 system.slice
drwxr-xr-x  3 root root 0 May 13 14:39 user.slice
root@homelab-202:~# ls -al /sys/fs/cgroup/kubepods/
total 0
drwxr-xr-x  4 root root 0 May 13 14:38 .
dr-xr-xr-x 12 root root 0 May 13 14:38 ..
drwxr-xr-x 19 root root 0 May 13 14:38 besteffort
drwxr-xr-x 12 root root 0 May 13 14:38 burstable
-r--r--r--  1 root root 0 May 13 14:38 cgroup.controllers
-r--r--r--  1 root root 0 May 13 14:38 cgroup.events
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.freeze
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.max.depth
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.max.descendants
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.procs
-r--r--r--  1 root root 0 May 13 14:38 cgroup.stat
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.subtree_control
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.threads
-rw-r--r--  1 root root 0 May 13 14:38 cgroup.type
-rw-r--r--  1 root root 0 May 13 14:38 cpu.max
-rw-r--r--  1 root root 0 May 13 14:38 cpu.pressure
-rw-r--r--  1 root root 0 May 13 14:38 cpuset.cpus
-r--r--r--  1 root root 0 May 13 14:38 cpuset.cpus.effective
-rw-r--r--  1 root root 0 May 13 14:38 cpuset.cpus.partition
-rw-r--r--  1 root root 0 May 13 14:38 cpuset.mems
-r--r--r--  1 root root 0 May 13 14:38 cpuset.mems.effective
-r--r--r--  1 root root 0 May 13 14:38 cpu.stat
-rw-r--r--  1 root root 0 May 13 14:38 cpu.weight
-rw-r--r--  1 root root 0 May 13 14:38 cpu.weight.nice
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.1GB.current
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.1GB.events
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.1GB.events.local
-rw-r--r--  1 root root 0 May 13 14:38 hugetlb.1GB.max
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.1GB.rsvd.current
-rw-r--r--  1 root root 0 May 13 14:38 hugetlb.1GB.rsvd.max
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.2MB.current
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.2MB.events
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.2MB.events.local
-rw-r--r--  1 root root 0 May 13 14:38 hugetlb.2MB.max
-r--r--r--  1 root root 0 May 13 14:38 hugetlb.2MB.rsvd.current
-rw-r--r--  1 root root 0 May 13 14:38 hugetlb.2MB.rsvd.max
-rw-r--r--  1 root root 0 May 13 14:38 io.max
-rw-r--r--  1 root root 0 May 13 14:38 io.pressure
-r--r--r--  1 root root 0 May 13 14:38 io.stat
-rw-r--r--  1 root root 0 May 13 14:38 io.weight
-r--r--r--  1 root root 0 May 13 14:38 memory.current
-r--r--r--  1 root root 0 May 13 14:38 memory.events
-r--r--r--  1 root root 0 May 13 14:38 memory.events.local
-rw-r--r--  1 root root 0 May 13 14:38 memory.high
-rw-r--r--  1 root root 0 May 13 14:38 memory.low
-rw-r--r--  1 root root 0 May 13 14:38 memory.max
-rw-r--r--  1 root root 0 May 13 14:38 memory.min
-r--r--r--  1 root root 0 May 13 14:38 memory.numa_stat
-rw-r--r--  1 root root 0 May 13 14:38 memory.oom.group
-rw-r--r--  1 root root 0 May 13 14:38 memory.pressure
-r--r--r--  1 root root 0 May 13 14:38 memory.stat
-r--r--r--  1 root root 0 May 13 14:38 memory.swap.current
-r--r--r--  1 root root 0 May 13 14:38 memory.swap.events
-rw-r--r--  1 root root 0 May 13 14:38 memory.swap.high
-rw-r--r--  1 root root 0 May 13 14:38 memory.swap.max
-r--r--r--  1 root root 0 May 13 14:38 pids.current
-r--r--r--  1 root root 0 May 13 14:38 pids.events
-rw-r--r--  1 root root 0 May 13 14:38 pids.max
-r--r--r--  1 root root 0 May 13 14:38 rdma.current
-rw-r--r--  1 root root 0 May 13 14:38 rdma.max

FutureMatt commented May 13 '22 14:05

Digging a little deeper into the source of cgroup_path_resolver.cc, the following lines look to be the issue:

  // Different hosts may mount different cgroup dirs. Try a couple for robustness.
  const std::vector<std::string> cgroup_dirs = {"cpu,cpuacct", "cpu", "pids"};

https://github.com/pixie-io/pixie/blob/main/src/shared/metadata/cgroup_path_resolver.cc#L34

In my bug above, none of the three folders ("cpu,cpuacct", "cpu", "pids") exists; instead, kubepods sits directly inside /sys/fs/cgroup. Would updating the list of folders to "cpu,cpuacct", "cpu", "pids", "" be an appropriate fix?

(My logs above refer to cgroup_metadata_reader.cc, but that code appears to have been refactored into cgroup_path_resolver.cc since my version of Pixie was built.)
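To illustrate the idea, here is a minimal sketch of the proposed change, not the actual Pixie implementation: FindKubepodsBase and the probing loop are hypothetical, but the candidate list shows how an empty entry would cover a cgroup v2 layout where kubepods sits directly under /sys/fs/cgroup.

  // Hedged sketch only: FindKubepodsBase is a made-up helper, not Pixie's API.
  #include <filesystem>
  #include <optional>
  #include <string>
  #include <vector>

  std::optional<std::string> FindKubepodsBase(const std::string& sysfs_cgroup) {
    // Different hosts may mount different cgroup dirs. Try a couple for robustness.
    // The trailing "" is the proposed addition: probe <sysfs_cgroup>/kubepods directly,
    // which matches the cgroup v2 layout shown in the listing above.
    const std::vector<std::string> cgroup_dirs = {"cpu,cpuacct", "cpu", "pids", ""};
    for (const auto& dir : cgroup_dirs) {
      const auto candidate = std::filesystem::path(sysfs_cgroup) / dir / "kubepods";
      if (std::filesystem::exists(candidate)) {
        return candidate.string();
      }
    }
    return std::nullopt;
  }

On the k3s node above, the first three candidates would miss and the empty entry would resolve to /sys/fs/cgroup/kubepods, while on a cgroup v1 host one of the controller directories would match first.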

FutureMatt commented May 13 '22 15:05

I also get the same issue, with different paths, on Bullseye with RKE: a slightly older kernel (5.10.0-13-amd64) and Kubernetes 1.21.7, deployed using RKE 1.3.

Interestingly, the /sys/fs/cgroup directory looks like this:

dr-xr-xr-x 15 root root 0 May 20 20:56 .
drwxr-xr-x  7 root root 0 May 20 20:51 ..
-r--r--r--  1 root root 0 May 20 20:56 cgroup.controllers
-rw-r--r--  1 root root 0 May 20 20:56 cgroup.max.depth
-rw-r--r--  1 root root 0 May 20 20:56 cgroup.max.descendants
-rw-r--r--  1 root root 0 May 20 20:56 cgroup.procs
-r--r--r--  1 root root 0 May 20 20:56 cgroup.stat
-rw-r--r--  1 root root 0 May 25 16:45 cgroup.subtree_control
-rw-r--r--  1 root root 0 May 20 20:56 cgroup.threads
drwxr-xr-x  2 root root 0 May 20 20:56 cpuacct,cpu
-rw-r--r--  1 root root 0 May 20 20:56 cpu.pressure
-r--r--r--  1 root root 0 May 20 20:56 cpuset.cpus.effective
-r--r--r--  1 root root 0 May 20 20:56 cpuset.mems.effective
-r--r--r--  1 root root 0 May 20 20:56 cpu.stat
drwxr-xr-x  3 root root 0 May 20 20:56 dev-hugepages.mount
drwxr-xr-x  3 root root 0 May 20 20:56 dev-mqueue.mount
drwxr-xr-x  3 root root 0 May 20 20:56 init.scope
-rw-r--r--  1 root root 0 May 20 20:56 io.cost.model
-rw-r--r--  1 root root 0 May 20 20:56 io.cost.qos
-rw-r--r--  1 root root 0 May 20 20:56 io.pressure
-r--r--r--  1 root root 0 May 20 20:56 io.stat
drwxr-xr-x  4 root root 0 May 20 20:56 kubepods.slice
-r--r--r--  1 root root 0 May 20 20:56 memory.numa_stat
-rw-r--r--  1 root root 0 May 20 20:56 memory.pressure
-r--r--r--  1 root root 0 May 20 20:56 memory.stat
drwxr-xr-x  3 root root 0 May 20 20:56 -.mount
drwxr-xr-x  2 root root 0 May 20 20:56 net_prio,net_cls
drwxr-xr-x  3 root root 0 May 20 20:56 sys-fs-fuse-connections.mount
drwxr-xr-x  3 root root 0 May 20 20:56 sys-kernel-config.mount
drwxr-xr-x  3 root root 0 May 20 20:56 sys-kernel-debug.mount
drwxr-xr-x  3 root root 0 May 20 20:56 sys-kernel-tracing.mount
drwxr-xr-x 62 root root 0 May 25 00:00 system.slice
drwxr-xr-x  4 root root 0 May 25 16:36 user.slice

The pod info is hidden away inside kubepods.slice with a full path for a pod container looking like /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podf89ccbb1_f420_432f_a140_fadc63324828.slice/docker-4628f1cf84cef3883080bf3be2b604c33559b70ae440994d3e9e3087ef1f7765.scope.
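For reference, the systemd cgroup driver builds that path from the QoS class, the pod UID (with dashes replaced by underscores), and a "<runtime>-<container-id>.scope" leaf. A small illustrative sketch follows; SystemdPodContainerPath is a made-up helper, and it only covers the burstable layout shown above, not guaranteed pods, which sit directly under kubepods.slice.

  // Illustrative only: reconstructs the burstable-pod path from the listing above.
  #include <algorithm>
  #include <iostream>
  #include <string>

  std::string SystemdPodContainerPath(const std::string& qos, std::string pod_uid,
                                      const std::string& runtime,
                                      const std::string& container_id) {
    // systemd uses '-' as a hierarchy separator in slice names,
    // so the pod UID's dashes become underscores.
    std::replace(pod_uid.begin(), pod_uid.end(), '-', '_');
    return "/sys/fs/cgroup/kubepods.slice/kubepods-" + qos + ".slice/kubepods-" + qos +
           "-pod" + pod_uid + ".slice/" + runtime + "-" + container_id + ".scope";
  }

  int main() {
    // Prints the same path as the example above.
    std::cout << SystemdPodContainerPath(
                     "burstable", "f89ccbb1-f420-432f-a140-fadc63324828", "docker",
                     "4628f1cf84cef3883080bf3be2b604c33559b70ae440994d3e9e3087ef1f7765")
              << "\n";
  }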

I'll dig a little deeper into this tomorrow and see if I can cobble together a pull request that could cover both of these extra cases.

FutureMatt commented May 25 '22 16:05

@FutureMatt, awesome, thanks for taking a look! Any luck fixing this issue?

zasgar commented Jun 15 '22 23:06

I have the same problem. Would you tell me how to fix it? Thanks!

Ma-chengyu commented Sep 28 '22 09:09

Likely related to: #377, #635

zasgar commented Oct 28 '22 15:10

Likely addressed by #635 (commit 98189f6e939a9e7787ab31feca7ce3f7633a44ed), but following up with testing on a similar environment to confirm.

oazizi000 commented Dec 08 '22 18:12

Was able to run Pixie on k3s on Debian 11 with the latest cgroup fixes, so closing this out. Please reopen if you still run into the same issue!

aimichelle commented Dec 08 '22 18:12