
Check failed: entry.size() == 3u (5 vs. 3) on version 1.4

fibis opened this issue • 6 comments

Describe the bug We have upgraded Dragonfly from 1.2.1 to 1.4 and are getting the following error:

I20230630 12:47:27.494930     1 init.cc:69] dragonfly running in opt mode.
I20230630 12:47:27.494998     1 dfly_main.cc:618] Starting dragonfly df-v1.4.0-6d4d740d6e2a060cbbbecd987ee438cc6e60de79
F20230630 12:47:27.495254     1 dfly_main.cc:493] Check failed: entry.size() == 3u (5 vs. 3) 
 Check failure stack trace: 
    @     0x5640638a1ce3  google::LogMessage::SendToLog()
    @     0x56406389a4a7  google::LogMessage::Flush()
    @     0x56406389be2f  google::LogMessageFatal::~LogMessageFatal()
    @     0x564063389c94  main
    @     0x7f479885d083  __libc_start_main
    @     0x56406338d08e  _start
    @              (nil)  (unknown)
 SIGABRT received at time=1688129247 on cpu 0 
PC: @     0x7f479887c00b  (unknown)  raise
[failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7f479885b941 while already in AbslFailureSignalHandler()
 SIGSEGV received at time=1688129247 on cpu 0 
PC: @     0x7f479885b941  (unknown)  abort

We are using the Helm chart without any custom configuration, on Kubernetes 1.26.3.

To Reproduce Steps to reproduce the behavior:

  1. Install 1.2.1 in a k8s cluster running 1.26.3 using helm
  2. Upgrade to version 1.4.0
  3. See error

Environment (please complete the following information):

  • OS: Ubuntu
  • Kernel: Linux dragonfly-7d57f468b9-wrqfc 5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version: 1.4.0

fibis commented on Jul 02 '23 17:07

@fibis I succeeded in running 1.4.0 on GKE 1.25.8-gke.1000. I wonder what's different on your system. Do you know if it reproduces for you with minikube?

In any case, I submitted #1503, which will solve the crash issue in v1.5.0. However, it does not solve the root issue of Dragonfly incorrectly recognizing the cgroups file (/proc/self/cgroup). If you can, please paste that file here as it looks when running v1.2.1 (you must ssh into the pod and replace self with the PID of dragonfly on that pod).

romange commented on Jul 02 '23 18:07

Ok, based on this: https://kubernetes.io/docs/concepts/architecture/cgroups/#check-cgroup-version I verified that my GKE cluster uses cgroups v1. My pod's cgroup file looks like this:

12:hugetlb:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
11:pids:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
10:rdma:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
9:devices:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
....

Can you please run stat -fc %T /sys/fs/cgroup/ inside a pod in your cluster? I suspect that you use cgroups v2 and we recognize it incorrectly.
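
(For illustration only, a minimal C++ sketch of how cgroup v2 can be detected by checking the filesystem magic of /sys/fs/cgroup/, which is what stat -fc %T reports; this is not Dragonfly's actual detection code.)

// Minimal sketch (assumption, not Dragonfly's detection code): report whether
// /sys/fs/cgroup is mounted as cgroup v2 by checking the filesystem magic,
// which `stat -fc %T /sys/fs/cgroup/` prints as "cgroup2fs".
#include <sys/statfs.h>
#include <linux/magic.h>   // CGROUP2_SUPER_MAGIC
#include <cstdio>

bool IsCgroupV2(const char* path = "/sys/fs/cgroup/") {
  struct statfs fs;
  if (statfs(path, &fs) != 0) return false;  // assume v1 if the call fails
  return fs.f_type == CGROUP2_SUPER_MAGIC;   // "tmpfs" here indicates v1 (hybrid)
}

int main() {
  std::printf("cgroup v2: %s\n", IsCgroupV2() ? "yes" : "no");
}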

romange commented on Jul 02 '23 18:07

Hi @romange, we are using a 1.26.3 cluster. Maybe there are some changes in 1.26 that are causing this? We are running it on our dev systems on symbiosis; they could have some special settings.

cgroup file:

root@dragonfly-7d57f468b9-wrqfc:/data# cat /proc/1/cgroup 
12:freezer:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
11:blkio:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
10:perf_event:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
9:hugetlb:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
8:memory:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
7:rdma:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
6:cpuset:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
5:devices:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
4:cpu,cpuacct:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
3:pids:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
2:net_cls,net_prio:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
1:name=systemd:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
0::/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04

stat -fc %T /sys/fs/cgroup/ is returning only tmpfs

fibis commented on Jul 02 '23 18:07

Ok, your cgroup file actually explains the bug, and my fix was not optimal. We expect only two ':' delimiters, but your cgroup file has more than two, i.e. more than 3 parts per line. In any case, that is not important for our use case, because we only care that there are at least 3 parts. I will issue a real fix soon.
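
(For illustration only, a minimal sketch of the more lenient parsing described above: split each /proc/self/cgroup line only on the first two ':' characters, so extra colons such as the cri-containerd suffix stay inside the path field. This is not the actual Dragonfly patch.)

// Minimal sketch (assumed parsing, not the actual fix): split a
// /proc/self/cgroup line into exactly three fields -- hierarchy id,
// controller list, and path -- using only the first two ':' delimiters,
// so that extra colons inside the path (e.g. ":cri-containerd:<id>")
// do not produce additional fields.
#include <array>
#include <optional>
#include <string>
#include <iostream>

std::optional<std::array<std::string, 3>> ParseCgroupLine(const std::string& line) {
  auto first = line.find(':');
  if (first == std::string::npos) return std::nullopt;
  auto second = line.find(':', first + 1);
  if (second == std::string::npos) return std::nullopt;
  return std::array<std::string, 3>{
      line.substr(0, first),                       // hierarchy id, e.g. "12"
      line.substr(first + 1, second - first - 1),  // controllers, e.g. "freezer"
      line.substr(second + 1)};                    // path, may itself contain ':'
}

int main() {
  std::string line =
      "12:freezer:/kubepods-besteffort-pod.slice:cri-containerd:1373940d...";
  if (auto parts = ParseCgroupLine(line))
    std::cout << "path = " << (*parts)[2] << '\n';
}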

romange commented on Jul 02 '23 18:07

Actually, I would also appreciate it if you could list all the files under /sys/fs/cgroup/memory/ on that pod, @fibis. This will help me understand whether ':' is part of the group name.

romange commented on Jul 02 '23 18:07

@romange sure, here is the list of the files:

root@dragonfly-7d57f468b9-wrqfc:/data# ls -al /sys/fs/cgroup/memory/
total 0
drwxr-xr-x  2 root root   0 Jul  2 17:17 .
dr-xr-xr-x 14 root root 360 Jul  2 17:17 ..
-rw-r--r--  1 root root   0 Jul  2 17:17 cgroup.clone_children
--w--w--w-  1 root root   0 Jul  2 17:17 cgroup.event_control
-rw-r--r--  1 root root   0 Jul  2 17:18 cgroup.procs
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.failcnt
--w-------  1 root root   0 Jul  3 07:39 memory.force_empty
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.kmem.failcnt
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.kmem.limit_in_bytes
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.kmem.max_usage_in_bytes
-r--r--r--  1 root root   0 Jul  3 07:39 memory.kmem.slabinfo
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.kmem.tcp.failcnt
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.kmem.tcp.limit_in_bytes
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.kmem.tcp.max_usage_in_bytes
-r--r--r--  1 root root   0 Jul  2 17:17 memory.kmem.tcp.usage_in_bytes
-r--r--r--  1 root root   0 Jul  2 17:17 memory.kmem.usage_in_bytes
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.limit_in_bytes
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.max_usage_in_bytes
-rw-r--r--  1 root root   0 Jul  3 07:39 memory.move_charge_at_immigrate
-r--r--r--  1 root root   0 Jul  2 17:17 memory.numa_stat
-rw-r--r--  1 root root   0 Jul  2 17:18 memory.oom_control
----------  1 root root   0 Jul  3 07:39 memory.pressure_level
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.soft_limit_in_bytes
-r--r--r--  1 root root   0 Jul  2 17:17 memory.stat
-rw-r--r--  1 root root   0 Jul  3 07:39 memory.swappiness
-r--r--r--  1 root root   0 Jul  2 17:17 memory.usage_in_bytes
-rw-r--r--  1 root root   0 Jul  2 17:17 memory.use_hierarchy
-rw-r--r--  1 root root   0 Jul  3 07:39 notify_on_release
-rw-r--r--  1 root root   0 Jul  3 07:39 tasks

fibis commented on Jul 03 '23 08:07

I do not know how to replicate your environment and reproduce cgroup names like yours, but we submitted a workaround which should at least solve the crash issue. It is available starting from v1.6.0.

romange commented on Jul 13 '23 07:07