Check failed: entry.size() == 3u (5 vs. 3) on version 1.4
Describe the bug
We have upgraded Dragonfly from 1.2.1 to 1.4 and are getting the following error:
I20230630 12:47:27.494930 1 init.cc:69] dragonfly running in opt mode.
I20230630 12:47:27.494998 1 dfly_main.cc:618] Starting dragonfly df-v1.4.0-6d4d740d6e2a060cbbbecd987ee438cc6e60de79
F20230630 12:47:27.495254 1 dfly_main.cc:493] Check failed: entry.size() == 3u (5 vs. 3)
Check failure stack trace:
@ 0x5640638a1ce3 google::LogMessage::SendToLog()
@ 0x56406389a4a7 google::LogMessage::Flush()
@ 0x56406389be2f google::LogMessageFatal::~LogMessageFatal()
@ 0x564063389c94 main
@ 0x7f479885d083 __libc_start_main
@ 0x56406338d08e _start
@ (nil) (unknown)
SIGABRT received at time=1688129247 on cpu 0
PC: @ 0x7f479887c00b (unknown) raise
[failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7f479885b941 while already in AbslFailureSignalHandler()
SIGSEGV received at time=1688129247 on cpu 0
PC: @ 0x7f479885b941 (unknown) abort
We are using the Helm chart without any configuration and Kubernetes 1.26.3
To Reproduce
Steps to reproduce the behavior:
- Install 1.2.1 in a k8s cluster running 1.26.3 using helm
- Upgrade to version 1.4.0
- See error
Environment (please complete the following information):
- OS: Ubuntu
- Kernel:
Linux dragonfly-7d57f468b9-wrqfc 5.4.0-132-generic #148-Ubuntu SMP Mon Oct 17 16:02:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Containerized?: Kubernetes
- Dragonfly Version: 1.4.0
@fibis I succeeded in running 1.4.0 on GKE 1.25.8-gke.1000. I wonder what's different on your system. Do you know if it reproduces for you with minikube?
In any case, I submitted #1503, which will solve the crash issue in v1.5.0. However, it does not solve the root issue of Dragonfly incorrectly parsing the cgroups file (/proc/self/cgroup). If you can, please paste that file here as it looks when running v1.2.1 (you must ssh into the pod and replace self with the PID of dragonfly on that pod).
Ok, based on this: https://kubernetes.io/docs/concepts/architecture/cgroups/#check-cgroup-version I verified that my GKE cluster uses cgroups v1. My pod's cgroup file looks like this:
12:hugetlb:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
11:pids:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
10:rdma:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
9:devices:/kubepods/pode55e4d8a-182b-4d3c-91e0-166cf27c83cf/9b0040ed122a24557fc0e04f6241efe88cd36534f480f1bc3821a14d5083ae07
....
Can you please run stat -fc %T /sys/fs/cgroup/ inside a pod in your cluster? I suspect that you use cgroups v2 and we recognize it incorrectly.
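For reference, here is a minimal sketch (not Dragonfly's actual detection code, just an illustration of what stat -fc %T reports) of checking the filesystem magic of /sys/fs/cgroup programmatically: cgroups v2 mounts the unified hierarchy there as cgroup2fs, while cgroups v1 typically mounts a tmpfs with per-controller subdirectories.

```cpp
// Sketch: detect cgroup v1 vs v2 by the filesystem type of /sys/fs/cgroup.
#include <sys/statfs.h>
#include <linux/magic.h>
#include <cstdio>

int main() {
  struct statfs fs;
  if (statfs("/sys/fs/cgroup", &fs) != 0) {
    std::perror("statfs");
    return 1;
  }
  if (fs.f_type == CGROUP2_SUPER_MAGIC) {
    std::puts("cgroup2fs (unified hierarchy, cgroups v2)");
  } else if (fs.f_type == TMPFS_MAGIC) {
    std::puts("tmpfs (per-controller mounts, cgroups v1)");
  } else {
    std::puts("unexpected filesystem type");
  }
  return 0;
}
```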
Hi @romange, we are using a 1.26.3 cluster. Maybe there are some changes in 1.26 that are causing this? We are running it on our dev systems on symbiosis; they could have some special settings.
cgroup file:
root@dragonfly-7d57f468b9-wrqfc:/data# cat /proc/1/cgroup
12:freezer:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
11:blkio:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
10:perf_event:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
9:hugetlb:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
8:memory:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
7:rdma:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
6:cpuset:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
5:devices:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
4:cpu,cpuacct:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
3:pids:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
2:net_cls,net_prio:/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
1:name=systemd:/system.slice/containerd.service/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
0::/kubepods-besteffort-pod9d71c162_3779_4fc5_a45a_4538afd1934a.slice:cri-containerd:1373940dd1436bc894457bd0504a4dc8449cc78c90f1724296d813259e30aa04
stat -fc %T /sys/fs/cgroup/ returns only tmpfs
Ok, actually your cgroup file explains the bug, and my fix was not optimal. We expect exactly two ':' delimiters, but your cgroup file has more than two, i.e. the line splits into more than 3 parts. In any case, that strict check is not important for our use case, because we actually only care about having at least 3 parts. I will issue a real fix soon.
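To illustrate the idea, here is a minimal sketch (an assumed reconstruction, not the actual dfly_main.cc code; ParseCgroupLine and CgroupEntry are hypothetical names) of a parser that splits each /proc/self/cgroup line on only the first two ':' characters, so extra colons inside the path, as in the cri-containerd entries above, stay part of the path instead of tripping an entry.size() == 3 check.

```cpp
#include <iostream>
#include <string>

struct CgroupEntry {
  std::string hierarchy_id;  // e.g. "12"
  std::string controllers;   // e.g. "freezer" (empty on cgroups v2)
  std::string path;          // the rest of the line; may itself contain ':'
};

// Split on at most the first two ':' characters so extra colons remain in `path`.
bool ParseCgroupLine(const std::string& line, CgroupEntry* out) {
  size_t first = line.find(':');
  if (first == std::string::npos) return false;
  size_t second = line.find(':', first + 1);
  if (second == std::string::npos) return false;
  out->hierarchy_id = line.substr(0, first);
  out->controllers = line.substr(first + 1, second - first - 1);
  out->path = line.substr(second + 1);
  return true;
}

int main() {
  // Entry with extra ':' characters, shaped like the cri-containerd lines above.
  const std::string line =
      "12:freezer:/kubepods-besteffort-pod9d71c162.slice:cri-containerd:1373940d";
  CgroupEntry e;
  if (ParseCgroupLine(line, &e)) {
    std::cout << "id=" << e.hierarchy_id << " controllers=" << e.controllers
              << "\npath=" << e.path << '\n';
  }
  return 0;
}
```

With this approach the freezer line above yields exactly three fields, and the full "...slice:cri-containerd:<id>" suffix is preserved as the path.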
Actually, I would also appreciate it if you could list all the files under /sys/fs/cgroup/memory/ on that pod, @fibis. This will help me understand whether ':' is part of the group name.
@romange sure, here is the list of the files:
root@dragonfly-7d57f468b9-wrqfc:/data# ls -al /sys/fs/cgroup/memory/
total 0
drwxr-xr-x 2 root root 0 Jul 2 17:17 .
dr-xr-xr-x 14 root root 360 Jul 2 17:17 ..
-rw-r--r-- 1 root root 0 Jul 2 17:17 cgroup.clone_children
--w--w--w- 1 root root 0 Jul 2 17:17 cgroup.event_control
-rw-r--r-- 1 root root 0 Jul 2 17:18 cgroup.procs
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.failcnt
--w------- 1 root root 0 Jul 3 07:39 memory.force_empty
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.failcnt
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 3 07:39 memory.kmem.slabinfo
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.tcp.failcnt
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.tcp.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.tcp.max_usage_in_bytes
-r--r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.tcp.usage_in_bytes
-r--r--r-- 1 root root 0 Jul 2 17:17 memory.kmem.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.max_usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 3 07:39 memory.move_charge_at_immigrate
-r--r--r-- 1 root root 0 Jul 2 17:17 memory.numa_stat
-rw-r--r-- 1 root root 0 Jul 2 17:18 memory.oom_control
---------- 1 root root 0 Jul 3 07:39 memory.pressure_level
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.soft_limit_in_bytes
-r--r--r-- 1 root root 0 Jul 2 17:17 memory.stat
-rw-r--r-- 1 root root 0 Jul 3 07:39 memory.swappiness
-r--r--r-- 1 root root 0 Jul 2 17:17 memory.usage_in_bytes
-rw-r--r-- 1 root root 0 Jul 2 17:17 memory.use_hierarchy
-rw-r--r-- 1 root root 0 Jul 3 07:39 notify_on_release
-rw-r--r-- 1 root root 0 Jul 3 07:39 tasks
I do not know how to replicate your environment and reproduce cgroup names like yours, but we submitted a workaround that should at least solve the crash issue. It is available starting from v1.6.0.