APM agent fails platform detection on cgroupv2 based systems
Hello!
Widespread cgroupv2 adoption is around the corner, as many popular distributions already come with cgroupv2 enabled by default.
The problem is that the APM agent cannot detect that it is running on a cgroupv2-based system, because it tries to parse the cgroup file, which is fairly empty on a cgroupv2-enabled system. As a result, the agent reports "linux" as its platform and omits container and pod IDs, which causes multiple integrations inside the Elastic ecosystem to fail.
The current cgroup implementation reads the /proc/self/cgroup file to obtain the relevant container and pod IDs. On cgroupv2-based systems this file carries no such IDs (inside a container it typically holds only a single "0::/" line), and even on so-called cgroupv1 systems, relying on it is an undocumented "feature".
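To illustrate the difference, here is a minimal sketch of this kind of parsing. It is not the agent's actual code: the function name, the regular expression, and both sample strings (including the made-up container ID) are assumptions for illustration only.

```python
import re
from typing import Optional

# Assumed pattern: a 64-character hex ID as the last path segment, optionally
# preceded by a "runtime-" prefix and followed by a ".scope" suffix.
CGROUP_ID_RE = re.compile(r"(?:^|[/-])([0-9a-f]{64})(?:\.scope)?$")


def container_id_from_cgroup(text: str) -> Optional[str]:
    """Return the first container-like ID found in /proc/self/cgroup text."""
    for line in text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        match = CGROUP_ID_RE.search(parts[2])
        if match:
            return match.group(1)
    return None


# Hypothetical cgroupv1-style content; the container ID is made up.
V1_SAMPLE = (
    "12:memory:/kubepods/besteffort/"
    "pod68ee930a-2bd8-447b-8deb-426add7a2d09/" + "ab" * 32
)
# What a cgroupv2 container typically shows: no ID to extract.
V2_SAMPLE = "0::/"
```

With the cgroupv1-style sample the function finds the ID, while the cgroupv2 sample yields None, which matches the fallback to "linux" described above.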
Currently there is no agreed-upon method for obtaining this information from inside the container, and it is still an open issue for the Open Container runtime spec developers: https://github.com/opencontainers/runtime-spec/issues/1105
My proposal is to use a workaround similar to these:
- https://community.toradex.com/t/python-nullresource-error-when-running-torizoncore-builder-build/15240/4
- https://github.com/open-telemetry/opentelemetry-js-contrib/pull/1181/files
The /proc/self/mountinfo file still contains references to the necessary information (pod UID and container ID).
We could use this at least until a standard is agreed upon, and then switch it out.
An example of the relevant information from my latest stable Flatcar Linux system, with Containerd, Kubernetes 1.24.7 and cgroupv2 enabled:
5051 5044 259:8 /lib/kubelet/pods/68ee930a-2bd8-447b-8deb-426add7a2d09/etc-hosts /etc/hosts rw,relatime - xfs /dev/nvme0n1p3 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
5052 5046 259:8 /lib/kubelet/pods/68ee930a-2bd8-447b-8deb-426add7a2d09/containers/<pod_name>/2631d4a6 /dev/termination-log rw,relatime - xfs /dev/nvme0n1p3 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
5053 5044 259:8 /lib/containerd/io.containerd.grpc.v1.cri/sandboxes/199dafcfc5712cd1e9e49e94642e7df6cdf63356bbc3601e9115f26fd0d096e1/hostname /etc/hostname rw,relatime - xfs /dev/nvme0n1p3 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
5054 5044 259:8 /lib/containerd/io.containerd.grpc.v1.cri/sandboxes/199dafcfc5712cd1e9e49e94642e7df6cdf63356bbc3601e9115f26fd0d096e1/resolv.conf /etc/resolv.conf rw,relatime - xfs /dev/nvme0n1p3 rw,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota
The problem which prevents me from working on this issue is that I do not know which formats these lines can take on different systems.
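Based only on the sample lines above, the parsing could be sketched roughly as follows. The function name and both regular expressions are assumptions, and other runtimes may lay out these paths differently. Note also that in the sample the 64-character hex ID sits under "sandboxes/", so it may identify the pod sandbox rather than the application container; that would need to be verified.

```python
import re
from typing import Optional, Tuple

# Assumed patterns, derived only from the sample mountinfo lines above.
POD_UID_RE = re.compile(
    r"/pods/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})(?:/|$)"
)
HEX_ID_RE = re.compile(r"/([0-9a-f]{64})(?:/|$)")


def parse_mountinfo(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (pod_uid, hex_id) found in mountinfo text; None where absent."""
    pod_uid = None
    hex_id = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # The fourth field is the root of the mount within its filesystem.
        root = fields[3]
        if pod_uid is None:
            match = POD_UID_RE.search(root)
            if match:
                pod_uid = match.group(1)
        if hex_id is None:
            match = HEX_ID_RE.search(root)
            if match:
                hex_id = match.group(1)
    return pod_uid, hex_id


if __name__ == "__main__":
    try:
        with open("/proc/self/mountinfo") as f:
            print(parse_mountinfo(f.read()))
    except OSError:
        # Not on Linux, or procfs is not readable.
        print((None, None))
```

Only the mount root field is inspected, so a pod UID or hex ID that happens to appear in a mount point or in the mount options is ignored.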
What do you think? Would this method work?
https://github.com/elastic/apm/issues/523 tracks this as well, but I'm not hopeful for a central solution from there, as this proposal can be categorized as a "hack".
Any updates on this?
@andandrej Unfortunately we haven't had a chance to address this yet.