
Tetragon: rework cgroups handling

tixxdz opened this issue 2 years ago

This PR improves Tetragon cgroups handling and process tracking.

Every process will have its own tracking Cgroup ID that should reflect the proper container.
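
As context, a minimal sketch of what such a tracking map could look like on the bpf side, using libbpf-style map definitions. The field names, sizes and max_entries below are illustrative assumptions, not Tetragon's actual definitions:

```c
/* Illustrative sketch only; the real tg_cgrps_tracking_map layout in
 * Tetragon may differ.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct cgroup_tracking_value {
	__u64 cgrpid_tracker;   /* tracking (ancestor) cgroup ID */
	__u32 level;            /* nesting level in the hierarchy */
	__u32 state;            /* e.g. NEW, RUNNING */
	char  name[128];        /* cgroup directory name (docker field) */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 32768);
	__type(key, __u64);     /* cgroup ID */
	__type(value, struct cgroup_tracking_value);
} tg_cgrps_tracking_map SEC(".maps");
```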

TODO

  • [x] Handle cgroup tracepoint: mkdir, rmdir, release, attach_task
  • [x] Discover cgroup configuration and store tracking cgroup IDs with their info inside tg_cgrps_tracking_map
  • [x] Store Tetragon cgroup and configuration in TetragonConf tg_conf_map
  • [x] Send cgroup events only when log-level=trace
  • [x] Switched to checking the task's cgroup ancestors during fork to get the tracking cgroup ID. This allows tracking all processes cloned after Tetragon starts, including processes for which we are missing parent information (see the sketch after this list).
  • [x] Use cilium/ebpf for raw tracepoints
  • [x] Handle and test both cgroupv1 and cgroupv2 hierarchies
  • [x] Fix failed unit tests
  • [x] Detect Cgroup mode including systemd and non-systemd installations
  • [x] Detect running mode: k8s, container, standalone systemd service, or user-launched program.
  • [x] Handle Cgroup namespaces during Cgroup probing and detection.
  • [x] Read the Cgroup ID from userspace and store it in tg_conf_map, then match it at cgroup_attach_task to improve Tetragon matching and avoid matching another pid.
  • [x] Make this PR transparent: if we do not have the tracking Cgroup ID, fall back to the old behavior.
  • [x] Do not fail if we can't detect the Cgroup configuration; log errors and make the cgroup bpf programs return
  • [x] Handle cgroup IDs being zero from bpf context, do not operate on cgroups nor query the tracking cgroup bpf map
  • [ ] Go unit tests
  • [x] If we can't detect the deployment mode or the Cgroup configuration, warn users.
  • [ ] Add user documentation on how to report their Cgroup environment
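
A rough sketch of the ancestor walk mentioned in the fork item above, assuming the tracking map sketched earlier and the kernel's bpf_get_current_ancestor_cgroup_id() helper (cgroupv2, kernel 5.6+). The level bound and return convention are illustrative, not the code from this PR:

```c
#define TG_MAX_CGRP_LEVELS 32  /* illustrative bound on nesting depth */

/* Not the actual Tetragon code: at fork time, walk the current task's
 * cgroup ancestors (bounded loop for the verifier) and pick the first
 * one already present in tg_cgrps_tracking_map.
 */
static __always_inline __u64 find_tracking_cgrpid(void)
{
	struct cgroup_tracking_value *trk;
	__u64 cgrpid;
	int level;

#pragma unroll
	for (level = TG_MAX_CGRP_LEVELS; level >= 0; level--) {
		cgrpid = bpf_get_current_ancestor_cgroup_id(level);
		if (!cgrpid)    /* level is deeper than the task's cgroup */
			continue;
		trk = bpf_map_lookup_elem(&tg_cgrps_tracking_map, &cgrpid);
		if (trk)
			return cgrpid;  /* found the tracking ancestor */
	}
	return 0;  /* nothing tracked: fall back to the old behavior */
}
```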

Followups

  • [ ] Report flag errors if the bpf get_cgroup_id() helper returns zero. Right now it is used only in bpf tracking, and we handle this case.
  • [ ] Identify the k8s cgroup ID ancestor and use it as the cgroup root for CRDs and other filtering logic. The aim is to make it easy to identify whether a process belongs to the k8s hierarchy, and if so apply our CRDs only to those processes (see the sketch after this list).
  • [ ] See if we have to set the tracking Cgroup ID for other generated events if it is not already set. Right now we set it during fork and execve. This may overlap with the existing-processes scan (/proc scanning).
  • [ ] Dump cgroup, deployment and other runtime configuration so users can report it, see bugtool
  • [ ] If we can't detect the deployment mode or the Cgroup configuration, in addition to warning users, suggest that they send us the bugtool dump info; add the necessary logic to bugtool if needed.
  • [ ] Currently, we detect when a cgroup is created (state: NEW). Then, we associate processes with this cgroup at exec/fork time (the state transitions from NEW->RUNNING). At some point between these two events, the container runtime will do a cgroup_transfer_tasks and cgroup_attach_task. Hooking into these events might offer more information (e.g., a pre-running state; similarly, when a container has finished and all processes have terminated but its cgroup is still being removed). Can this happen?
  • [ ] Metrics, usage counting and statistics about the used bpf cgroups maps
  • [ ] Standardize bpf helpers to access bpf cgroup maps
  • [ ] Expose the cgroup ID in events (?)
  • [ ] Add more tests: bpf unit tests and e2e tests (e.g., GKE, kind)
  • [ ] Right now we may get spammed by debug-level messages; reduce them. (This only happens when the user runs tetragon in debug mode.)
  • [ ] Move cgroup handlers into their own sensor pkg/sensors/cgroup/ (is it worth it?)
  • [ ] There are some paths in the kernel that wake up a newly created task (wake_up_new_task), which is where we insert our hooks. There are paths other than regular task creation that end up calling this function (e.g., io_uring and other kernel work-queues). Will our hooks work correctly in these cases?
    • https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L222
    • https://elixir.bootlin.com/linux/v5.19/source/fs/io-wq.c#L727
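
A possible shape for the k8s-ancestor check from the second follow-up above, reusing the illustrative TG_MAX_CGRP_LEVELS bound from the earlier sketch. A configured k8s root cgroup ID stored by user space in tg_conf_map is an assumption of this sketch, not something the PR already implements:

```c
/* Hypothetical: returns true if one of the current task's cgroup
 * ancestors matches the configured k8s root cgroup ID (which user
 * space would store in tg_conf_map). Names and bound are illustrative.
 */
static __always_inline bool task_in_k8s_hierarchy(__u64 k8s_root_cgrpid)
{
	int level;

	if (!k8s_root_cgrpid)
		return false;   /* not configured: not a k8s deployment */

#pragma unroll
	for (level = 0; level <= TG_MAX_CGRP_LEVELS; level++) {
		if (bpf_get_current_ancestor_cgroup_id(level) == k8s_root_cgrpid)
			return true;
	}
	return false;
}
```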

Signed-off-by: Djalal Harouni [email protected]

tixxdz · Jul 11 '22 16:07

Patches summary:

  1. Patches 1 -> 11 (bpf:make: compile new bpf cgroup programs): adds the bpf cgroup base logic, from structures to bpf programs.

  2. Patches 12 -> 14 (bpf:cgroup: fix task tracking Cgroup ID during execve event): still in the bpf part; adds a helper to set the cgroup tracking ID and updates both the fork- and execve-related bpf programs to set and use the cgroup ID.

  3. Patch 15 (bpf:cgroups: kernels prior to 5.5 have different kernfs_node id): on kernels 5.4 and prior we were not able to get the cgroup ID due to the kernfs node having a different layout. This never worked in current Tetragon; we just didn't notice since the cgrpid is not being used: it is set to msg.Kube.Cgrpid, but we don't use it, we use msg.Kube.Docker, which is the cgroup name.

  4. Patch 16 (pkg:sensors: load new cgroup bpf programs): loads the new cgroup bpf programs.

  5. Patches 17 - 24 (pkg:sensors: store Cgroupfs Magic in Tetragon Conf): prepares the go part by adding the related bpf maps, structs, access helpers and grpc ones, and finally the cgroups package.

  6. Patch 25 (bpf:cgroup: limit the number of maximum nested cgroups): limits the maximum number of nested cgroups.

  7. Patches 26 - 30 (pkg:cgroups: improve logging and how we report various typed values): probes cgroup bpf programs to detect the cgroup configuration, and improves deployment mode detection and logging in general.

  8. Patches 31 - 33 (pkg:cgroups: ensure that subsystem index is in-band): parses the cgroup controllers and makes sure that we select the ones that are usually set up, memory and pids; these are used in cgroupv1 setups. On cgroupv2, when we read the cgroup ID we first try the bpf_get_current_cgroup_id() helper of the default hierarchy; the controller is then a fallback, but we also use it to read the cgroup name. The controller is passed by subsystem index to the bpf helpers (a sketch of this fallback follows the list).

  9. Patches 34 - 36 (bpf:cgroup: ensure that tracked cgroups belong to the right hierarchy): improves the cgroup logic by making sure that we operate on the right cgroup hierarchy and controllers. Also improves Tetragon's own detection by checking the cgroup ID that is obtained from user space.

  10. Patches 37 - 38 (bpf:cgroups: do nothing if we fail to read cgroup ID): improves how we gather cgroup info during execve events by checking the cgroup level and using it as a tracking point to return the cgroup ID, which is the current or an ancestor cgroup ID depending on the level. Also makes sure to return the cgroup name or docker field only if we are at the same tracking level; anything deeper we ignore, and if the cgroup was tracked during cgroup_mkdir() then its name is already in the bpf map and we use it as the docker field.

  11. Main patch - 39 (tetragon: improve how we lookup container ID or docker field): this is the main patch at the user space level; it improves how we look up container IDs. Please see the patch git log for a detailed explanation.

  12. Patches 40 - 42 (pkg:cgroups: make cgroups info logging more user friendly): more cgroup package improvements with regard to logging and debugging messages.

  13. Patches 43 -> last: testing some parts of the cgroups golang package.
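
To make item 8 concrete, a sketch of the default-hierarchy-first, controller-fallback read, assuming vmlinux.h plus bpf_core_read.h and a subsystem index provided by user space via tg_conf_map. This is not the code from the patches; names and parameters are illustrative:

```c
#include <bpf/bpf_core_read.h>

/* Sketch only: prefer the cgroupv2 default-hierarchy helper, otherwise
 * read the cgroup through the selected cgroupv1 controller (memory or
 * pids). kernfs_node::id has a different layout before 5.5 (see patch
 * 15), which this sketch ignores.
 */
static __always_inline __u64 tg_read_cgroup_id(struct task_struct *task,
					       bool unified, __u32 subsys_idx)
{
	struct cgroup_subsys_state *css;
	struct css_set *cgroups;
	struct cgroup *cgrp;

	if (unified)  /* cgroupv2 unified hierarchy */
		return bpf_get_current_cgroup_id();

	/* cgroupv1 fallback: controller chosen by subsystem index */
	cgroups = BPF_CORE_READ(task, cgroups);
	bpf_core_read(&css, sizeof(css), &cgroups->subsys[subsys_idx]);
	cgrp = BPF_CORE_READ(css, cgroup);
	return BPF_CORE_READ(cgrp, kn, id);
}
```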

tixxdz · Aug 25 '22 17:08

The failure on old kernels (4.19.256) was tracked down to this function being too big: https://github.com/cilium/tetragon/pull/225/commits/fb5b73fb1fa3a61c69c71f09297c5a8c5768ef1d#diff-ab356ac37169a22ef1641e19014bbe9f635d530780bcad21a18caf96e2ec00d0R585 . Will fix it.
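
For reference, kernels before 5.2 limit a single bpf program to 4096 instructions, which is typically what a "function too big" load failure on 4.19 boils down to. One common way out (a generic sketch, not necessarily the fix that landed here) is to split the program and chain the pieces with a tail call; all names below are illustrative:

```c
/* Generic sketch of splitting a too-large program with a tail call;
 * the program array is populated with "part 2" by the loader.
 */
struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} tg_cgrp_calls SEC(".maps");

SEC("raw_tracepoint/cgroup_mkdir")
int tg_cgroup_mkdir_part1(struct bpf_raw_tracepoint_args *ctx)
{
	/* ... first half of the original logic ... */
	bpf_tail_call(ctx, &tg_cgrp_calls, 0);  /* jump to part 2 */
	return 0;  /* reached only if the tail call fails */
}
```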

tixxdz · Aug 25 '22 17:08

This PR was split into multiple issues and PRs to get it merged:

Tasks:

  • [ ] Merge the basic BPF cgroups tracking functionality, tracked here: https://github.com/cilium/tetragon/issues/477
  • [ ] Merge the follow-up cgroups tasks from this PR

tixxdz · Oct 14 '22 10:10

Marking this as draft. Once https://github.com/cilium/tetragon/issues/477 is merged, we will continue on it.

kkourt · Oct 14 '22 15:10