Kubernetes pod labels are sometimes missing
Describe the bug
Hi! I'm using Falco to monitor some specific syscalls of Kubernetes pods on a GKE cluster.
It seemed to work well at first, but I've noticed that some events had incomplete fields. These events:
- do not have k8s.pod.labels (shows up as <NA>)
- do not have k8s.pod.label[some.valid/label] (shows up as <NA>)
- however, do have: k8s.ns.name, k8s.pod.cni.json, k8s.pod.name, container.id, container.name, container.image.repository, container.image.tag
Upon inspecting the logs, I think a pod sandbox query sometimes fails, and the failed result then sticks.
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
...
(then this line keeps repeating:)
Checking IP address of container ad8d174831f6 with incomplete metadata (in context of a6fc074bf49c; state=2)
One thing I noticed: when I restart the Falco pod on that node, the labels are parsed fine.
My weak guesses (after quickly skimming through what I've seen) are:
- a timing issue? (the pod sandbox had just been created; maybe we query its status too soon?)
- cache misbehavior? (the pod sandbox is updated later, but we keep insisting on our first impression that it is "neither a container nor a pod sandbox"?)
- an entirely different issue in containerd?
One subtle issue: in these two log lines,
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
the latter should explain why the PodSandboxStatus call (not the ContainerStatus one) failed, but libs v0.14.3 had a bug here: it reused the status variable defined earlier instead of status_pod, so both lines print the same ContainerStatus error. This seems to have been fixed in the extensive refactoring since; a minimal sketch of the buggy pattern is below.
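(A self-contained sketch, not the actual libs source: every name here is a hypothetical stand-in for the real CRI engine code, which goes through gRPC ContainerStatus/PodSandboxStatus calls. It only reproduces the logging pattern described above.)

```cpp
// Sketch of the v0.14.3 logging bug: when the fallback PodSandboxStatus query
// also fails, the error message of the earlier ContainerStatus call ("status")
// is logged instead of the pod sandbox one ("status_pod").
#include <cstdio>
#include <string>

struct Status // stand-in for grpc::Status
{
	bool ok;
	std::string msg;
};

static void report_lookup(const std::string &container_id,
                          const Status &status,      // result of the ContainerStatus query
                          const Status &status_pod)  // result of the PodSandboxStatus query
{
	if(!status.ok)
	{
		std::printf("cri (%s): Status from ContainerStatus: (%s)\n",
		            container_id.c_str(), status.msg.c_str());

		if(!status_pod.ok)
		{
			// Bug: this should print status_pod.msg (why the pod sandbox lookup
			// failed); instead it repeats the ContainerStatus error, which is why
			// both log lines carry the same "not found" message.
			std::printf("cri (%s): id is neither a container nor a pod sandbox: %s\n",
			            container_id.c_str(), status.msg.c_str());
		}
	}
}

int main()
{
	report_lookup("21sa65q9wwiq",
	              {false, "an error occurred when try to find container \"21sa65q9wwiq\": not found"},
	              {false, "(the real PodSandboxStatus error is hidden by the bug)"});
	return 0;
}
```

The visible effect is exactly what the logs above show: the second line repeats the ContainerStatus error rather than reporting why the pod sandbox lookup actually failed.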
How to reproduce it
Sorry, I couldn't reproduce this consistently. It occurs from time to time with no discernible pattern.
Expected behaviour
k8s.pod.labels and k8s.pod.label[some.valid/label] are always filled.
Also, the log should look like this (this is the log when everything works and the above fields are filled):
cri (b4jswe6vtwr9): Performing lookup
cri_async (b4jswe6vtwr9): Starting synchronous lookup
cri (b4jswe6vtwr9): Status from ContainerStatus: (an error occurred when try to find container "b4jswe6vtwr9": not found)
cri_async (b4jswe6vtwr9): Source callback result=1
identify_category (131398) (pause): initial process for container, assigning CAT_CONTAINER
adding container [b4jswe6vtwr9] group: 65535
Screenshots
Environment
- Falco version: 0.37.1
- System info:
{
"machine": "x86_64",
"nodename": "falco-6kh8r",
"release": "5.15.0-1049-gke",
"sysname": "Linux",
"version": "#54-Ubuntu SMP Thu Jan 18 02:57:35 UTC 2024"
}
- Cloud provider or hardware configuration: GKE, Kubernetes v1.27.11-gke.1202000 ($ ctr version says 1.7.12-0ubuntu0~22.04.1~gke1)
- OS: Ubuntu 22.04.3 LTS
- Kernel: 5.15.0-1049-gke
- Installation method: falcosecurity/falco Helm chart (version 4.2.3), with the additional option --disable-cri-async (one way to pass this through the chart is sketched below)
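For reference, a hedged sketch of how that flag can be passed via the chart, assuming the chart's extra.args value; the values file name and namespace below are placeholders:

```yaml
# values-falco.yaml -- illustrative only; verify the keys against the chart version you deploy.
extra:
  args:
    - --disable-cri-async
```

Applied with something like helm upgrade --install falco falcosecurity/falco --version 4.2.3 --namespace falco -f values-falco.yaml.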
Additional context
These are some logs I found relevant.
Mesos container [21sa65q9wwiq],thread [365150], has likely malformed mesos task id [], ignoring
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
Parsing Container JSON={"container":{"Mounts":[],"User":"<NA>","cni_json":"","cpu_period":100000,"cpu_quota":0,"cpu_shares":1024,"cpuset_cpu_count":0,"created_time":19390065,
4831f6","image":"","imagedigest":"","imageid":"","imagerepo":"","imagetag":"","ip":"0.0.0.0","is_pod_sandbox":false,"labels":null,"lookup_state":2,"memory_limit":0,"metadata_
gs":[],"privileged":false,"swap_limit":0,"type":7}}
Filtering container event for failed lookup of 21sa65q9wwiq (but calling callbacks anyway)
identify_category (365151) (runc:[1:CHILD]): initial process for container, assigning CAT_CONTAINER
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
(NOTE: process tree is as follows.)
365130 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 21sa65q9wwiq...
365151 \_ /pause
...
Hi @Namnamseo, there is a new component released with the latest Falco version, called k8s-metacollector, which was developed for exactly such use cases. It reduces the cases where pod metadata is missing.
Here you can find the docs on how to install it using the falco chart: https://github.com/falcosecurity/charts/tree/master/charts/falco#k8s-metacollector
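For completeness, a hedged example of turning it on via the chart, assuming the collectors.kubernetes.enabled value described in those docs; the release name and namespace are placeholders:

```sh
# Illustrative only; check the linked chart docs for the exact values your chart version expects.
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set collectors.kubernetes.enabled=true
```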
Right, I've seen those. A standalone metadata collector would really improve the overall stability.
I only need the pod labels, so I was wondering if this can be done with just the container runtime integration.
@Namnamseo, once Falco 0.38.0 is out (very soon), it will be interesting to see whether the container runtime socket info extraction works better, since we improved it a bit. And as @alacuku stated, you also have the option to use the new k8s plugin.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten