Kubernetes pod labels are sometimes missing
Describe the bug
Hi! I'm using Falco to monitor some specific syscalls of Kubernetes pods on a GKE cluster.
It seemed to work well at first, but I've noticed that some events had incomplete fields. These events:
- do not have k8s.pod.labels (shows up as <NA>)
- do not have k8s.pod.label[some.valid/label] (shows up as <NA>)
- however, do have: k8s.ns.name, k8s.pod.cni.json, k8s.pod.name, container.id, container.name, container.image.repository, container.image.tag
Upon inspecting the logs, I think a pod sandbox query sometimes fails, and the failed result then sticks.
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
...
(then this line keeps repeating:)
Checking IP address of container ad8d174831f6 with incomplete metadata (in context of a6fc074bf49c; state=2)
One thing I noticed: when I restart the Falco pod on that node, the labels are parsed fine.
My weak guesses (after quickly skimming through what I've seen) are:
- a timing issue? (the pod sandbox had just been created; maybe we query its status too soon?)
- cache misbehavior? (the pod sandbox is updated later, but we keep insisting on our first impression that it is "neither a container nor a pod sandbox"?)
- an entirely different issue in containerd?
One subtle issue: in these two log lines,
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
the latter should explain why the PodSandboxStatus call (not the ContainerStatus one) failed, but libs v0.14.3 had a bug here: it reused the status variable defined earlier instead of status_pod, so both lines print the same ContainerStatus error. This seems to have been fixed in the extensive refactoring since; a minimal sketch of the buggy pattern is below.
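(A self-contained sketch, not the actual libs source: every name here is a hypothetical stand-in for the real CRI engine code, which goes through gRPC ContainerStatus/PodSandboxStatus calls. It only reproduces the logging pattern described above.)

```cpp
// Sketch of the v0.14.3 logging bug: when the fallback PodSandboxStatus query
// also fails, the error message of the earlier ContainerStatus call ("status")
// is logged instead of the pod sandbox one ("status_pod").
#include <cstdio>
#include <string>

struct Status // stand-in for grpc::Status
{
	bool ok;
	std::string msg;
};

static void report_lookup(const std::string &container_id,
                          const Status &status,      // result of the ContainerStatus query
                          const Status &status_pod)  // result of the PodSandboxStatus query
{
	if(!status.ok)
	{
		std::printf("cri (%s): Status from ContainerStatus: (%s)\n",
		            container_id.c_str(), status.msg.c_str());

		if(!status_pod.ok)
		{
			// Bug: this should print status_pod.msg (why the pod sandbox lookup
			// failed); instead it repeats the ContainerStatus error, which is why
			// both log lines carry the same "not found" message.
			std::printf("cri (%s): id is neither a container nor a pod sandbox: %s\n",
			            container_id.c_str(), status.msg.c_str());
		}
	}
}

int main()
{
	report_lookup("21sa65q9wwiq",
	              {false, "an error occurred when try to find container \"21sa65q9wwiq\": not found"},
	              {false, "(the real PodSandboxStatus error is hidden by the bug)"});
	return 0;
}
```

The visible effect is exactly what the logs above show: the second line repeats the ContainerStatus error rather than reporting why the pod sandbox lookup actually failed.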
How to reproduce it
Sorry, I couldn't reproduce this consistently. It occurs from time to time with no discernible pattern.
Expected behaviour
k8s.pod.labels and k8s.pod.label[some.valid/label] are always filled.
Also, the log should look like this (this is the log when everything works and the above fields are filled):
cri (b4jswe6vtwr9): Performing lookup
cri_async (b4jswe6vtwr9): Starting synchronous lookup
cri (b4jswe6vtwr9): Status from ContainerStatus: (an error occurred when try to find container "b4jswe6vtwr9": not found)
cri_async (b4jswe6vtwr9): Source callback result=1
identify_category (131398) (pause): initial process for container, assigning CAT_CONTAINER
adding container [b4jswe6vtwr9] group: 65535
Screenshots
Environment
- Falco version: 0.37.1
- System info:
{
"machine": "x86_64",
"nodename": "falco-6kh8r",
"release": "5.15.0-1049-gke",
"sysname": "Linux",
"version": "#54-Ubuntu SMP Thu Jan 18 02:57:35 UTC 2024"
}
- Cloud provider or hardware configuration: GKE, Kubernetes v1.27.11-gke.1202000 ($ ctr version says 1.7.12-0ubuntu0~22.04.1~gke1)
- OS: Ubuntu 22.04.3 LTS
- Kernel: 5.15.0-1049-gke
- Installation method: falcosecurity/falco Helm chart (version 4.2.3), with the additional option --disable-cri-async (one way to pass this through the chart is sketched below)
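For reference, a hedged sketch of how that flag can be passed via the chart, assuming the chart's extra.args value; the values file name and namespace below are placeholders:

```yaml
# values-falco.yaml -- illustrative only; verify the keys against the chart version you deploy.
extra:
  args:
    - --disable-cri-async
```

Applied with something like helm upgrade --install falco falcosecurity/falco --version 4.2.3 --namespace falco -f values-falco.yaml.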
Additional context
These are some logs I found relevant.
Mesos container [21sa65q9wwiq],thread [365150], has likely malformed mesos task id [], ignoring
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
Parsing Container JSON={"container":{"Mounts":[],"User":"<NA>","cni_json":"","cpu_period":100000,"cpu_quota":0,"cpu_shares":1024,"cpuset_cpu_count":0,"created_time":19390065,
4831f6","image":"","imagedigest":"","imageid":"","imagerepo":"","imagetag":"","ip":"0.0.0.0","is_pod_sandbox":false,"labels":null,"lookup_state":2,"memory_limit":0,"metadata_
gs":[],"privileged":false,"swap_limit":0,"type":7}}
Filtering container event for failed lookup of 21sa65q9wwiq (but calling callbacks anyway)
identify_category (365151) (runc:[1:CHILD]): initial process for container, assigning CAT_CONTAINER
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
(NOTE: process tree is as follows.)
365130 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 21sa65q9wwiq...
365151 \_ /pause
...
Hi @Namnamseo, there is a new component released with the latest Falco version, called k8s-metacollector, which was developed for exactly such use cases. It reduces the cases where pod metadata is missing.
Here you can find the docs on how to install it using the falco chart: https://github.com/falcosecurity/charts/tree/master/charts/falco#k8s-metacollector
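For completeness, a hedged example of turning it on via the chart, assuming the collectors.kubernetes.enabled value described in those docs; the release name and namespace are placeholders:

```sh
# Illustrative only; check the linked chart docs for the exact values your chart version expects.
helm upgrade --install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set collectors.kubernetes.enabled=true
```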
Right, I've seen those. A standalone metadata collector would really improve the overall stability.
I only need the pod labels, so I was wondering if this can be done with just the container runtime integration.
@Namnamseo, once Falco 0.38.0 is out (very soon), it will be interesting to see whether the container runtime socket info extraction works better, since we improved it a bit. And as @alacuku stated, you also have the option to use the new k8s plugin.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten