Kubernetes pod labels are sometimes missing

Open Namnamseo opened this issue 1 year ago • 3 comments

Describe the bug

Hi! I'm using Falco to monitor some specific syscalls of Kubernetes pods on a GKE cluster.

It seemed to work well at first, but I've noticed that some events had incomplete fields. These events:

  • do not have k8s.pod.labels (shows up as <NA>)
  • do not have k8s.pod.label[some.valid/label] (shows up as <NA>)
  • however, have:
    • k8s.ns.name
    • k8s.pod.cni.json
    • k8s.pod.name
    • container.id
    • container.name
    • container.image.repository
    • container.image.tag
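
To spot affected events in bulk, a small filter over Falco's JSON output could look like the sketch below. It assumes JSON output is enabled, that each event carries its fields under `output_fields`, and that the rule output includes `k8s.pod.name` and `k8s.pod.labels`. A missing value is treated as either null or the literal `<NA>`; the function names are made up for illustration.

```python
import json

NA = "<NA>"
POD_FIELD = "k8s.pod.name"
LABEL_FIELD = "k8s.pod.labels"


def is_missing(value):
    """Treat null, empty string, and the literal <NA> as 'not filled'."""
    return value is None or value == "" or value == NA


def has_missing_labels(event):
    """True when the pod name is present but the pod labels are not."""
    fields = event.get("output_fields") or {}
    return not is_missing(fields.get(POD_FIELD)) and is_missing(fields.get(LABEL_FIELD))


def find_incomplete(lines):
    """Return parsed events (one JSON object per line) that lack pod labels."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as debug logs
        if isinstance(event, dict) and has_missing_labels(event):
            hits.append(event)
    return hits
```

Feeding it the lines of a Falco JSON log (for example `find_incomplete(open(path))`, with `path` being wherever the output is stored) returns only the events that show the symptom described above.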

Upon inspecting the logs, I think a pod sandbox query sometimes fails, and once it does, it stays that way.

cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
...
(the following line repeats:)
Checking IP address of container ad8d174831f6 with incomplete metadata (in context of a6fc074bf49c; state=2)
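
To find every container that hit this path, the debug log can be scanned for the `Source callback result=N` lines. The sketch below assumes the line format shown in the excerpt above and takes `result=2` as the failing outcome, since it follows "returning successful=false" there. The function names are illustrative.

```python
import re
from collections import defaultdict

# Matches e.g. "cri_async (21sa65q9wwiq): Source callback result=2"
CALLBACK_RE = re.compile(
    r"cri_async \((?P<cid>[0-9a-zA-Z]+)\): Source callback result=(?P<result>\d+)"
)


def lookup_results(log_lines):
    """Map each container id to the list of callback results seen, in order."""
    results = defaultdict(list)
    for line in log_lines:
        match = CALLBACK_RE.search(line)
        if match:
            results[match.group("cid")].append(int(match.group("result")))
    return dict(results)


def failed_ids(log_lines, failed_result=2):
    """Container ids whose most recent lookup ended with the failing result."""
    return sorted(
        cid
        for cid, results in lookup_results(log_lines).items()
        if results[-1] == failed_result
    )
```

Comparing the returned ids against the `container.id` of events with missing labels would confirm or rule out the link between the failed lookup and the missing fields.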

One thing I noticed is, when I restart the Falco pod on that node, it parses the labels fine.

My weak guesses (after quickly skimming through what I've seen) are:

  • a timing issue? (pod sandbox had just been created, maybe we're too soon to query its status?)
  • cache misbehavior? (the pod sandbox later gets updated, but we keep insisting on our first impression that it is "neither a container nor a pod sandbox"?)
  • a whole other issue in containerd?
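
To make the first two guesses concrete, here is a toy model of a metadata cache that retries a failed lookup after a short delay instead of keeping the failure forever. It is only a sketch of the hypothesis, not how the library works: `LookupCache`, `fetch`, and `failure_ttl` are all hypothetical names.

```python
import time


class LookupCache:
    """Toy cache: successful lookups are kept, failed ones expire after a TTL."""

    def __init__(self, fetch, failure_ttl=5.0, clock=time.monotonic):
        self._fetch = fetch              # callable: container_id -> metadata or None
        self._failure_ttl = failure_ttl  # seconds before a failed lookup is retried
        self._clock = clock
        self._ok = {}
        self._failed_at = {}

    def get(self, container_id):
        if container_id in self._ok:
            return self._ok[container_id]
        failed_at = self._failed_at.get(container_id)
        if failed_at is not None and self._clock() - failed_at < self._failure_ttl:
            return None  # recent failure: do not query the runtime again yet
        metadata = self._fetch(container_id)
        if metadata is None:
            # Remember the failure, but only until the TTL runs out.
            self._failed_at[container_id] = self._clock()
            return None
        self._failed_at.pop(container_id, None)
        self._ok[container_id] = metadata
        return metadata
```

With an infinite `failure_ttl`, this model reproduces the symptom: a lookup that runs too early fails once and is never corrected until the process restarts.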

One subtle issue: in these two log lines,

cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found

the latter should explain why the PodSandboxStatus call (not the ContainerStatus call) failed, but library v0.14.3 had a bug here: it used the status defined earlier instead of status_pod. This seems to have been fixed in the refactoring done since.
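
The pattern behind that logging bug can be shown with a minimal sketch. This is not the library's code (which is C++); only the variable names `status` and `status_pod` come from the description above, while the `Status` type, the function names, and the sandbox error text are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Status:
    ok: bool
    message: str


PREFIX = "id is neither a container nor a pod sandbox: "


def lookup_message_buggy(status, status_pod):
    # Bug pattern: reports the earlier ContainerStatus error, so both
    # log lines end with the same text and the sandbox error is lost.
    return PREFIX + status.message


def lookup_message_fixed(status, status_pod):
    # Reports the error from the PodSandboxStatus call that actually failed last.
    return PREFIX + status_pod.message


status = Status(False, 'an error occurred when try to find container "21sa65q9wwiq": not found')
status_pod = Status(False, "hypothetical PodSandboxStatus error")
print(lookup_message_buggy(status, status_pod))
print(lookup_message_fixed(status, status_pod))
```

The buggy variant yields exactly the duplicated message seen in the two log lines, which is why the real reason for the sandbox lookup failure is not visible in these logs.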

How to reproduce it

Sorry, I couldn't reproduce this consistently. It occurs from time to time with no apparent pattern.

Expected behaviour

k8s.pod.labels and k8s.pod.label[some.valid/label] are always filled.

Also, the log should look like this. (This is the log when everything is normal and the fields above are filled.)

cri (b4jswe6vtwr9): Performing lookup
cri_async (b4jswe6vtwr9): Starting synchronous lookup
cri (b4jswe6vtwr9): Status from ContainerStatus: (an error occurred when try to find container "b4jswe6vtwr9": not found)
cri_async (b4jswe6vtwr9): Source callback result=1  
identify_category (131398) (pause): initial process for container, assigning CAT_CONTAINER   
adding container [b4jswe6vtwr9] group: 65535

Screenshots

Environment

  • Falco version: 0.37.1
  • System info:
{
  "machine": "x86_64",
  "nodename": "falco-6kh8r",
  "release": "5.15.0-1049-gke",
  "sysname": "Linux",
  "version": "#54-Ubuntu SMP Thu Jan 18 02:57:35 UTC 2024"
}
  • Cloud provider or hardware configuration
    • GKE, Kubernetes v1.27.11-gke.1202000
    • $ ctr version says 1.7.12-0ubuntu0~22.04.1~gke1
  • OS: Ubuntu 22.04.3 LTS
  • Kernel: 5.15.0-1049-gke
  • Installation method: falcosecurity/falco Helm chart (of version 4.2.3)
    • additional options: --disable-cri-async

Additional context

These are some logs I found relevant.

Mesos container [21sa65q9wwiq],thread [365150], has likely malformed mesos task id [], ignoring
cri (21sa65q9wwiq): Performing lookup
cri_async (21sa65q9wwiq): Starting synchronous lookup
cri (21sa65q9wwiq): Status from ContainerStatus: (an error occurred when try to find container "21sa65q9wwiq": not found)
cri (21sa65q9wwiq): id is neither a container nor a pod sandbox: an error occurred when try to find container "21sa65q9wwiq": not found
cri (21sa65q9wwiq): Failed to get metadata, returning successful=false
cri_async (21sa65q9wwiq): Source callback result=2
notify_new_container (21sa65q9wwiq): created CONTAINER_JSON event, queuing to inspector
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
Parsing Container JSON={"container":{"Mounts":[],"User":"<NA>","cni_json":"","cpu_period":100000,"cpu_quota":0,"cpu_shares":1024,"cpuset_cpu_count":0,"created_time":19390065,
4831f6","image":"","imagedigest":"","imageid":"","imagerepo":"","imagetag":"","ip":"0.0.0.0","is_pod_sandbox":false,"labels":null,"lookup_state":2,"memory_limit":0,"metadata_
gs":[],"privileged":false,"swap_limit":0,"type":7}}
Filtering container event for failed lookup of 21sa65q9wwiq (but calling callbacks anyway)
identify_category (365151) (runc:[1:CHILD]): initial process for container, assigning CAT_CONTAINER
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0
adding container [21sa65q9wwiq] user 0
adding container [21sa65q9wwiq] group: 0

(NOTE: process tree is as follows.)
365130 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id 21sa65q9wwiq...
365151 \_ /pause
...
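
The "Parsing Container JSON=" line is the quickest way to see what the inspector ended up with. A small helper like the one below could summarize it. It assumes the whole payload sits on a single log line (the paste above is cut off, so it cannot be parsed as shown) and only reads keys that are visible in that excerpt; the function name is made up.

```python
import json

PREFIX = "Parsing Container JSON="


def container_summary(log_line):
    """Summarize a 'Parsing Container JSON=' log line, or return None if unusable."""
    _, sep, payload = log_line.partition(PREFIX)
    if not sep:
        return None  # not a Container JSON line
    try:
        container = json.loads(payload)["container"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # truncated or unexpected payload
    if not isinstance(container, dict):
        return None
    return {
        "lookup_state": container.get("lookup_state"),
        "has_labels": bool(container.get("labels")),
        "has_image": bool(container.get("image")),
        "is_pod_sandbox": container.get("is_pod_sandbox"),
    }
```

For the failing container this would report `lookup_state` 2 with no labels and no image, matching the `result=2` callback earlier in the same log.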

Namnamseo, Apr 02 '24 05:04

Hi @Namnamseo, there is a new component called k8s-metacollector, released with the latest Falco version, which was developed for such use cases. It reduces the cases where pod metadata is missing.

Here you can find the docs on how to install it using the falco chart: https://github.com/falcosecurity/charts/tree/master/charts/falco#k8s-metacollector

alacuku, Apr 02 '24 07:04

Right, I've seen those. A standalone metadata collector would really improve overall stability.

I only need the pod labels, so I was wondering if this can be done with just the container runtime integration.

Namnamseo, Apr 02 '24 10:04

@Namnamseo, Falco 0.38.0 will be out very soon, and it would be interesting to see whether the container runtime socket info extraction works better with it, since we improved it a bit. And as @alacuku stated, you also have the option to use the new k8s plugin.

incertum, May 15 '24 17:05

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana, Aug 13 '24 22:08

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

poiana, Sep 12 '24 22:09