
Falco runtime error in k8s_replicationcontroller_handler_state for large k8s clusters (400+ nodes)

Open mac-abdon opened this issue 2 years ago • 21 comments

Describe the bug

We upgraded from falco:0.28.1 to falco:0.31.0 due to this bug in large k8s environments, and we seem to have hit a new runtime error. We're now seeing:

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
* Running falco-driver-loader with: driver=module, compile=yes, download=yes
* Unloading falco module, if present
* Looking for a falco module locally (kernel 5.4.149-73.259.amzn2.x86_64)
* Trying to download a prebuilt falco module from https://download.falco.org/driver/319368f1ad778691164d33d59945e00c5752cd27/falco_amazonlinux2_5.4.149-73.259.amzn2.x86_64_1.ko
* Download succeeded
* Success: falco module found and inserted
Rules match ignored syscall: warning (ignored-evttype):
         loaded rules match the following events: access,brk,close,cpu_hotplug,drop,epoll_wait,eventfd,fcntl,fstat,fstat64,futex,getcwd,getdents,getdents64,getegid,geteuid,getgid,getpeername,getresgid,getresuid,getrlimit,getsockname,getsockopt,getuid,infra,k8s,llseek,lseek,lstat,lstat64,mesos,mmap,mmap2,mprotect,munmap,nanosleep,notification,page_fault,poll,ppoll,pread,preadv,procinfo,pwrite,pwritev,read,readv,recv,recvmmsg,select,semctl,semget,semop,send,sendfile,sendmmsg,setrlimit,shutdown,signaldeliver,splice,stat,stat64,switch,sysdigevent,timerfd_create,write,writev;
         but these events are not returned unless running falco with -A
2022-02-17T22:44:13+0000: Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.

We downgraded to falco:0.30.0 which does not have the runtime error.

How to reproduce it

Upgrade to falco:0.31.0 and scale your Kubernetes cluster to around 400 nodes.

Expected behaviour

No runtime error

Screenshots

Environment

  • Falco version: 0.31.0
  • System info:
  • Cloud provider or hardware configuration: EKS v1.21.2 / ec2 instance size - r5dn.4xlarge
  • OS: Amazon Linux 2
  • Kernel: 5.4.149-73.259.amzn2.x86_64
  • Installation method: Kubernetes

Additional context

mac-abdon avatar Feb 18 '22 17:02 mac-abdon

I actually got the same issue even with fewer nodes (25-30): Runtime error: SSL Socket handler (k8s_replicationcontroller_handler_state): Connection closed.. Exiting.

Environment:

Falco version: 0.31.1
OpenShift: 4.8.35

Diliz avatar May 04 '22 15:05 Diliz

I am seeing a similar issue:

2022-05-12T01:16:36+0000: Runtime error: SSL Socket handler (k8s_namespace_handler_state): Connection closed.. Exiting.
  • Running falco-driver-loader for: falco version=0.31.0, driver version=319368f1ad778691164d33d59945e00c5752cd27
  • Running falco-driver-loader with: driver=bpf, compile=yes, download=yes

vnandha avatar May 12 '22 01:05 vnandha

Did you try running this with Falco's --k8s-node option?

jasondellaluce avatar Jun 06 '22 13:06 jasondellaluce

Did you try running this with Falco's --k8s-node option?

Yep, I already tried with and without the --k8s-node option. Usually the Falco service crashes on the first event fetch: it launches, waits about a minute, then crashes.

EDIT: Still happening in 0.32.0

Diliz avatar Jun 14 '22 07:06 Diliz

I too am experiencing this.

jimbobby5 avatar Jul 22 '22 15:07 jimbobby5

Hello there! This was due to the Falco operator, which does not set the node value correctly: the --k8s-node option is set, but the nodes were not fetched correctly by the operator...

EDIT: I switched to the official helm chart since this message was posted (can be found here: https://github.com/falcosecurity/charts )
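
(For anyone following along, a minimal sketch of switching to the official chart; the repo URL is the one published in that charts repository, and the release name and namespace below are only illustrative:)

# add the falcosecurity charts repo and install the chart as a release named "falco"
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco --namespace falco --create-namespace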

Diliz avatar Jul 22 '22 15:07 Diliz

Just landed here after seeing this in my environment as well, with 0.32.1.

IanRobertson-wpe avatar Jul 22 '22 19:07 IanRobertson-wpe

Are there any instructions on how to debug?

Version 0.32.1, custom-crafted manifests.

  • OK on physical hardware, with a custom k8s deployment
  • FAILS on Google Cloud with BPF enabled, as below:
  "system_info": {
    "machine": "x86_64",
    "nodename": "falco-7xnlh",
    "release": "5.4.170+",
    "sysname": "Linux",
    "version": "#1 SMP Sat Apr 2 10:06:05 PDT 2022"
  },
  "version": "0.32.1"
❯ k logs -n monitoring falco-x58wb -f -c falco --tail 300 -p
* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* Skipping compilation, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* eBPF probe located in /root/.falco/falco_cos_5.4.170+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Mon Jul 25 15:49:17 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.
    spec:
      containers:
      - args:
        - /usr/bin/falco
        - --cri
        - /run/containerd/containerd.sock
        - --cri
        - /run/crio/crio.sock
        - -K
        - /var/run/secrets/kubernetes.io/serviceaccount/token
        - -k
        - https://$(KUBERNETES_SERVICE_HOST)
        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)
        - -pk
        env:
        - name: FALCO_BPF_PROBE
          value: ""
        - name: FALCO_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

When changed to

        - -k
        - http://$(KUBERNETES_SERVICE_HOST):$(KUBERNETES_SERVICE_PORT)

Then it works without a problem. I do have a custom image. Obviously there is some issue with /var/run/secrets/kubernetes.io/serviceaccount/token.

  -K, --k8s-api-cert (<bt_file> | <cert_file>:<key_file[#password]>[:<ca_cert_file>])
                                Use the provided files names to authenticate user and (optionally) verify the K8S API server identity. Each 
                                entry must specify full (absolute, or relative to the current directory) path to the respective file. 
                                Private key password is optional (needed only if key is password protected). CA certificate is optional. 
                                For all files, only PEM file format is supported. Specifying CA certificate only is obsoleted - when single 
                                entry is provided for this option, it will be interpreted as the name of a file containing bearer token. 
                                Note that the format of this command-line option prohibits use of files whose names contain ':' or '#' 
                                characters in the file name.

Is there any way to get more debug info from the K8s auth?

epcim avatar Jul 26 '22 08:07 epcim

Hi @epcim, since Falco 0.32.1 you can get more debug info by adding the following args to Falco:

-o libs_logger.enabled=true -o libs_logger.severity=trace

jasondellaluce avatar Jul 26 '22 08:07 jasondellaluce

@jasondellaluce OK, so we know it has k8s connectivity and it reads some resources.

FYI:

  • I realised that with 0.32.1 the Falco pod memory limit needs to be increased from 512Mi to 1Gi (due to OOM kills); a sketch of one way to do this follows this list.
  • The cluster I run Falco on has 25 nodes and 850 pods (some nodes have only ~30 pods), so it is not small.
  • It reads and adds resources ([libs]: K8s [ADDED, ...) for the kinds Pod, Namespace, Service, ReplicaSet.
  • It fails on DaemonSets?
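
(A hedged sketch of one way to apply the memory bump mentioned in the first bullet, assuming the namespace, DaemonSet, and container names used elsewhere in this thread:)

# raise the falco container's memory request/limit on the DaemonSet to avoid the OOM kill
kubectl -n monitoring set resources daemonset/falco -c falco \
  --requests=memory=512Mi --limits=memory=1Gi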

Output from container:

* Setting up /usr/src links from host
* Running falco-driver-loader for: falco version=0.32.1, driver version=2.0.0+driver
* Running falco-driver-loader with: driver=bpf, compile=yes, download=yes
* Mounting debugfs
* Skipping download, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* Skipping compilation, eBPF probe is already present in /root/.falco/falco_cos_5.4.170+_1.o
* eBPF probe located in /root/.falco/falco_cos_5.4.170+_1.o
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Tue Jul 26 10:52:14 2022: [libs]: starting live capture
Tue Jul 26 10:52:15 2022: [libs]: cri: CRI runtime: containerd 1.4.8
Tue Jul 26 10:52:15 2022: [libs]: docker_async: Creating docker async source
Tue Jul 26 10:52:15 2022: [libs]: docker_async (6a20bc616b5a): No existing container info
Tue Jul 26 10:52:15 2022: [libs]: docker_async (6a20bc616b5a): Looking up info for container via socket /var/run/docker.sock
Tue Jul 26 10:52:15 2022: [libs]: docker_async (http://localhost/v1.24/containers/6a20bc616b5a/json): Fetching url
Tue Jul 26 10:52:15 2022: [libs]: docker_async (http://localhost/v1.24/containers/6a20bc616b5a/json): http_code=200
Tue Jul 26 10:52:15 2022: [libs]: docker_async (http://localhost/v1.24/containers/6a20bc616b5a/json): returning RESP_OK
...
...
...
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-786f859c79, c17e3f24-d389-4ddc-8c24-7531f9cb2682]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-7c46688b79, ca481853-3f88-4179-a90c-96f84256f5cb]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-854cc8b9cf, 9d75ddc1-faff-41c0-925e-0f0fbd1d9919]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-87d55bd54, 73149029-2d60-460d-af34-b5d7b6e47a37]
Tue Jul 26 10:42:49 2022: [libs]: K8s [ADDED, ReplicaSet, vk8s-ff4f6d37-22cf-44ee-af39-58c6b0f14dc3-f78486b56, 0d185513-64ef-40fe-baca-ef13ab283536]
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state)::collect_data(), checking connection to https://10.127.0.1
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state)::collect_data(), connected to https://10.127.0.1/apis/apps/v1/daemonsets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) check_enabled() enabling socket in collector
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state)::collect_data() [https://10.127.0.1], requesting data from /apis/apps/v1/daemonsets?pretty=false... m_blocking_socket=1, m_watching=0
Tue Jul 26 10:42:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) sending request to https://10.127.0.1/apis/apps/v1/daemonsets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_daemonset_handler_state) socket=169, m_ssl_connection=61592496
Tue Jul 26 10:42:49 2022: [libs]: GET /apis/apps/v1/daemonsets?pretty=false HTTP/1.1
User-Agent: falcosecurity-libs
Host: 10.127.0.1:443
Accept: */*
Authorization: Bearer eyJhbGciOiJSUzI1xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_daemonset_handler_state) Retrieving all data in blocking mode ...
Tue Jul 26 10:42:49 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.
Tue Jul 26 10:42:49 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.
Tue Jul 26 10:42:49 2022: [libs]: docker_async: Source destructor
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_deployment_handler_state) closing connection to https://10.127.0.1/apis/apps/v1/deployments?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_daemonset_handler_state) closing connection to https://10.127.0.1/apis/apps/v1/daemonsets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_replicaset_handler_state) closing connection to https://10.127.0.1/apis/apps/v1/replicasets?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_service_handler_state) closing connection to https://10.127.0.1/api/v1/services?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_replicationcontroller_handler_state) closing connection to https://10.127.0.1/api/v1/replicationcontrollers?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_pod_handler_state) closing connection to https://10.127.0.1/api/v1/pods?fieldSelector=status.phase!=Failed,status.phase!=Unknown,status.phase!=Succeeded,spec.nodeName=gke-gc01-int-ves-io-gc01-int-ves-io-p-cf25a10e-w3qi&pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_namespace_handler_state) closing connection to https://10.127.0.1/api/v1/namespaces?pretty=false
Tue Jul 26 10:42:49 2022: [libs]: Socket handler (k8s_node_handler_state) closing connection to https://10.127.0.1/api/v1/nodes?pretty=false

It should not be due to missing rights, but for clarity, here is the ClusterRole:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: {{ default "falco" .falco_app_name }}-read
  namespace: {{ if .falco_namespace }}{{ .falco_namespace }}{{ else }}monitoring{{ end }}
  labels:
    app: falco
    component: falco
    role: security
rules:
  - apiGroups:
      - extensions
      - ""
    resources:
      - nodes
      - namespaces
      - pods
      - replicationcontrollers
      - replicasets
      - services
      - daemonsets
      - deployments
      - events
      - configmaps
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - daemonsets
      - deployments
      - replicasets
      - statefulsets
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /healthz
      - /healthz/*
    verbs:
      - get


Here is the order in which things happen:

❯ k logs -n monitoring $POD -c falco -p | egrep '(ready: |Error )'
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_api_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:10 2022: [libs]: k8s_handler (k8s_node_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_node_handler_state) dependency (k8s_dummy_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_namespace_handler_state) dependency (k8s_node_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_namespace_handler_state) dependency (k8s_node_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_pod_handler_state) dependency (k8s_namespace_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_pod_handler_state) dependency (k8s_namespace_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_replicationcontroller_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_replicationcontroller_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:11 2022: [libs]: k8s_handler (k8s_service_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:12 2022: [libs]: k8s_handler (k8s_service_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:12 2022: [libs]: k8s_handler (k8s_replicaset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:48 2022: [libs]: k8s_handler (k8s_replicaset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:49 2022: [libs]: k8s_handler (k8s_daemonset_handler_state) dependency (k8s_pod_handler_state) ready: 1
Tue Jul 26 10:58:49 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.

epcim avatar Jul 26 '22 11:07 epcim

Well, the query to the K8s daemonsets endpoint works, so there must be some issue with the processing in Falco, IMO.

Steps to reproduce

k exec -ti -n monitoring falco-zzfmk -c falco -- sh

TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"; 
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT  https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/apps/v1/daemonsets?pretty=false | head


{"kind":"DaemonSetList","apiVersion":"apps/v1","metadata":{"resourceVersion":"427824185"},"items":[{"metadata":{"name":"gke-metadata-server","namespace":"kube-system","uid":"452584fa-614e-40bd-8477-3b0781ce9dfc","resourceVersion":"417631995","generation":10,"creationTimestamp":"2020-12-16T10:33:18Z","labels":{"addonmanager.kubernetes.io/mode":"Reconcile","k8s-app":"gke-metadata-server"},
...
...

BTW, I claimed it works with http://$(KUBERNETES_SERVICE_HOST):$(KUBERNETES_SERVICE_PORT), but that was not true, as it must always be https. The only difference is that Falco was not crashing when http:// was used (which I would call a bug, or I do not fully understand it: the K8s metadata fetch must fail entirely, yet nothing is written to the logs at "info" level).

KUBERNETES_SERVICE_HOST=10.127.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443

as it basically freezes here on the http query:

<docker collector works here..>
...
...
Tue Jul 26 11:57:45 2022: [libs]: k8s_handler (k8s_api_handler_state)::collect_data() [http://10.127.0.1:443], requesting data from /api?pretty=false... m_blocking_socket=1, m_watching=0
Tue Jul 26 11:57:45 2022: [libs]: k8s_handler (k8s_api_handler_state) sending request to http://10.127.0.1:443/api?pretty=false
Tue Jul 26 11:57:45 2022: [libs]: Socket handler (k8s_api_handler_state) socket=153, m_ssl_connection=0
Tue Jul 26 11:57:45 2022: [libs]: GET /api?pretty=false HTTP/1.1
User-Agent: falcosecurity-libs
Host: 10.127.0.1:443
Accept: */*

Tue Jul 26 11:57:45 2022: [libs]: Socket handler (k8s_api_handler_state) Retrieving all data in blocking mode ...

Additionally, I thought that with this in place, Falco would only query metadata for its own node..

        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)

but instead it reads everything everywhere, i.e. /apis/apps/v1/daemonsets?pretty=false

epcim avatar Jul 26 '22 11:07 epcim

I was able to solve this issue by cleaning up the old ReplicaSets (we had about 5k of these).
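
(A hedged sketch of one way to do such a cleanup; the comment doesn't say which ReplicaSets were removed, so the zero-replica criterion below is an assumption, and the list should be reviewed before deleting anything:)

# list ReplicaSets scaled to zero (typically old Deployment revisions) and delete them
kubectl get replicasets -A -o json \
  | jq -r '.items[] | select(.spec.replicas == 0) | [.metadata.namespace, .metadata.name] | @tsv' \
  | while read -r ns name; do kubectl delete replicaset -n "$ns" "$name"; done
# lowering the Deployments' revisionHistoryLimit prevents these from piling up again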

jefimm avatar Aug 13 '22 09:08 jefimm

We have the same issue with the ReplicaSets failing; we currently have >6k ReplicaSets, and I am not sure deleting them is practical for us. By installing different versions of Falco I have narrowed it down to the following:

Works: Falco 0.32.0, Helm chart version 1.19.4
Fails: Falco 0.32.1, Helm chart version 2.0.0

jjettenCamunda avatar Aug 22 '22 15:08 jjettenCamunda

We have the same issue with 0.32.1: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting

ranjithmr avatar Aug 24 '22 07:08 ranjithmr

Same issue with 0.32.2 on an Azure cluster with 20 nodes.

yyvess avatar Aug 26 '22 14:08 yyvess

Likewise, I have the same experience as of 5.9.2022 with :latest and 0.32.2, on a GCP cluster with 25 nodes.

The workaround worked. The number of deleted old ReplicaSets was 2600; after that, all was fine. Deleting them manually in other environments is not an option!

I am using the --k8s-node filter option, but I suspect Falco does not apply it when reading these ReplicaSets; see:

# parsing these lines.. [libs]: K8s [ADDED, ReplicaSet, ...........
❯ k logs -n monitoring falco-bk4t5 -c falco -p |grep ReplicaSet | wc -l
2071

Some other pods managed to read only 506 ReplicaSets and then failed.

The only error it throws, and does not recover from, is:

Mon Sep  5 12:57:15 2022: [libs]: Error fetching K8s data: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.
Mon Sep  5 12:57:15 2022: Runtime error: SSL Socket handler (k8s_daemonset_handler_state): Connection closed.. Exiting.

Again the setup:

    spec:
      containers:
      - args:
        - /usr/bin/falco
        - --cri
        - /run/containerd/containerd.sock
        - --cri
        - /run/crio/crio.sock
        - -K
        - /var/run/secrets/kubernetes.io/serviceaccount/token
        - -k
        - https://$(KUBERNETES_SERVICE_HOST)
        - --k8s-node
        - $(FALCO_K8S_NODE_NAME)
        - -pk
        - -o
        - libs_logger.enabled=true
        - -o
        - libs_logger.severity=info
        env:
        - name: FALCO_BPF_PROBE
        - name: FALCO_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName

@jasondellaluce could this get some attention? This is a real blocker. Even though it works for now on older, even small, clusters, this will break any Falco deployment.

@mac-abdon could you please remove "400+ nodes" from the title and mention it somewhere else?

epcim avatar Sep 06 '22 13:09 epcim

@jasondellaluce could this get some attention? This is a real blocker. Even though it works for now on older, even small, clusters, this will break any Falco deployment.

@epcim this is high priority in the project's roadmap. We're still in the process of figuring out the optimal way to mitigate this.

jasondellaluce avatar Sep 07 '22 08:09 jasondellaluce

I discussed this issue on the Falco Community Call today, so I'm sharing some of the information from that call for others who may be impacted.

As a workaround, you can consider removing the "-k" command-line option. I was under the impression that this option was used to grab all the (non-audit) k8s.* metadata, but this is not the case. With or without this switch, Falco will pull a subset of information from the local kubelet API (perhaps based on the uppercase -K switch, but I'm unsure). Without the lowercase "-k" switch, Falco will not be able to retrieve some metadata that is only available from the cluster API, which I believe to be the following field types (from https://falco.org/docs/rules/supported-fields/): k8s.rc.*, k8s.svc.*, k8s.rs.*, and k8s.deployment.*.

Check your rules to determine whether you are using any of these, and if not, you can probably remove that switch as a workaround and get yourself back up and running until this is fixed.
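
(A hedged sketch of what the container invocation posted earlier in this thread would look like with that workaround applied; only -k and -K are dropped, everything else is left as posted:)

# same falco invocation as the DaemonSet args above, minus the -k/-K cluster API options;
# --k8s-node is kept here but presumably has no effect without -k, since it only scopes
# the cluster API queries
/usr/bin/falco \
  --cri /run/containerd/containerd.sock \
  --cri /run/crio/crio.sock \
  --k8s-node "$FALCO_K8S_NODE_NAME" \
  -pk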

IanRobertson-wpe avatar Sep 07 '22 21:09 IanRobertson-wpe

Hey @epcim

The workaround worked. The number of delete old ReplicaSets was 2600. Then all fine. Deleting them manually on other environments is not an option!

What was the exact status of those ReplicaSets (e.g., availableReplicas, fullyLabeledReplicas, readyReplicas, replicas, etc.)?

I guess the metadata of those ReplicaSets was not useful for Falco, so I'm trying to figure out whether we can use a fieldSelector in the query to filter out the unneeded resources.

Btw,

I am using --k8s-node filter option, but I suspect falco does not reflect that when reading these replicasets.. see

The --k8s-node filter option works only for Pods, since other resources are not bound to a node. So it can't help for ReplicaSets or DaemonSets.
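
(For illustration, reusing the curl setup from earlier in this thread: the node filter exists only for the pods endpoint, which matches the fieldSelector visible in the log lines above, while the other resource endpoints are fetched cluster-wide. A sketch, assuming $FALCO_K8S_NODE_NAME is set as in the DaemonSet spec:)

# pods can be filtered server-side by spec.nodeName -- this is what --k8s-node scopes
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT \
  "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/pods?fieldSelector=spec.nodeName=$FALCO_K8S_NODE_NAME" | head
# replicasets (and daemonsets) have no such per-node filter and are fetched cluster-wide
curl -s -H "Authorization: Bearer $TOKEN" --cacert $CACERT \
  "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/apis/apps/v1/replicasets?pretty=false" | head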

leogr avatar Sep 08 '22 13:09 leogr

As a workaround, you can consider removing the "-k " command-line option.

This workaround worked for us. Thanks!

Falco: v0.32.2
Kubernetes (EKS): v1.21.14

EigoOda avatar Sep 08 '22 14:09 EigoOda

Hey @IanRobertson-wpe

just a note: -K (uppercase) is used only if -k (lowercase) is present, so you can remove both -k and -K.

Btw, as a clarification: the Kubelet metadata is annotated on the container labels; Falco fetches that metadata directly from the container runtime, so no connection to the Kubelet is needed.

leogr avatar Sep 08 '22 16:09 leogr

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana avatar Dec 22 '22 09:12 poiana

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

poiana avatar Jan 21 '23 09:01 poiana

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community. /close

poiana avatar Feb 20 '23 09:02 poiana

@poiana: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

poiana avatar Feb 20 '23 09:02 poiana

Has this issue been solved? With huge, heavily loaded nodes I am still getting the error:

Defaulted container "falco" out of: falco, falcoctl-artifact-follow, falco-driver-loader (init), falcoctl-artifact-install (init)
Thu Mar 23 20:09:35 2023: Falco version: 0.34.1 (x86_64)
Thu Mar 23 20:09:35 2023: Falco initialized with configuration file: /etc/falco/falco.yaml
Thu Mar 23 20:09:35 2023: Loading rules from file /etc/falco/falco_rules.yaml
Thu Mar 23 20:09:36 2023: Loading rules from file /etc/falco/rules.d/falco-custom.yaml
Thu Mar 23 20:09:36 2023: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Thu Mar 23 20:09:36 2023: Starting health webserver with threadiness 16, listening on port 8765
Thu Mar 23 20:09:36 2023: Enabled event sources: syscall
Thu Mar 23 20:09:36 2023: Opening capture with Kernel module
k8s_handler (k8s_replicaset_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_replicaset_handler_state): Connection closed.
k8s_handler (k8s_deployment_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_deployment_handler_state, m_blocking_socket=1, m_watching=0, SSL Socket handler (k8s_deployment_handler_state): Connection closed.
k8s_handler (k8s_replicaset_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_replicaset_handler_state, m_blocking_socket=1, m_watching=0, K8s k8s_handler::receive_response(): invalid call (request not sent).
k8s_handler (k8s_deployment_handler_state::collect_data()[https://mydomain] an error occurred while receiving data from k8s_deployment_handler_state, m_blocking_socket=1, m_watching=0, K8s k8s_handler::receive_response(): invalid call (request not sent).

@jasondellaluce , @leogr : FYI.

VF-mbrauer avatar Mar 23 '23 20:03 VF-mbrauer

@VF-mbrauer I see. Does it also cause Falco to terminate? cc @alacuku

jasondellaluce avatar Mar 24 '23 08:03 jasondellaluce

@VF-mbrauer, does it recover at some point, or does it keep erroring? At startup, all the Falco instances connect to the API server and may be throttled by it; that's why you are seeing that error.

Anyway, we are working on a new K8s client for Falco that should solve the problems we have with the current implementation; please see falcosecurity/falco#2973.

alacuku avatar Mar 24 '23 08:03 alacuku

@jasondellaluce, yes, it first runs into an OOM and then restarts; after some time it gets stuck in "CrashLoopBackOff":

falco-7sgbh                     2/2     Running            0               88m
falco-fd9kz                     1/2     CrashLoopBackOff   8 (19s ago)     43m
falco-hwv6s                     1/2     CrashLoopBackOff   8 (88s ago)     56m
falco-jj5vj                     1/2     CrashLoopBackOff   9 (2m16s ago)   49m
falco-nj6mn                     1/2     CrashLoopBackOff   6 (2m17s ago)   53m
falco-q4247                     2/2     Running            0               88m
falco-q6hwl                     1/2     CrashLoopBackOff   8 (4s ago)      52m
falco-qmgmh                     1/2     CrashLoopBackOff   10 (50s ago)    57m
falco-s4v9n                     1/2     CrashLoopBackOff   8 (3m42s ago)   54m
falco-shn6m                     2/2     Running            6 (3m51s ago)   45m
falco-tbs94                     1/2     CrashLoopBackOff   8 (75s ago)     47m
falco-vvd49                     1/2     CrashLoopBackOff   7 (4m11s ago)   51m
falco-w5tc4                     1/2     CrashLoopBackOff   4 (25s ago)     34m

So to work around that I increased the memory limit a bit.

VF-mbrauer avatar Mar 24 '23 08:03 VF-mbrauer

There is some disconnect between the code and the config: for me the metadata_download mb setting does nothing, and reading the code it makes sense that it doesn't.

s7an-it avatar Apr 02 '23 00:04 s7an-it