403 Forbidden on PODs after host reboot

Open zimbres opened this issue 8 months ago • 11 comments

What happened?

When the host running Kubernetes reboots, the pods restart with the same names, which causes an error on the agent and appsec pods:

Error: api client register: api register (http://crowdsec-service.crowdsec:8080/) http 403 Forbidden: API error: user 'crowdsec-agent-2fkjp': user already exist

What did you expect to happen?

Some check for whether an existing pod is already registered.

How can we reproduce it (as minimally and precisely as possible)?

Standalone K3s, with the chart installed using a minimal configuration.

Anything else we need to know?

To fix the problem, after the host reboot, I have to delete the agent and appsec pods so they are recreated with new names and auto registration works

Crowdsec version

$ cscli version
version: v1.6.8-f209766e
Codename: alphaga
BuildDate: 2025-03-25_15:56:53
GoVersion: 1.24.1
Platform: docker
libre2: C++
User-Agent: crowdsec/v1.6.8-f209766e-docker
Constraint_parser: >= 1.0, <= 3.0
Constraint_scenario: >= 1.0, <= 3.0
Constraint_api: v1
Constraint_acquis: >= 1.0, < 2.0
Built-in optional components: cscli_setup, datasource_appsec, datasource_cloudwatch, datasource_docker, datasource_file, datasource_http, datasource_journalctl, datasource_k8s-audit, datasource_kafka, datasource_kinesis, datasource_loki, datasource_s3, datasource_syslog, datasource_victorialogs, datasource_wineventlog

OS version

# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.2 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
$ uname -a
Linux machine 6.8.0-56-generic #58-Ubuntu SMP PREEMPT_DYNAMIC Fri Feb 14 15:33:28 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Enabled collections and parsers

$ cscli hub list -o raw
Loaded: 136 parsers, 10 postoverflows, 755 scenarios, 8 contexts, 4 appsec-configs, 94 appsec-rules, 134 collections
name,status,version,description,type
crowdsecurity/cri-logs,enabled,0.1,CRI logging format parser,parsers
crowdsecurity/dateparse-enrich,enabled,0.2,,parsers
crowdsecurity/docker-logs,enabled,0.1,docker json logs parser,parsers
crowdsecurity/geoip-enrich,enabled,0.5,"Populate event with geoloc info : as, country, coords, source range.",parsers
crowdsecurity/sshd-logs,enabled,2.9,Parse openSSH logs,parsers
crowdsecurity/syslog-logs,enabled,0.8,,parsers
crowdsecurity/whitelists,enabled,0.3,Whitelist events from private ipv4 addresses,parsers
crowdsecurity/ssh-bf,enabled,0.3,Detect ssh bruteforce,scenarios
crowdsecurity/ssh-cve-2024-6387,enabled,0.2,Detect exploitation attempt of CVE-2024-6387,scenarios
crowdsecurity/ssh-slow-bf,enabled,0.4,Detect slow ssh bruteforce,scenarios
crowdsecurity/bf_base,enabled,0.1,,contexts
crowdsecurity/linux,enabled,0.2,core linux support : syslog+geoip+ssh,collections
crowdsecurity/sshd,enabled,0.5,sshd support : parser and brute-force detection,collections

Acquisition config

# On Linux:
$ cat /etc/crowdsec/acquis.yaml /etc/crowdsec/acquis.d/*
filenames:
  - /var/log/nginx/*.log
  - ./tests/nginx/nginx.log
#this is not a syslog log, indicate which kind of logs it is
labels:
  type: nginx
---
filenames:
  - /var/log/auth.log
  - /var/log/syslog
labels:
  type: syslog
---
filename: /var/log/apache2/*.log
labels:
  type: apache2
cat: read error: Is a directory
cat: read error: Is a directory
cat: read error: Is a directory
common:
  daemonize: false
  log_media: stdout
  log_level: info
  log_dir: /var/log/
config_paths:
  config_dir: /etc/crowdsec/
  data_dir: /var/lib/crowdsec/data/
  simulation_path: /etc/crowdsec/simulation.yaml
  hub_dir: /etc/crowdsec/hub/
  index_path: /etc/crowdsec/hub/.index.json
  notification_dir: /etc/crowdsec/notifications/
  plugin_dir: /usr/local/lib/crowdsec/plugins/
crowdsec_service:
  acquisition_path: /etc/crowdsec/acquis.yaml
  acquisition_dir: /etc/crowdsec/acquis.d
  parser_routines: 1
plugin_config:
  user: nobody
  group: nobody
cscli:
  output: human
db_config:
  log_level: info
  type: sqlite
  db_path: /var/lib/crowdsec/data/crowdsec.db
  flush:
    max_items: 5000
    max_age: 7d
  use_wal: false
api:
  client:
    insecure_skip_verify: false
    credentials_path: /etc/crowdsec/local_api_credentials.yaml
  server:
    log_level: info
    listen_uri: 0.0.0.0:8080
    profiles_path: /etc/crowdsec/profiles.yaml
    trusted_ips: # IP ranges, or IPs which can have admin API access
      - 127.0.0.1
      - ::1
    online_client: # Central API credentials (to push signals and receive bad IPs)
      credentials_path: /etc/crowdsec//online_api_credentials.yaml
    enable: true
prometheus:
  enabled: true
  level: full
  listen_addr: 0.0.0.0
  listen_port: 6060
api:
  server:
    auto_registration: # Activate if not using TLS for authentication
      enabled: true
      token: "${REGISTRATION_TOKEN}" # /!\ Do not modify this variable (auto-generated and handled by the chart)
      allowed_ranges: # /!\ Make sure to adapt to the pod IP ranges used by your cluster
        - "127.0.0.1/32"
        - "192.168.0.0/16"
        - "10.0.0.0/8"
        - "172.16.0.0/12"
# db_config:
#   type: postgresql
#   user: crowdsec
#   password: ${DB_PASSWORD}
#   db_name: crowdsec
#   host: 192.168.0.2
#   port: 5432
#   sslmode: require
cat: read error: Is a directory
share_manual_decisions: true
share_tainted: true
share_custom: true
console_management: false
share_context: true
cat: read error: Is a directory
common:
  daemonize: true
  log_media: stdout
  log_level: info
config_paths:
  config_dir: ./config
  data_dir: ./data/
  notification_dir: ./config/notifications/
  plugin_dir: ./plugins/
  #simulation_path: /etc/crowdsec/config/simulation.yaml
  #hub_dir: /etc/crowdsec/hub/
  #index_path: ./config/hub/.index.json
crowdsec_service:
  acquisition_path: ./config/acquis.yaml
  parser_routines: 1
plugin_config:
  user: $USER # plugin process would be ran on behalf of this user
  group: $USER # plugin process would be ran on behalf of this group
cscli:
  output: human
db_config:
  type: sqlite
  db_path: ./data/crowdsec.db
  user: root
  password: crowdsec
  db_name: crowdsec
  host: "172.17.0.2"
  port: 3306
  flush:
    #max_items: 10000
    #max_age: 168h
api:
  client:
    credentials_path: ./config/local_api_credentials.yaml
  server:
    console_path: ./config/console.yaml
    #insecure_skip_verify: true
    listen_uri: 127.0.0.1:8081
    profiles_path: ./config/profiles.yaml
    tls:
      #cert_file: ./cert.pem
      #key_file: ./key.pem
    online_client: # Central API
      credentials_path: ./config/online_api_credentials.yaml
prometheus:
  enabled: true
  level: full
cat: read error: Is a directory
url: http://localhost:8080
login: crowdsec-lapi-67f9c4fc86-pc46p
password: gCk99Oq2Bah8K2w9LrXkrV3s50DslrWcP1vqCcAwqqsiVqvfJ6P7rbUsXLhgKRhs
cat: read error: Is a directory
url: https://api.crowdsec.net/
login: 2116ee8c56fe4ff388d770e1cae9fde67CBBDKoyI7bTH8MW
password: tMxFPduJxAav5rKJearvDnPNuBwIS8Fj9C9HoE4J7UVKMgjYg9lKjkdVVbKk3Pbl
cat: read error: Is a directory
cat: read error: Is a directory
cat: read error: Is a directory
name: default_ip_remediation
#debug: true
filters:
  - Alert.Remediation == true && Alert.GetScope() == "Ip"
decisions:
  - type: ban
    duration: 4h
#duration_expr: Sprintf('%dh', (GetDecisionsCount(Alert.GetValue()) + 1) * 4)
# notifications:
#   - slack_default # Set the webhook in /etc/crowdsec/notifications/slack.yaml before enabling this.
#   - splunk_default # Set the splunk url and token in /etc/crowdsec/notifications/splunk.yaml before enabling this.
#   - http_default # Set the required http parameters in /etc/crowdsec/notifications/http.yaml before enabling this.
#   - email_default # Set the required email parameters in /etc/crowdsec/notifications/email.yaml before enabling this.
on_success: break
---
name: default_range_remediation
#debug: true
filters:
  - Alert.Remediation == true && Alert.GetScope() == "Range"
decisions:
  - type: ban
    duration: 4h
#duration_expr: Sprintf('%dh', (GetDecisionsCount(Alert.GetValue()) + 1) * 4)
# notifications:
#   - slack_default # Set the webhook in /etc/crowdsec/notifications/slack.yaml before enabling this.
#   - splunk_default # Set the splunk url and token in /etc/crowdsec/notifications/splunk.yaml before enabling this.
#   - http_default # Set the required http parameters in /etc/crowdsec/notifications/http.yaml before enabling this.
#   - email_default # Set the required email parameters in /etc/crowdsec/notifications/email.yaml before enabling this.
on_success: break
cat: read error: Is a directory
simulation: false
# exclusions:
#   - crowdsecurity/ssh-bf
common:
  daemonize: false
  log_media: stdout
  log_level: info
  log_dir: /var/log/
config_paths:
  config_dir: /etc/crowdsec/
  data_dir: /var/lib/crowdsec/data
  #simulation_path: /etc/crowdsec/config/simulation.yaml
  #hub_dir: /etc/crowdsec/hub/
  #index_path: ./config/hub/.index.json
crowdsec_service:
  #acquisition_path: ./config/acquis.yaml
  parser_routines: 1
cscli:
  output: human
db_config:
  type: sqlite
  db_path: /var/lib/crowdsec/data/crowdsec.db
  user: crowdsec
  #log_level: info
  password: crowdsec
  db_name: crowdsec
  host: "127.0.0.1"
  port: 3306
api:
  client:
    insecure_skip_verify: false # default true
    credentials_path: /etc/crowdsec/local_api_credentials.yaml
  server:
    #log_level: info
    listen_uri: 127.0.0.1:8080
    profiles_path: /etc/crowdsec/profiles.yaml
    online_client: # Central API
      credentials_path: /etc/crowdsec/online_api_credentials.yaml
prometheus:
  enabled: true
  level: full

Config show

$ cscli config show
Global:
   - Configuration Folder   : /etc/crowdsec
   - Data Folder            : /var/lib/crowdsec/data
   - Hub Folder             : /etc/crowdsec/hub
   - Simulation File        : /etc/crowdsec/simulation.yaml
   - Log Folder             : /var/log
   - Log level              : info
   - Log Media              : stdout
Crowdsec:
  - Acquisition File        : /etc/crowdsec/acquis.yaml
  - Parsers routines        : 1
  - Acquisition Folder      : /etc/crowdsec/acquis.d
cscli:
  - Output                  : human
  - Hub Branch              :
API Client:
  - URL                     : http://localhost:8080/
  - Login                   : crowdsec-lapi-67f9c4fc86-pc46p
  - Credentials File        : /etc/crowdsec/local_api_credentials.yaml
Local API Server:
  - Listen URL              : 0.0.0.0:8080
  - Listen Socket           :
  - Profile File            : /etc/crowdsec/profiles.yaml

  - Trusted IPs:
      - 127.0.0.1
      - ::1
  - Database:
      - Type                : sqlite
      - Path                : /var/lib/crowdsec/data/crowdsec.db
      - Flush age           : 7d
      - Flush size          : 5000

Prometheus metrics

$ cscli metrics
╭────────────────────────────────────────────────────╮
│ Local API Alerts                                   │
├────────────────────────────────────────────┬───────┤
│ Reason                                     │ Count │
├────────────────────────────────────────────┼───────┤
│ crowdsecurity/http-sensitive-files         │ 12    │
│ crowdsecurity/vpatch-symfony-profiler      │ 1     │
│ crowdsecurity/CVE-2022-41082               │ 4     │
│ crowdsecurity/vpatch-CVE-2017-9841         │ 27    │
│ crowdsecurity/vpatch-CVE-2022-41082        │ 5     │
│ LePresidente/http-generic-401-bf           │ 2     │
│ crowdsecurity/appsec-vpatch                │ 8     │
│ crowdsecurity/http-cve-probing             │ 2     │
│ crowdsecurity/thinkphp-cve-2018-20062      │ 3     │
│ crowdsecurity/vpatch-CVE-2021-3129         │ 4     │
│ crowdsecurity/vpatch-CVE-2023-28121        │ 1     │
│ crowdsecurity/vpatch-CVE-2024-4577         │ 7     │
│ crowdsecurity/vpatch-git-config            │ 38    │
│ LePresidente/http-generic-403-bf           │ 8     │
│ crowdsecurity/http-admin-interface-probing │ 2     │
│ crowdsecurity/http-probing                 │ 15    │
│ crowdsecurity/vpatch-env-access            │ 220   │
│ crowdsecurity/CVE-2017-9841                │ 20    │
│ crowdsecurity/CVE-2019-18935               │ 1     │
│ crowdsecurity/http-cve-2021-41773          │ 3     │
╰────────────────────────────────────────────┴───────╯
╭────────────────────────────────────────────────────────────────╮
│ Local API Decisions                                            │
├────────────────────────────────────┬──────────┬────────┬───────┤
│ Reason                             │ Origin   │ Action │ Count │
├────────────────────────────────────┼──────────┼────────┼───────┤
│ http:bruteforce                    │ CAPI     │ ban    │ 545   │
│ http:crawl                         │ CAPI     │ ban    │ 6     │
│ http:scan                          │ CAPI     │ ban    │ 14816 │
│ ssh:bruteforce                     │ CAPI     │ ban    │ 7441  │
│ crowdsecurity/appsec-vpatch        │ crowdsec │ ban    │ 1     │
│ crowdsecurity/http-probing         │ crowdsec │ ban    │ 1     │
│ crowdsecurity/http-sensitive-files │ crowdsec │ ban    │ 1     │
│ firehol_cruzit_web_attacks         │ lists    │ ban    │ 13245 │
│ http:exploit                       │ CAPI     │ ban    │ 254   │
│ ssh:exploit                        │ CAPI     │ ban    │ 827   │
│ firehol_botscout_7d                │ lists    │ ban    │ 5528  │
│ firehol_cybercrime                 │ lists    │ ban    │ 1293  │
╰────────────────────────────────────┴──────────┴────────┴───────╯
╭──────────────────────────────────────╮
│ Local API Metrics                    │
├──────────────────────┬────────┬──────┤
│ Route                │ Method │ Hits │
├──────────────────────┼────────┼──────┤
│ /v1/allowlists       │ GET    │ 5    │
│ /v1/decisions/stream │ GET    │ 20   │
│ /v1/decisions/stream │ HEAD   │ 2    │
│ /v1/heartbeat        │ GET    │ 9    │
│ /v1/usage-metrics    │ POST   │ 2    │
│ /v1/watchers         │ POST   │ 12   │
│ /v1/watchers/login   │ POST   │ 2    │
╰──────────────────────┴────────┴──────╯
╭───────────────────────────────────────────────────────────────────╮
│ Local API Bouncers Metrics                                        │
├────────────────────────────┬──────────────────────┬────────┬──────┤
│ Bouncer                    │ Route                │ Method │ Hits │
├────────────────────────────┼──────────────────────┼────────┼──────┤
│ [email protected] │ /v1/decisions/stream │ GET    │ 10   │
│ Traefik                    │ /v1/decisions/stream │ HEAD   │ 2    │
│ [email protected]         │ /v1/decisions/stream │ GET    │ 10   │
╰────────────────────────────┴──────────────────────┴────────┴──────╯
╭──────────────────────────────────────────────────────────────────╮
│ Local API Machines Metrics                                       │
├─────────────────────────────────┬────────────────┬────────┬──────┤
│ Machine                         │ Route          │ Method │ Hits │
├─────────────────────────────────┼────────────────┼────────┼──────┤
│ crowdsec-agent-h27f2            │ /v1/heartbeat  │ GET    │ 5    │
│ crowdsec-appsec-f5b47dd44-kkknc │ /v1/allowlists │ GET    │ 5    │
│ crowdsec-appsec-f5b47dd44-kkknc │ /v1/heartbeat  │ GET    │ 4    │

Related custom configs versions (if applicable) : notification plugins, custom scenarios, parsers etc.

zimbres avatar Mar 29 '25 18:03 zimbres

@zimbres: Thanks for opening an issue, it is currently awaiting triage.

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the BirthdayResearch/oss-governance-bot repository.

github-actions[bot] avatar Mar 29 '25 18:03 github-actions[bot]

Forwarding to the helm chart repository, since that is the best place to get a fix: the problem is most likely in the sidecar containers.

LaurenceJJones avatar Mar 31 '25 13:03 LaurenceJJones

@zimbres: Thanks for opening an issue, it is currently awaiting triage.

If you haven't already, please provide the following information:

  • kind : bug, enhancement or documentation
  • area : agent, appsec, configuration, cscli, local-api

In the meantime, you can:

  1. Check Crowdsec Documentation to see if your issue can be self resolved.
  2. You can also join our Discord.
  3. Check Releases to make sure your agent is on the latest version.

I am a bot created to help the crowdsecurity developers manage community feedback and contributions. You can check out my manifest file to understand my behavior and what I can do. If you want to use this for your project, you can check out the forked project rr404/oss-governance-bot repository.

github-actions[bot] avatar Mar 31 '25 13:03 github-actions[bot]

As a temporary solution, I wrote a small script that restarts the pods after a host restart.

import os
import time

from kubernetes import client, config, watch

# Read the settings once so the startup banner and main() agree on the defaults.
NAMESPACE = os.getenv("NAMESPACE", "crowdsec")
TARGET_MESSAGE = os.getenv("TARGET_MESSAGE", "error condition")
# Name prefixes of the pods that fail to re-register after a host reboot.
POD_PREFIXES = ["crowdsec-agent-", "crowdsec-appsec-"]


def log_message(message):
    # Flush immediately so the message shows up in the pod log without delay.
    print(message, flush=True)


def get_pods(v1, namespace, pod_prefix):
    # List only the target namespace: the logs are read from that same
    # namespace, so pods found elsewhere could never be matched anyway.
    pods = v1.list_namespaced_pod(namespace).items
    return [pod for pod in pods if pod.metadata.name.startswith(pod_prefix)]


def watch_logs(v1, pod_name, namespace, init_container_name, target_message):
    # Follow the init container log and delete the pod once the target
    # message appears. Returns True if the pod was deleted.
    w = watch.Watch()
    try:
        for line in w.stream(v1.read_namespaced_pod_log, name=pod_name,
                             namespace=namespace, container=init_container_name):
            if target_message in line:
                log_message(f"Target message found in {pod_name}, deleting pod...")
                v1.delete_namespaced_pod(name=pod_name, namespace=namespace)
                return True
    except Exception as e:
        log_message(f"Error watching logs: {e}")
    finally:
        w.stop()
    return False


def main():
    try:
        config.load_incluster_config()
    except config.ConfigException:
        # Not running inside a cluster: fall back to the local kubeconfig.
        config.load_kube_config()
    v1 = client.CoreV1Api()

    while True:
        for pod_prefix in POD_PREFIXES:
            for pod in get_pods(v1, NAMESPACE, pod_prefix):
                # init_containers is None for pods without init containers.
                for container in pod.spec.init_containers or []:
                    if watch_logs(v1, pod.metadata.name, NAMESPACE,
                                  container.name, TARGET_MESSAGE):
                        break
        time.sleep(60)


if __name__ == "__main__":
    log_message(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    log_message("Starting log-watcher...")
    log_message(f"Namespace: {NAMESPACE}")
    log_message(f"Target message: {TARGET_MESSAGE}")
    log_message("Watching for logs...")
    main()

zimbres avatar Apr 02 '25 10:04 zimbres

I tackle this with the Descheduler, using the policy below together with the DefaultEvictor set to evictDaemonSetPods: true

- name: RemovePodsHavingTooManyRestarts
  args:
    podRestartThreshold: 25
    includingInitContainers: true
    states:
      - "CrashLoopBackOff"
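
To make the effect of these arguments concrete, here is a minimal Python sketch of the selection rule as I understand it from the policy above. It is a simplified model, not the Descheduler's actual code: the `ContainerStatus`, `PodStatus` and `should_evict` names are hypothetical, and the exact comparison (restarts summed over containers, `>=` against the threshold) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class ContainerStatus:
    """Hypothetical, trimmed-down view of a container status."""
    restart_count: int
    waiting_reason: Optional[str] = None  # e.g. "CrashLoopBackOff"


@dataclass
class PodStatus:
    """Hypothetical, trimmed-down view of a pod status."""
    containers: List[ContainerStatus] = field(default_factory=list)
    init_containers: List[ContainerStatus] = field(default_factory=list)


def should_evict(
    pod: PodStatus,
    restart_threshold: int = 25,
    including_init_containers: bool = True,
    states: Sequence[str] = ("CrashLoopBackOff",),
) -> bool:
    """Assumed rule: enough restarts AND at least one container in a listed state."""
    statuses = list(pod.containers)
    if including_init_containers:
        # Without this, restarts of a failing init container are ignored.
        statuses.extend(pod.init_containers)
    total_restarts = sum(s.restart_count for s in statuses)
    in_listed_state = any(s.waiting_reason in states for s in statuses)
    return total_restarts >= restart_threshold and in_listed_state


# A pod whose init container keeps crashing, as in this issue.
stuck = PodStatus(init_containers=[ContainerStatus(30, "CrashLoopBackOff")])
print(should_evict(stuck))
print(should_evict(stuck, including_init_containers=False))
```

In this model the stuck pod is only selected when init containers are counted, which is presumably why `includingInitContainers: true` matters here: the script earlier in this thread also watches the init containers for the failure.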

mateuszdrab avatar Apr 13 '25 10:04 mateuszdrab

Your solution looks way more polished. I wasn't aware of this option.

zimbres avatar Apr 13 '25 19:04 zimbres

Hello,

CrowdSec 1.6.9 will introduce a feature allowing Log Processors to unregister when shut down. This should solve the issue for regular reboots (it won't cover power cuts, though). I will keep the issue open so we can confirm that it is fixed after release.
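
The difference between a regular reboot and a power cut can be illustrated with a toy model in Python. This is only a sketch of the reasoning, not CrowdSec code: `ToyLapi` and `restart_with_same_name` are hypothetical names, and the only behavior taken from this thread is that registering an already known name is rejected with a 403.

```python
class ToyLapi:
    """Toy stand-in for a registration API that is keyed on the machine name."""

    def __init__(self):
        self.machines = set()

    def register(self, name):
        if name in self.machines:
            # Mirrors the "user already exist" rejection reported in this issue.
            raise PermissionError(f"user '{name}': user already exist")
        self.machines.add(name)

    def unregister(self, name):
        self.machines.discard(name)


def restart_with_same_name(lapi, name, clean_shutdown):
    """Simulate a pod that stops and comes back under the same name."""
    if clean_shutdown:
        # Regular reboot: the pod gets the chance to unregister first.
        lapi.unregister(name)
    # Power cut: nothing runs at shutdown, so the old entry stays behind.
    try:
        lapi.register(name)
        return "registered"
    except PermissionError:
        return "403 Forbidden"


lapi = ToyLapi()
lapi.register("crowdsec-agent-2fkjp")
print(restart_with_same_name(lapi, "crowdsec-agent-2fkjp", clean_shutdown=True))
print(restart_with_same_name(lapi, "crowdsec-agent-2fkjp", clean_shutdown=False))
```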

One side question: Did you do any specific configuration so the pods have consistent names?

buixor avatar Apr 15 '25 08:04 buixor

No, in my case I did nothing specific to get consistent names.

zimbres avatar Apr 18 '25 11:04 zimbres

I am facing the same issue, and power cuts are quite common here. I'd like to propose the following:

  • Annotate the pod via the DaemonSet with an empty "internal/config" annotation
  • In the init container, use kubectl to read the value. If it is empty, perform the registration, then update the pod annotation with the encoded config
  • If it is not empty, decode it and store it in a file

Would this be feasible?
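
The decision logic of the steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the linked implementation: the annotations are modelled as a plain dict instead of being read with kubectl, base64 is my assumption for the "encoded config", and `init_step` and `register` are hypothetical names.

```python
import base64

ANNOTATION = "internal/config"  # annotation key named in the proposal


def encode_config(text):
    # Assumption: the "encoded config" is base64, so it fits in an annotation value.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_config(value):
    return base64.b64decode(value).decode("utf-8")


def init_step(annotations, register):
    """Return (credentials, updated_annotations) for one init container run.

    `annotations` stands in for the pod annotations; `register` is a
    hypothetical callable that performs the registration and returns the
    credentials file content.
    """
    stored = annotations.get(ANNOTATION, "")
    if stored:
        # Already registered on a previous start: reuse the stored credentials.
        return decode_config(stored), annotations
    credentials = register()
    updated = dict(annotations)
    updated[ANNOTATION] = encode_config(credentials)
    return credentials, updated


calls = []


def fake_register():
    # Stand-in for the real registration, counting how often it is called.
    calls.append(1)
    return "login: example\npassword: example\n"


first, annotations = init_step({ANNOTATION: ""}, fake_register)
second, annotations = init_step(annotations, fake_register)
print(first == second, len(calls))
```

The point of the design is that the second start never registers again, so it does not matter whether the pod had a chance to unregister before it went down.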

Edit: Oh, and the mentioned scenario about deregistration won't work if the agent shutting down is on the same node as lapi, and lapi gets shut down first.

krezovic avatar May 04 '25 13:05 krezovic

I have provided an (overkill) implementation that has been tested by simply rebooting the node. See the linked PR.

krezovic avatar May 05 '25 18:05 krezovic

> Hello,
>
> CrowdSec 1.6.9 will introduce a feature allowing Log Processors to unregister when shut down. This should solve the issue for regular reboots (it won't cover power cuts, though). I will keep the issue open so we can confirm that it is fixed after release.
>
> One side question: Did you do any specific configuration so the pods have consistent names?

Update

It doesn't work, all my pods fail with:

... level=info msg="max attempts reached for status code 401"
... level=fatal msg="crowdsec init: while initializing LAPIClient: authenticate watcher (crowdsec-agent-...): API error: ent: machine not found"

Original

Hey, I tried applying the solution you proposed, with no luck. This is my values.yaml for the agents:

agent:
  tolerations:
  - key: "role"
    operator: "Equal"
    value: "cp"
    effect: "NoSchedule"
  # Specify each pod whose logs you want to process
  acquisition:
  # The namespace where the pod is located
  - namespace: network
    # The pod name
    podName: traefik-*
    # as in crowdsec configuration, we need to specify the program name to find a matching parser
    program: traefik
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 512Mi
  env:
  - name: COLLECTIONS
    value: "crowdsecurity/traefik"
  - name: UNREGISTER_ON_EXIT
    value: "true"

I took the "unregister on exit" key from a PR I found that implements the functionality, and from the Docker documentation. It did not fix the problem, but something did change. One of my pods failed with:

Error: api client register: api register (http://crowdsec-service.monitoring:8080/) http 403 Forbidden: API error: user 'crowdsec-agent-....: user already exist

However, another one:

nc: bad address 'crowdsec-service.monitoring'
waiting for lapi to start
waiting for lapi to start
Error: api client register: api register (http://crowdsec-service.monitoring:8080/) http 401 Unauthorized: API error: invalid token for auto registration

I have a small k3s cluster and tried the shutdown in two different ways just in case:

  • stopping the k3s service and then powering off the machines
  • running the killall script and then powering off

Both failed with the same result when turning the cluster on again.

Can you help me?

Note: I'm using the helm chart revision 0.19.4.

msd117c avatar Jul 05 '25 12:07 msd117c