
Spotty kubernetes event collection

Open andor44 opened this issue 5 years ago • 9 comments

Output of the info page (if this is a bug)

» k exec -it datadog-5lnhk agent status
Getting the status from the agent.

===============
Agent (v6.10.0)
===============

  Status date: 2019-03-25 13:53:10.729646 UTC
  Pid: 380
  Python Version: 2.7.15
  Logs:
  Check Runners: 4
  Log Level: WARNING

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -71µs
    System UTC time: 2019-03-25 13:53:10.729646 UTC

  Host Info
  =========
    bootTime: 2019-03-20 10:55:24.000000 UTC
    kernelVersion: 4.19.25-coreos
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: buster/sid
    procs: 73
    uptime: 58s
    virtualizationRole: guest
    virtualizationSystem: kvm

  Hostnames
  =========
    host_aliases: [redacted]
    hostname: redacted
    socket-fqdn: datadog-5lnhk
    socket-hostname: datadog-5lnhk
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Total Runs: 29,508
      Metric Samples: Last Run: 6, Total: 177,042
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    disk (2.1.0)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 244, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 123ms


    docker
    ------
      Instance ID: docker [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 317, Total: 1 M
      Events: Last Run: 0, Total: 1,916
      Service Checks: Last Run: 1, Total: 29,507
      Average Execution Time : 48ms


    file_handle
    -----------
      Instance ID: file_handle [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 5, Total: 147,535
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    io
    --
      Instance ID: io [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 130, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    kubelet (2.4.0)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 437, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 4, Total: 118,025
      Average Execution Time : 427ms


    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 377
      Service Checks: Last Run: 5, Total: 225
      Average Execution Time : 100ms


    load
    ----
      Instance ID: load [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 6, Total: 177,042
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    memory
    ------
      Instance ID: memory [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 17, Total: 501,619
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    network (1.9.0)
    ---------------
      Instance ID: network:2a218184ebe03606 [OK]
      Total Runs: 29,508
      Metric Samples: Last Run: 105, Total: 1 M
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 345ms


    ntp
    ---
      Instance ID: ntp:b4579e02d1981c12 [OK]
      Total Runs: 29,507
      Metric Samples: Last Run: 1, Total: 29,507
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 29,507
      Average Execution Time : 0s


    uptime
    ------
      Instance ID: uptime [OK]
      Total Runs: 29,508
      Metric Samples: Last Run: 1, Total: 29,508
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 29,508
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 3,718
    Metadata: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 62,734
    TimeseriesV1: 29,508

  API Keys status
  ===============
    API key ending with e5d88: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - redacted

==========
Logs Agent
==========

  docker
  ------
    Type: docker
    Status: OK
    Inputs: 86865f94527467d22142d81c3fd535d4ba3c824aad2df3f7d1c12ede6b131cb5

=========
Aggregator
=========
  Checks Metric Sample: 39.2 M
  Dogstatsd Metric Sample: 73,768
  Event: 2,294
  Events Flushed: 2,294
  Number Of Flushes: 29,508
  Series Flushed: 33.7 M
  Service Check: 502,328
  Service Checks Flushed: 531,835

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 73,768
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Packet Reading Errors: 0
  Udp Packets: 73,769
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

Describe what happened: Kubernetes event collection breaks shortly after an agent acquires the leader lock. If I kill the agent that holds the lock, another one acquires it and collects events for a short while (minutes), but then it also stops reporting k8s events. I can seemingly repeat this any number of times.
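When this happens, the lock itself can still be inspected to see which agent holds it and whether it is being renewed. A minimal sketch, assuming the chart's default leader-election ConfigMap name (datadog-leader-election):

» k get configmap datadog-leader-election -o yaml | grep control-plane.alpha.kubernetes.io/leader

The annotation value contains holderIdentity and renewTime, so you can tell whether the lock is still being renewed even while events have stopped flowing.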

Weirdly enough, we have another cluster where event collection seems to work fine.

You can see in the agent output above that the kubernetes_apiserver check collected 377 events in total and then nothing on later runs.
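A quick way to watch the stall is to re-run the status command and check the kubernetes_apiserver counters (same pod as above; -it dropped so the output pipes cleanly):

» k exec datadog-5lnhk agent status | grep -A 6 kubernetes_apiserver

Total Runs keeps climbing while the Events total stays stuck at 377.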

Describe what you expected: K8s event collection to work reliably.

Steps to reproduce the issue: Datadog deployed with the official helm chart (values used: here) on k8s 1.12.
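For completeness, these are the chart values driving event collection, sketched for the Helm 2 era stable/datadog chart (value names assumed from its values.yaml, API key redacted):

» helm install stable/datadog \
    --name datadog \
    --set datadog.apiKey=redacted \
    --set datadog.collectEvents=true \
    --set datadog.leaderElection=true

Note that collectEvents only takes effect on the agent that currently holds the leader lock, which is why a single pod at a time reports k8s events.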

Additional environment details (Operating System, Cloud provider, etc.): CoreOS Container Linux (latest stable), on-premises.

andor44 · Mar 25 '19 15:03

Thanks a lot! I'll check them out!

StanGirard · Feb 21 '24 07:02

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.

github-actions[bot] · May 21 '24 08:05