[ASCII-1023] Render cluster agent using the status component

Open GustavoCaso opened this issue 5 months ago • 3 comments

What does this PR do?

Moves the cluster agent status to the status component. To do that, I had to create several status providers for each cluster agent status section.
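
As a rough illustration of the "one provider per section" idea, here is a minimal, self-contained sketch. The `Provider` interface, the `leaderElection` type, the `renderText` helper, and the `leaderelection` JSON key are all hypothetical simplifications for this example; the real status component interface in the repository may differ.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"strings"
)

// Provider is a simplified, hypothetical stand-in for a status provider.
// It is NOT the agent's actual interface, only an assumption for this sketch.
type Provider interface {
	Name() string                            // title shown in the section banner
	JSON(stats map[string]interface{}) error // adds this section's data to the JSON payload
	Text(w io.Writer) error                  // writes the section body as plain text
}

// leaderElection is a hypothetical provider for the "Leader Election" section.
type leaderElection struct {
	status      string
	leaderName  string
	transitions int
}

func (p leaderElection) Name() string { return "Leader Election" }

// JSON stores the section data under an assumed key name.
func (p leaderElection) JSON(stats map[string]interface{}) error {
	stats["leaderelection"] = map[string]interface{}{
		"status":      p.status,
		"leaderName":  p.leaderName,
		"transitions": p.transitions,
	}
	return nil
}

// Text writes the section body, mimicking the layout of the output in this PR.
func (p leaderElection) Text(w io.Writer) error {
	_, err := fmt.Fprintf(w,
		"  Leader Election Status:  %s\n  Leader Name is: %s\n  Number of leader transitions: %d transitions\n",
		p.status, p.leaderName, p.transitions)
	return err
}

// renderText writes each provider as a banner (the title framed by '=' lines
// of the same length) followed by the provider's text body and a blank line.
func renderText(providers []Provider) (string, error) {
	var buf bytes.Buffer
	for _, p := range providers {
		bar := strings.Repeat("=", len(p.Name()))
		fmt.Fprintf(&buf, "%s\n%s\n%s\n", bar, p.Name(), bar)
		if err := p.Text(&buf); err != nil {
			return "", err
		}
		buf.WriteString("\n")
	}
	return buf.String(), nil
}

func main() {
	out, err := renderText([]Provider{leaderElection{
		status:      "Running",
		leaderName:  "datadog-agent-cluster-agent-5b55c4cc87-9sk7b",
		transitions: 3,
	}})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Keeping each section behind its own small type means a command can pick which sections to render simply by choosing which providers it passes in.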

I tested these changes locally, following this guide, on my local kind cluster.

Since I'm migrating one status command at a time, I need to keep the other templates for now. The existing templates at pkg/status/render/templates/* cannot be deleted yet, because the other status commands must keep working in isolation from the agent status command. I will remove them once I have migrated all status commands 🔥

When running `agent status` inside a Cluster Agent pod, I get this text output:
============================================
Cluster Agent (v7.51.0-rc.1+git.546.5e087ba)
============================================
  Status date: 2024-02-14 17:28:37.456 UTC (1707931717456)
  Agent start: 2024-02-14 17:28:37.396 UTC (1707931717396)
  Pid: 1
  Go Version: go1.21.5
  Python Version: n/a
  Build arch: arm64
  Agent flavor: cluster_agent
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

========
Hostname
========

  hostname: agent-cluster-control-plane-agent-cluster
  socket-fqdn: datadog-agent-cluster-agent-5b55c4cc87-9sk7b
  socket-hostname: datadog-agent-cluster-agent-5b55c4cc87-9sk7b
  hostname provider: container
  unused hostname providers:
    'hostname' configuration/environment: hostname is empty
    'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled
    aws: not retrieving hostname from AWS: the host is not an ECS instance and other providers already retrieve non-default hostnames
    azure: azure_hostname_style is set to 'os'
    fargate: agent is not runnning on Fargate
    fqdn: FQDN hostname is not usable
    gce: unable to retrieve hostname from GCE: GCE metadata API error: Get "http://169.254.169.254/computeMetadata/v1/instance/hostname": dial tcp 169.254.169.254:80: connect: connection refused
    os: OS hostname is not usable

=========
Collector
=========


  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 9
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 1, Total: 11
      Service Checks: Last Run: 3, Total: 21
      Average Execution Time : 1.465s
      Last Execution Date : 2024-02-14 17:30:40 UTC (1707931840000)
      Last Successful Execution Date : 2024-02-14 17:30:40 UTC (1707931840000)


    kubernetes_state_core
    ---------------------
      Instance ID: kubernetes_state_core:f0ece86b2bc4e82e [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_state_core.yaml.default
      Total Runs: 9
      Metric Samples: Last Run: 389, Total: 2,723
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 3, Total: 21
      Average Execution Time : 3ms
      Last Execution Date : 2024-02-14 17:30:45 UTC (1707931845000)
      Last Successful Execution Date : 2024-02-14 17:30:45 UTC (1707931845000)


    orchestrator
    ------------
      Instance ID: orchestrator:c640d4e943da6c1d [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/conf.yaml.default
      Total Runs: 14
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 5ms
      Last Execution Date : 2024-02-14 17:30:51 UTC (1707931851000)
      Last Successful Execution Date : 2024-02-14 17:30:51 UTC (1707931851000)



  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 9
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 1, Total: 11
      Service Checks: Last Run: 3, Total: 21
      Average Execution Time : 1.465s
      Last Execution Date : 2024-02-14 17:30:40 UTC (1707931840000)
      Last Successful Execution Date : 2024-02-14 17:30:40 UTC (1707931840000)


    kubernetes_state_core
    ---------------------
      Instance ID: kubernetes_state_core:f0ece86b2bc4e82e [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_state_core.yaml.default
      Total Runs: 9
      Metric Samples: Last Run: 389, Total: 2,723
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 3, Total: 21
      Average Execution Time : 3ms
      Last Execution Date : 2024-02-14 17:30:45 UTC (1707931845000)
      Last Successful Execution Date : 2024-02-14 17:30:45 UTC (1707931845000)


    orchestrator
    ------------
      Instance ID: orchestrator:c640d4e943da6c1d [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/orchestrator.d/conf.yaml.default
      Total Runs: 14
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 5ms
      Last Execution Date : 2024-02-14 17:30:51 UTC (1707931851000)
      Last Successful Execution Date : 2024-02-14 17:30:51 UTC (1707931851000)


====================
Admission Controller
====================


    Webhooks info
    -------------
      MutatingWebhookConfigurations name: datadog-webhook
      Created at: 2024-02-14 17:09:19 +0000 UTC
      ---------
        Name: datadog.webhook.auto.instrumentation
        CA bundle digest: 49e58c003c325ecb
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-agent-cluster-agent-admission-controller - Port: 443 - Path: /injectlib
      ---------
        Name: datadog.webhook.config
        CA bundle digest: 49e58c003c325ecb
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-agent-cluster-agent-admission-controller - Port: 443 - Path: /injectconfig
      ---------
        Name: datadog.webhook.tags
        CA bundle digest: 49e58c003c325ecb
        Object selector: &LabelSelector{MatchLabels:map[string]string{admission.datadoghq.com/enabled: true,},MatchExpressions:[]LabelSelectorRequirement{},}
        Rule 1: Operations: [CREATE] - APIGroups: [] - APIVersions: [v1] - Resources: [pods]
        Service: default/datadog-agent-cluster-agent-admission-controller - Port: 443 - Path: /injecttags

    Secret info
    -----------
    Secret name: webhook-certificate
    Secret namespace: default
    Created at: 2024-02-14 17:09:19 +0000 UTC
    CA bundle digest: 49e58c003c325ecb
    Duration before certificate expiration: 8759h38m27.516535667s

=============
Autodiscovery
=============

  Enabled Features
  ================
    kubernetes
    orchestratorexplorer

==========================
Cluster Checks Dispatching
==========================

  Status: Leader, serving requests
  Active agents: 1
  Check Configurations: 0
    - Dispatched: 0
    - Unassigned: 0


=====================
Custom Metrics Server
=====================

  Disabled: The external metrics provider is not enabled on the Cluster Agent

===============
Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-agent-cluster-agent-5b55c4cc87-9sk7b
  Last Acquisition of the lease: Wed, 14 Feb 2024 17:29:08 UTC
  Renewed leadership: Wed, 14 Feb 2024 17:30:38 UTC
  Number of leader transitions: 3 transitions

=====================
Orchestrator Explorer
=====================


  Collection Status: The collection is at least partially running since the cache has been populated.
  Cluster Name: agent-cluster
  Cluster ID: 177e8363-cd5d-46bc-9190-af292581b872
  Container scrubbing: enabled
  Manifest collection: enabled

  ======================
  Orchestrator Endpoints
  ======================
    https://orchestrator.datadoghq.com - API Key ending with: 72724

  ===========
  Cache Stats
  ===========
    Elements in the cache: 240

    ClusterRoleBinding
      Last Run: (Hits: 56 Miss: 0) | Total: (Hits: 560 Miss: 56)

    ClusterRole
      Last Run: (Hits: 70 Miss: 0) | Total: (Hits: 700 Miss: 70)

    Cluster
      Last Run: (Hits: 0 Miss: 1) | Total: (Hits: 0 Miss: 11)

    CronJob
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    CustomResourceDefinition
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    DaemonSet
      Last Run: (Hits: 3 Miss: 0) | Total: (Hits: 30 Miss: 3)

    Deployment
      Last Run: (Hits: 3 Miss: 0) | Total: (Hits: 30 Miss: 3)

    HorizontalPodAutoscaler
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Ingress
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Job
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Namespace
      Last Run: (Hits: 4 Miss: 1) | Total: (Hits: 40 Miss: 15)

    Node
      Last Run: (Hits: 1 Miss: 0) | Total: (Hits: 9 Miss: 2)

    PersistentVolumeClaim
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    PersistentVolume
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    Pod
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

    ReplicaSet
      Last Run: (Hits: 5 Miss: 0) | Total: (Hits: 50 Miss: 5)

    RoleBinding
      Last Run: (Hits: 13 Miss: 0) | Total: (Hits: 130 Miss: 13)

    Role
      Last Run: (Hits: 13 Miss: 0) | Total: (Hits: 130 Miss: 13)

    ServiceAccount
      Last Run: (Hits: 44 Miss: 0) | Total: (Hits: 440 Miss: 44)

    Service
      Last Run: (Hits: 5 Miss: 0) | Total: (Hits: 50 Miss: 5)

    StatefulSet
      Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 0 Miss: 0)

  =====================
  Manifest Buffer Stats
  =====================
  Buffer Flushed : 13 times
  Last Time Flushed Manifests : 1
  ==============================
  Manifests Flushed Per Resource
  ==============================
    ClusterRole : 70
    ClusterRoleBinding : 56
    DaemonSet : 3
    Deployment : 3
    Namespace : 15
    Node : 2
    ReplicaSet : 5
    Role : 13
    RoleBinding : 13
    Service : 5
    ServiceAccount : 44





==========
Aggregator
==========

  Checks Metric Sample: 2,771
  Dogstatsd Metric Sample: 1
  Event: 12
  Events Flushed: 11
  Number Of Flushes: 8
  Series Flushed: 2,351
  Service Check: 42
  Service Checks Flushed: 44

=========
Endpoints
=========

  https://app.datadoghq.com - API Key ending with:
      - 72724

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 11
    ClusterRole: 1
    ClusterRoleBinding: 1
    CronJob: 0
    CustomResource: 0
    CustomResourceDefinition: 0
    DaemonSet: 1
    Deployment: 1
    Dropped: 44
    HighPriorityQueueFull: 0
    HorizontalPodAutoscaler: 0
    Ingress: 0
    Job: 0
    Namespace: 11
    Node: 2
    OrchestratorManifest: 11
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 1
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 1
    RoleBinding: 1
    Service: 1
    ServiceAccount: 1
    StatefulSet: 0
    VerticalPodAutoscaler: 0

  Transaction Successes
  =====================
    Total number: 23
    Successes By Endpoint:
      check_run_v1: 8
      intake: 7
      series_v2: 8

  HTTP Errors
  ==================
    Total number: 44
    HTTP Errors By Code:
      403: 44

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

There is one question I would like answered:

  • ~~The current cluster agent displays the logs agent information: https://github.com/DataDog/datadog-agent/blob/77336caf87eee833a9e872b21ea30040ee0d1cc7/pkg/status/clusteragent/clusteragent.go#L39-L41. The logs agent component exposes the status provider automatically using FX. The run command for the cluster agent does not include the comp/logs/agent dependency. Should we add it to display the logs information as well? Or should we not add the logs agent to the cluster agent?~~

  • There is also the cluster-agent-cloudfoundry command. It does not have a status subcommand, but it requires passing the same components as cluster-agent. Does this command actually display any status output?

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Validate that the cluster-agent status output displays correctly for Text and JSON versions.
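
A small helper along these lines could support that check. It is only a sketch: `missingSections` and `validJSONObject` are hypothetical names, and it assumes the text and JSON outputs have already been captured into variables (how they are captured is not covered here).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// missingSections returns the titles from want that do not appear in text as a
// top-level banner: the title on its own line, framed by '=' lines of the same
// length, as in the text output shown in this PR.
func missingSections(text string, want []string) []string {
	var missing []string
	for _, title := range want {
		bar := strings.Repeat("=", len(title))
		banner := bar + "\n" + title + "\n" + bar + "\n"
		if !strings.Contains(text, banner) {
			missing = append(missing, title)
		}
	}
	return missing
}

// validJSONObject reports whether payload decodes into a non-null JSON object.
func validJSONObject(payload []byte) bool {
	var obj map[string]interface{}
	return json.Unmarshal(payload, &obj) == nil && obj != nil
}

func main() {
	// Sample inputs for illustration only; real inputs would be the captured
	// text and JSON outputs of the status command.
	text := "========\nHostname\n========\n\n  hostname: example\n\n" +
		"=========\nCollector\n=========\n\n"
	payload := []byte(`{"hostname": "example"}`)

	fmt.Println("missing sections:", missingSections(text, []string{"Hostname", "Collector", "Forwarder"}))
	fmt.Println("valid JSON object:", validJSONObject(payload))
}
```

The banner match is intentionally strict, so indented sub-section banners such as "Running Checks" are not treated as top-level sections.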

There are a few noticeable changes in the cluster agent status output.

  • The `Check Runners: 4` line is no longer displayed. I'm working on a separate PR to add it to the collector section.
  • The Python version was previously not displayed; it now shows as `Python Version: n/a`.
  • The Logs Agent section is no longer displayed.
  • The order of the sections has changed. It is not ordered alphabetically.

Reviewer's Checklist

  • [ ] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change has a major impact on the code base, impacts multiple teams, or changes important well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a releasenote.
  • [ ] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [ ] Adequate QA/testing plan information is provided, unless the qa/skip-qa label is applied together with the required qa/done or qa/no-code-change label.
  • [ ] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label, indicating the lowest Kubernetes version compatible with this feature, has been applied.
  • [ ] If applicable, the config template has been updated.

GustavoCaso, Feb 06 '24 12:02