MSK Agent Integration broken on KRaft Clusters

Open jcarvalho opened this issue 1 year ago • 3 comments

Amazon MSK has recently launched support for KRaft Clusters, which adds the Controller Nodes to the output of the ListNodes API Call.

These node entries do not have a brokerNodeInfo entry, which causes the Agent Integration to crash with the following error:

2024-06-05 12:41:15 UTC | CORE | ERROR | (pkg/collector/worker/check_logger.go:71 in Error) | check:amazon_msk | Error running check: [{"message": "'BrokerNodeInfo'", "traceback": "Traceback (most recent call last):
  File \"/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/base/checks/base.py\", line 1224, in run
    self.check(instance)
  File \"/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/amazon_msk/amazon_msk.py\", line 115, in check
    broker_info = node_info['BrokerNodeInfo']
                  ~~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'BrokerNodeInfo'
"}]

Ideally, the integration should also scrape the Controller Nodes (which may also expose Prometheus metrics), but it would at least be great to still support scraping the Broker Nodes when KRaft is in use.
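A minimal sketch of the broker-only fallback described above, assuming the check iterates over the `NodeInfoList` entries of a ListNodes response with the PascalCase keys seen in the traceback. The helper name `iter_broker_infos` and the sample data are hypothetical; this is not the integration's actual code.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_broker_infos(node_info_list: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield the BrokerNodeInfo dict of every node that has one.

    Hypothetical helper: controller entries on KRaft clusters carry no
    'BrokerNodeInfo' key, so they are skipped instead of raising KeyError.
    """
    for node_info in node_info_list:
        broker_info = node_info.get('BrokerNodeInfo')
        if broker_info is None:
            continue
        yield broker_info


# Made-up sample mirroring the shape reported in this issue.
sample_nodes = [
    {'NodeType': 'CONTROLLER', 'ControllerNodeInfo': {'Endpoints': ['c-10001.example']}},
    {'NodeType': 'BROKER', 'BrokerNodeInfo': {'BrokerId': 1, 'Endpoints': ['b-1.example']}},
]

broker_ids = [info['BrokerId'] for info in iter_broker_infos(sample_nodes)]
print(broker_ids)
```

Using `.get` rather than indexing keeps the loop tolerant of any node type that lacks broker details, not only controllers.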

Output of the info page

(Posting only relevant version + check information, happy to share more details over DM)

===============
Agent (v7.54.0)
===============
  Status date: 2024-06-05 12:38:51.71 UTC (1717591131710)
  Agent start: 2024-06-05 12:38:46.925 UTC (1717591126925)
  Pid: 1
  Go Version: go1.21.9
  Python Version: 3.11.8
  Build arch: amd64
  Agent flavor: agent
  Log Level: INFO


  Running Checks
  ==============

    amazon_msk (4.7.0)
    ------------------
      Instance ID: amazon_msk:c28c17180d3df175 [ERROR]
      Configuration Source: kube_services:kube_service://datadog-cluster-checks/[REDACTED]
      Total Runs: 36
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 36
      Average Execution Time : 293ms
      Last Execution Date : 2024-06-05 12:48:45 UTC (1717591725000)
      Last Successful Execution Date : Never
      Error: 'BrokerNodeInfo'
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/base/checks/base.py", line 1224, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/amazon_msk/amazon_msk.py", line 115, in check
          broker_info = node_info['BrokerNodeInfo']
                        ~~~~~~~~~^^^^^^^^^^^^^^^^^^
      KeyError: 'BrokerNodeInfo'

Additional environment details (Operating System, Cloud provider, etc):

Steps to reproduce the issue:

  1. Set up a KRaft-enabled MSK Cluster
  2. Set up the MSK Agent Datadog Integration
  3. Verify that the checks runner fails with an exception
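To check whether a given cluster would trigger the failure in step 3, one could list its nodes and count the entries without broker information. The sketch below assumes a boto3 `kafka` client, whose `list_nodes` call takes `ClusterArn` and an optional `NextToken`. The helper names are hypothetical, and the client is passed in so the logic can be exercised with a stub.

```python
from typing import Any, Dict, List


def list_all_nodes(client: Any, cluster_arn: str) -> List[Dict[str, Any]]:
    """Collect every NodeInfoList entry, following NextToken pagination."""
    nodes: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {'ClusterArn': cluster_arn}
    while True:
        response = client.list_nodes(**kwargs)
        nodes.extend(response.get('NodeInfoList', []))
        token = response.get('NextToken')
        if not token:
            return nodes
        kwargs['NextToken'] = token


def count_nodes_without_broker_info(nodes: List[Dict[str, Any]]) -> int:
    """Count entries (e.g. KRaft controllers) lacking 'BrokerNodeInfo'."""
    return sum(1 for node in nodes if 'BrokerNodeInfo' not in node)


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   nodes = list_all_nodes(boto3.client('kafka'), '<cluster ARN>')
#   print(count_nodes_without_broker_info(nodes))
```

A non-zero count on a KRaft cluster would be consistent with the `KeyError` reported here.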

Describe the results you received:

The check fails with the exception above and no metrics are published to Datadog.

Describe the results you expected:

Ideally: The metrics for both the Controllers and the Brokers are published to Datadog. At minimum: The metrics for the Brokers are published to Datadog.

Additional information you deem important (e.g. issue happens only occasionally):

Returned data for the ListNodes call in our KRaft-enabled cluster (redacted URLs and Account/Subnet IDs):

{
  "nodeInfoList": [
    {
      "nodeType": "CONTROLLER",
      "controllerNodeInfo": {
        "endpoints": [
          "c-10002.[redacted].kafka.us-east-1.amazonaws.com"
        ]
      }
    },
    {
      "nodeType": "CONTROLLER",
      "controllerNodeInfo": {
        "endpoints": [
          "c-10003.[redacted].kafka.us-east-1.amazonaws.com"
        ]
      }
    },
    {
      "nodeType": "CONTROLLER",
      "controllerNodeInfo": {
        "endpoints": [
          "c-10001.[redacted].kafka.us-east-1.amazonaws.com"
        ]
      }
    },
    {
      "nodeType": "BROKER",
      "nodeARN": "[redacted]",
      "instanceType": "m7g.large",
      "addedToClusterTime": "2024-06-04T14:03:11.193Z",
      "brokerNodeInfo": {
        "brokerId": 3,
        "clientVpcIpAddress": "[redacted]",
        "endpoints": [
          "b-3.[redacted].kafka.us-east-1.amazonaws.com"
        ],
        "clientSubnet": "[redacted]",
        "currentBrokerSoftwareInfo": {
          "kafkaVersion": "3.7.x.kraft"
        },
        "attachedENIId": "[redacted]"
      }
    },
    {
      "nodeType": "BROKER",
      "nodeARN": "[redacted]",
      "instanceType": "m7g.large",
      "addedToClusterTime": "2024-06-04T14:03:11.166Z",
      "brokerNodeInfo": {
        "brokerId": 2,
        "clientVpcIpAddress": "[redacted]",
        "endpoints": [
          "b-2.[redacted].kafka.us-east-1.amazonaws.com"
        ],
        "clientSubnet": "[redacted]",
        "currentBrokerSoftwareInfo": {
          "kafkaVersion": "3.7.x.kraft"
        },
        "attachedENIId": "[redacted]"
      }
    },
    {
      "nodeType": "BROKER",
      "nodeARN": "[redacted]",
      "instanceType": "m7g.large",
      "addedToClusterTime": "2024-06-04T14:03:11.139Z",
      "brokerNodeInfo": {
        "brokerId": 1,
        "clientVpcIpAddress": "[redacted]",
        "endpoints": [
          "b-1.[redacted].kafka.us-east-1.amazonaws.com"
        ],
        "clientSubnet": "[redacted]",
        "currentBrokerSoftwareInfo": {
          "kafkaVersion": "3.7.x.kraft"
        },
        "attachedENIId": "[redacted]"
      }
    }
  ]
}
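The response above distinguishes node kinds through `nodeType`, so endpoints can be grouped without assuming every entry has `brokerNodeInfo`. A rough sketch, assuming the camelCase JSON shape shown here (boto3 responses use PascalCase keys instead); the function name and the abbreviated sample are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Key holding the per-type details for each nodeType seen in the output above.
INFO_KEY_BY_TYPE = {'BROKER': 'brokerNodeInfo', 'CONTROLLER': 'controllerNodeInfo'}


def endpoints_by_node_type(raw_json: str) -> Dict[str, List[str]]:
    """Group endpoints by nodeType, ignoring entries of unknown type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for node in json.loads(raw_json).get('nodeInfoList', []):
        node_type = node.get('nodeType')
        info_key = INFO_KEY_BY_TYPE.get(node_type)
        if info_key is None:
            continue
        grouped[node_type].extend(node.get(info_key, {}).get('endpoints', []))
    return dict(grouped)


# Abbreviated, made-up sample in the same shape as the redacted output.
sample = '''
{"nodeInfoList": [
  {"nodeType": "CONTROLLER", "controllerNodeInfo": {"endpoints": ["c-10001.example"]}},
  {"nodeType": "BROKER", "brokerNodeInfo": {"brokerId": 1, "endpoints": ["b-1.example"]}}
]}
'''
result = endpoints_by_node_type(sample)
print(result)
```

Keeping the controller endpoints in their own group leaves room to scrape them later without changing how brokers are handled.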

jcarvalho, Jun 05 '24 12:06

I can confirm that patching the amazon_msk.py file to add the following check at this line makes the integration work correctly again:

            # Skip nodes without broker details (e.g. KRaft controllers).
            if 'BrokerNodeInfo' not in node_info:
                continue

Would be great to get this upstream to remove the local patch :slightly_smiling_face:

jcarvalho, Jul 12 '24 09:07

Would love to see this get some progress!

notwedtm, Aug 08 '24 17:08

+1 on this issue, would really love to see this addressed!

Noojuno, Aug 09 '24 04:08

+1

wesdek, Dec 17 '24 08:12

+1

quentinsch, Dec 17 '24 08:12

+1

smokieg, Jan 06 '25 23:01

👋 Sorry for not getting to this sooner, but thanks for the report and the suggested fix. I'll try to get this in soon™.

steveny91, Jan 29 '25 02:01

Closing this, as the fix was released in the most recent Agent, 7.64.0. Feel free to re-open if needed.

steveny91, Mar 20 '25 18:03

@steveny91 That's great to hear, thanks!

Are there any plans to also fetch metrics from the Controller Nodes? The recent release of Kafka 4.0 removes ZooKeeper support entirely, so I'd expect KRaft clusters to become the norm, and it would be great to get the additional metrics.

jcarvalho, Mar 21 '25 13:03