
[BUG] KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}

Open froque opened this issue 1 year ago • 7 comments

Agent Environment

$ sudo datadog-agent version 
Agent 7.54.1 - Commit: 44d1992 - Serialization version: v5.0.114 - Go version: go1.21.9

Describe what happened:

After upgrading to 7.54.0, Kafka consumer lag checks started to fail.

Describe what you expected:

Expected the Datadog Agent to continue retrieving Kafka consumer lag offsets from the Kafka cluster.

Steps to reproduce the issue:

  • Upgrade to v7.54.0 or v7.54.1
  • Configure Datadog to check Kafka consumer offsets
$ sudo cat /etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
init_config:

instances:
  - kafka_connect_str:
      - <redacted>
    security_protocol: SASL_SSL
    sasl_mechanism: PLAIN
    sasl_plain_username: <redacted>

    sasl_plain_password: <redacted>

    kafka_consumer_offsets: true
    monitor_unlisted_consumer_groups: true
  • Perform a check
$ sudo datadog-agent check kafka_consumer


  Running Checks
  ==============
    
    kafka_consumer (4.3.0)
    ----------------------
      Instance ID: kafka_consumer:24b8757764ea1a30 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
      Total Runs: 1
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 5.099s
      Last Execution Date : 2024-06-24 09:11:07 WEST / 2024-06-24 08:11:07 UTC (1719216667000)
      Last Successful Execution Date : Never
      Error: Unable to connect to the AdminClient. This is likely due to an error in the configuration.
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 34, in check
          self.client.request_metadata_update()
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/kafka_consumer/client.py", line 180, in request_metadata_update
          self.kafka_client.list_topics(None, timeout=self.config._request_timeout)
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka/admin/__init__.py", line 603, in list_topics
          return super(AdminClient, self).list_topics(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      cimpl.KafkaException: KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"}
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/base/checks/base.py", line 1224, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.11/site-packages/datadog_checks/kafka_consumer/kafka_consumer.py", line 36, in check
          raise Exception(
      Exception: Unable to connect to the AdminClient. This is likely due to an error in the configuration.

  Metadata
  ========
    config.hash: kafka_consumer:24b8757764ea1a30
    config.provider: file

Additional environment details (Operating System, Cloud provider, etc):

froque avatar Jun 24 '24 08:06 froque

As a workaround, either disabling tls_verify or setting tls_ca_cert works:

$ tail -n2 /etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
    tls_verify: false
    tls_ca_cert: /opt/datadog-agent/embedded/ssl/certs/cacert.pem

froque avatar Jun 24 '24 09:06 froque

Hello @froque! Thanks for opening this issue and sharing the workaround. I'm going to transfer the issue to integrations-core because that is where the integration lives. I'll let them know so they'll be able to take care of this.

FlorentClarret avatar Jun 24 '24 09:06 FlorentClarret

@froque can you open a support case? Also, you can use the script in tests/python_client/script.py to run a barebones connection directly to the cluster for debugging. The script attempts a connection and then fetches all of the consumer groups for that configuration. Please include its output with the support case, along with a Debug flare.
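For reference, a rough sketch of the kind of configuration that script ends up handing to confluent-kafka (the helper name and all values below are illustrative placeholders, not the integration's actual code):

```python
# Illustrative sketch: map Datadog instance options from conf.yaml to the
# librdkafka properties that confluent-kafka's AdminClient consumes.
# build_client_config and every value here are placeholders.
def build_client_config(instance):
    return {
        "bootstrap.servers": ",".join(instance["kafka_connect_str"]),
        "socket.timeout.ms": 5000,
        "client.id": "dd-agent",
        "security.protocol": instance["security_protocol"].lower(),
        "sasl.mechanism": instance["sasl_mechanism"],
        "sasl.username": instance["sasl_plain_username"],
        "sasl.password": instance["sasl_plain_password"],
        # The workaround from this thread: point librdkafka at the Agent's CA bundle.
        "ssl.ca.location": "/opt/datadog-agent/embedded/ssl/certs/cacert.pem",
    }

conf = build_client_config({
    "kafka_connect_str": ["broker.example.com:9092"],  # placeholder broker
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "user",
    "sasl_plain_password": "secret",
})
# To attempt a real connection (requires confluent-kafka and network access):
# from confluent_kafka.admin import AdminClient
# AdminClient(conf).list_topics(timeout=5)
print(conf["security.protocol"])  # → sasl_ssl
```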

HadhemiDD avatar Jun 24 '24 11:06 HadhemiDD

$ /opt/datadog-agent/embedded/bin/python script.py 
bootstrap.servers=<redacted>
socket.timeout.ms=5000
client.id=dd-agent
security.protocol=sasl_ssl
ssl.endpoint.identification.algorithm=none
enable.ssl.certificate.verification=true
sasl.mechanism=PLAIN
sasl.username=<redacted>
sasl.password=*****
Connecting to AdminClient
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.081|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.081|FAIL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: SSL handshake failed: error:0A000086:SSL routines::certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (install ca-certificates package) (after 34ms in state SSL_HANDSHAKE)
%3|1719239855.009|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.009|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|FAIL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: SSL handshake failed: error:0A000086:SSL routines::certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (install ca-certificates package) (after 32ms in state SSL_HANDSHAKE, 1 identical error(s) suppressed)
^CTraceback (most recent call last):
  File "/home/pminds/script.py", line 87, in <module>
    main()
  File "/home/pminds/script.py", line 80, in main
    results = future.result()
              ^^^^^^^^^^^^^^^
  File "/opt/datadog-agent/embedded/lib/python3.11/concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
  File "/opt/datadog-agent/embedded/lib/python3.11/threading.py", line 327, in wait
    waiter.acquire()
KeyboardInterrupt

From what I have explored so far, it seems that v7.54.0 expects CA certificates in /usr/local/ssl/certs rather than in /opt/datadog-agent/embedded/ssl/certs/ as v7.53.0 did.

froque avatar Jun 24 '24 15:06 froque

Your logs were successfully uploaded. For future reference, your internal case id is 1751844

froque avatar Jun 26 '24 08:06 froque

From what I have explored so far, it seems that v7.54.0 expects CA certificates in /usr/local/ssl/certs rather than in /opt/datadog-agent/embedded/ssl/certs/ as v7.53.0 did.

=> @froque
Can you elaborate on where you found this change? Also, can you try using port 9091 for the Kafka broker instead (update the config on the Kafka side), set the same port on the Datadog side (in script.py), and run the script again to see if it works?

HadhemiDD avatar Jun 27 '24 09:06 HadhemiDD

@HadhemiDD I dug into the differences between the v7.53 and v7.54 Debian packages.

❯ wget --quiet https://apt.datadoghq.com/pool/d/da/datadog-agent_7.53.0-1_amd64.deb
❯ wget --quiet https://apt.datadoghq.com/pool/d/da/datadog-agent_7.54.0-1_amd64.deb
❯ mkdir v7.53 v7.54
❯ ar --output v7.53 x datadog-agent_7.53.0-1_amd64.deb 
❯ ar --output v7.54 x datadog-agent_7.54.0-1_amd64.deb 
❯ tar --directory=v7.53 -Jxf v7.53/data.tar.xz
❯ tar --directory=v7.54 -Jxf v7.54/data.tar.xz

I noticed that librdkafka is no longer in the same path

❯ find -name \*librdkafka\*so\* -type f
./v7.53/opt/datadog-agent/embedded/lib/librdkafka++.so.1
./v7.53/opt/datadog-agent/embedded/lib/librdkafka.so.1
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-27145264.so.1

And a new libcrypto exists

❯ find -name \*libcrypto\*so\* -type f| sort                 
./v7.53/opt/datadog-agent/embedded/lib/libcrypto.so.3
./v7.53/opt/datadog-agent/embedded/lib/python3.11/site-packages/psycopg2_binary.libs/libcrypto-7d0e8add.so.1.1
./v7.54/opt/datadog-agent/embedded/lib/libcrypto.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/aerospike.libs/libcrypto-e31f2095.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/libcrypto-b840c11b.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/psycopg2_binary.libs/libcrypto-7d0e8add.so.1.1

searching for some strings

❯ rgrep '/opt/datadog-agent/embedded/ssl/certs' v7* 
grep: v7.53/opt/datadog-agent/embedded/lib/libcrypto.so.3: binary file matches
grep: v7.54/opt/datadog-agent/embedded/lib/libcrypto.so.3: binary file matches
❯ rgrep '/usr/local/ssl/certs' v7*
grep: v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/libcrypto-b840c11b.so.3: binary file matches
grep: v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/aerospike.libs/libcrypto-e31f2095.so.3: binary file matches
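
A quick way to confirm the compiled-in default certificate locations is the stdlib ssl module; note this reflects the libcrypto the Python interpreter itself links (the Agent's embedded copy when run via /opt/datadog-agent/embedded/bin/python), not the separate copy vendored inside the confluent_kafka wheel, which is the one at fault here:

```python
import ssl

# Print the compiled-in OPENSSLDIR defaults of the libcrypto this interpreter
# links against. The confluent_kafka manylinux wheel bundles its own libcrypto
# with different compiled-in paths (/usr/local/ssl/certs, as found above), so
# it ignores these and needs ssl.ca.location set explicitly.
paths = ssl.get_default_verify_paths()
print("default CA file:", paths.openssl_cafile)
print("default CA dir: ", paths.openssl_capath)
```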

froque avatar Jun 27 '24 14:06 froque

@HadhemiDD we're also facing this issue. It seems there are two problems:

  1. librdkafka still expects a specific path for CA certs (which I thought would be addressed by #17957)
  2. There's no mechanism to specify a scope from the OIDC provider

Bumping up the librdkafka log level, I was able to see the following error: %3|1743455351.252|OIDC|dd-agent#producer-1| [thrd:background]: Failed to retrieve OIDC token from "https://<our_saml_url>/oauth2/default/v1/token": error setting certificate file: /etc/pki/tls/certs/ca-bundle.crt (-1)

After installing ca-certificates and symlinking it to /etc/pki/tls/certs/ca-bundle.crt, I got this: %3|1743455586.679|OIDC|dd-agent#producer-1| [thrd:background]: Failed to retrieve OIDC token from "https://<our_saml_url>/oauth2/default/v1/token": {"error":"invalid_scope","error_description":"The authorization server resource does not have any configured default scopes, 'scope' must be provided."} (400)

I was able to add a scope parameter to extras_parameters here to get it to work with a hardcoded value, but it would be ideal if we could specify this in the instance config instead.

extras_parameters['sasl.oauthbearer.scope'] = "kafka.read"
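
Generalizing that hardcoded change might look something like the following sketch (the sasl_oauth_scope instance option is hypothetical and does not exist in the integration today; sasl.oauthbearer.scope is a real librdkafka OIDC property):

```python
def apply_oauth_scope(extras_parameters, instance):
    # "sasl_oauth_scope" is a hypothetical instance option, not an existing
    # integration setting; it would need to be added to the config model.
    scope = instance.get("sasl_oauth_scope")
    if scope:
        # librdkafka includes this scope in its OIDC token endpoint request.
        extras_parameters["sasl.oauthbearer.scope"] = scope
    return extras_parameters

print(apply_oauth_scope({}, {"sasl_oauth_scope": "kafka.read"}))
# → {'sasl.oauthbearer.scope': 'kafka.read'}
```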

This was from the following agent version in Docker:

Agent 7.64.1 - Commit: 154bd424d2 - Serialization version: v5.0.144 - Go version: go1.23.6

kafka_consumer (6.5.0)

ocient-cliimatta avatar Mar 31 '25 22:03 ocient-cliimatta