dd-trace-js icon indicating copy to clipboard operation
dd-trace-js copied to clipboard

Fix KafkaJS instrumentation of orphaned promise

Open mastermatt opened this issue 1 year ago • 7 comments

What does this PR do?

Refactors querying the cluster ID in the KafkaJS instrumentation, so that unhandled rejections can't occur if the underlying client is unable to connect to the broker.

Motivation

Bug added in https://github.com/DataDog/dd-trace-js/pull/4808, released under 5.25.0 Similar in effect to the bug which was resolved with https://github.com/DataDog/dd-trace-js/pull/4112

The addition of the logic which added the clusterId to publish and consume spans created orphaned promises in producer and consumer methods which don't become re-exposed until the send or run methods are called respectively. This results in unhandled rejections, and by default, exits the process if those secondary methods are not called in time and admin methods fail.

Plugin Checklist

Additional Notes

Reproduction:

require('dd-trace/init')

const { Kafka } = require('kafkajs')
const { config } = require('@/config')

process.on('unhandledRejection', err => {
  console.error('unhandledRejection', err)
  throw err
})

const main = async () => {
  const producer = new Kafka({
    ...config.KAFKA,
    retry: { retries: 0 },
  }).producer()

  await producer.connect() // the error thrown here will be caught safely
  await producer.send({
    messages: [{ value: 'hello world' }],
    topic: 'my.topic',
  })
}

main().catch(err => {
  console.error('boom', err)
})

If the above file is run when no Kafka broker is running/accessible, the expected behavior is for the connect() to timeout, throw an error, and be caught and logged at the bottom with a "boom" message.

Current behavior will log the expected error from the rejected connect(), however, there will also be an unhandled rejection log with a stack trace that originates from the synchronous producer() call.

I'd like to call out that this file has exhibited the same bug twice in the last year and there doesn't seem to be any test file for the KafkaJS instrumentation. I'm willing to add tests go along with this PR, but I'd like guidance on test strategy from the DD team. The PR which added the bug only updated a test suite in datadog-plugin-kafkajs, and only covered happy paths. It's unclear to me if any new tests should live in datadog-plugin-kafkajs or datadog-instrumentations/test.

cc: @wconti27 @bengl @juan-fernandez

mastermatt avatar Feb 13 '25 20:02 mastermatt

This pull request has been marked as stale due to 90 days of inactivity. If this is still relevant, please update or comment to keep it open. If this should be kept open indefinitely, please apply the label keep-open. Otherwise, it will be automatically closed after 14 days.

github-actions[bot] avatar May 15 '25 04:05 github-actions[bot]

I have no idea who to ping to get this reviewed. @watson simply because you have recent activity on the repo and I'm taking a shot in the dark.

mastermatt avatar May 15 '25 13:05 mastermatt

@mastermatt thanks for the PR and the ping, I am going to try to take a look at it tomorrow. It is strongly appreciated that you opened the PR with the fix. We can't run our CI on external contributions, so we have to port this over.

About the tests: adding a test is definitely required. We do have unit tests and also integration tests in datadog-plugin-kafkajs and that seems the best place for additional tests.

BridgeAR avatar May 15 '25 15:05 BridgeAR

I work on it in https://github.com/DataDog/dd-trace-js/pull/5837. There seems to be another issue about potentially manipulating input data and I optimize the code for higher performance as well.

BridgeAR avatar Jun 05 '25 16:06 BridgeAR

This pull request has been marked as stale due to 90 days of inactivity. If this is still relevant, please update or comment to keep it open. If this should be kept open indefinitely, please apply the label keep-open. Otherwise, it will be automatically closed after 14 days.

github-actions[bot] avatar Sep 04 '25 04:09 github-actions[bot]

I consider it still relevant, the other PR hasn't had any movement either.

mastermatt avatar Sep 04 '25 13:09 mastermatt

This pull request has been marked as stale due to 90 days of inactivity. If this is still relevant, please update or comment to keep it open. If this should be kept open indefinitely, please apply the label keep-open. Otherwise, it will be automatically closed after 14 days.

github-actions[bot] avatar Dec 04 '25 04:12 github-actions[bot]