nodejs-pubsub icon indicating copy to clipboard operation
nodejs-pubsub copied to clipboard

OpenTelemetry integration misses the point: we need to propagate the span/context!

Open hermanbanken opened this issue 2 years ago • 32 comments

The code where the Subscriber reads the span from the attributes just creates the span and then throws it away: https://github.com/googleapis/nodejs-pubsub/blob/99107930bbed8e29834acb2347f65efc6a39958f/src/subscriber.ts#L517-L527

A good OpenTelemetry customer would then set the span to the current span in the on("message") handler, so that any subsequent processing is attached to the remote span.

I'll also file a Google support case.

Environment details

  • @google-cloud/pubsub version: master / v2.17.0

Steps to reproduce

  1. Enable OpenTelemetry in both publisher and subscriber using { enableOpenTelemetryTracing: true } options on topic and subscription.
  2. Perform some processing that creates spans, either manually or (for example) do a Datastore Query or Put.

It can be observed that actions are linked to the subscriber span, as can also be seen in this screenshot (spans take <1ms because nothing is done inside): Screenshot 2021-08-30 at 16 47 23

Relates to #1078.

hermanbanken avatar Aug 30 '21 14:08 hermanbanken

@sethmaxwl was the scope of the initial OpenTelemetry PR to just link the publish and subscribe side together? I think it missed the point for any consumer application if you want to link the publish to more downstream actions inside the subscriber.

hermanbanken avatar Aug 30 '21 14:08 hermanbanken

@hermanbanken Thanks for the comments! We are in the middle of trying to harmonize our treatment of OpenTelemetry for Pub/Sub across languages/libraries. The Node one was sort of a pilot project to see how it goes, and we're hoping to make that better. Feel free to add any comments you have (or anyone who wants to comment!) and we'll add it to the discussion.

feywind avatar Aug 30 '21 15:08 feywind

One remark I have is that it should be documented a little bit more how it is working. For example, the (semantic) key name and the data format of the propagation key could be documented. I believe it was __googPubsubTrace-something and I think it would be nice if that could be not some magic internal constant. Also, why not use the standard w3c trace header name and format?

hermanbanken avatar Sep 06 '21 08:09 hermanbanken

Good point, I think Aaron suggested the same regarding W3C trace format. Existing propagator should be reusable. I struggled to get it all working with the recent Otel-js. If you want you can raise a PR or I can have a stab at it during my next tech debt day at work

I am guess we should make the span we created the active one? Got to look a bit more into the workings of subscriber class

weyert avatar Sep 11 '21 02:09 weyert

@hermanbanken Did you hear back from Google Support about your case?

weyert avatar Oct 20 '21 14:10 weyert

At September 1st they said that they would working on it "next quarter".

Public tracker: https://issuetracker.google.com/issues/183608476

hermanbanken avatar Oct 20 '21 14:10 hermanbanken

That tracker seems to be inaccessible now... weird.

hermanbanken avatar Oct 20 '21 14:10 hermanbanken

+1

naseemkullah avatar Feb 20 '22 15:02 naseemkullah

Going to revisit it this week. Have to solve it for work :)

On Sun, Feb 20, 2022 at 3:43 PM naseemkullah @.***> wrote:

+1

— Reply to this email directly, view it on GitHub https://github.com/googleapis/nodejs-pubsub/issues/1389#issuecomment-1046265027, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABXCIK5RBIXPARJ4MCLTDU4ED2FANCNFSM5DCBVHCA . You are receiving this because you commented.Message ID: @.***>

weyert avatar Feb 20 '22 16:02 weyert

I came up with this as a workaround (there is a comment in the event listener about what I think this library should provide):

import {Message, PubSub} from '@google-cloud/pubsub';
import {
  ROOT_CONTEXT,
  SpanContext,
  SpanStatusCode,
  trace,
} from '@opentelemetry/api';

const tracer = trace.getTracer('foo');

const pubsub = new PubSub({projectId: 'abc'});

const enableOpenTelemetryTracing = {
  enableOpenTelemetryTracing: true,
};

export async function subcribe() {
  const subscription = pubsub.subscription(
    'testSubscription',
    enableOpenTelemetryTracing
  );

  subscription
    .on('message', (message: Message) => {
      // it would be convenient if everything in this handler
      // apart from `handleMessage(message)` logic was provided by the GCP pub/sub library
      // when enableOpenTelemetryTracing is true
      const spanContext = getSpanContext(message);

      if (spanContext === undefined) {
        handleMessage(message);
      } else {
        const context = trace.setSpanContext(ROOT_CONTEXT, spanContext);

        tracer.startActiveSpan('handleMessage', {}, context, span => {
          try {
            handleMessage(message);
            span.setStatus({code: SpanStatusCode.OK});
          } catch (err) {
            span.setStatus({
              code: SpanStatusCode.ERROR,
              message: err instanceof Error ? err.message : undefined,
            });
            throw err;
          } finally {
            span.end();
          }
        });
      }
    })
    .on('error', (error: Error) => {
      console.error('Received error:', error);
    });
}

function handleMessage(message: Message) {
  console.log('Received message:', message.data.toString());
  message.ack();
}

function getSpanContext(message: Message): SpanContext | undefined {
  try {
    return JSON.parse(
      message.attributes['googclient_OpenTelemetrySpanContext']
    );
  } catch (error) {
    return undefined;
  }
}

worth noting, while the message acknowledgment and any processing leading up to it are part of the same trace, the /google.pubsub.v1.Subscriber/ModifyAckDeadline span is still disconnected

edit: so apparently "span links" are the proper way to link a messages consumption to its production (as @weyert menitons below), above code snippet should probably not be used

naseemkullah avatar Feb 21 '22 01:02 naseemkullah

Thank you, that looks useful. I think we might be able to use span links to connect modifyackdeadline and the other cools. But to me having it not linked is already a great improvement

weyert avatar Feb 21 '22 02:02 weyert

The tracker page is still there for me, internally at least. Weird. Anyway, nothing has happened on this yet, but it's increasing in priority right now. Definitely not forgotten.

feywind avatar Mar 24 '22 19:03 feywind

Yeah, I hope to have a stab at it soon. So far more important stuff at work than this.

weyert avatar Mar 24 '22 22:03 weyert

You might not want to put too much into it, just because we've got a revamped design for OT that will be rolling out across all the languages, over time. Of note, it's going to propagate the span and context by way of the W3C spec, as well as doing better about "closing the loop" on the subscriber end. We've got a prototype implementation in Go, and that just needs to be expanded across all the languages.

https://github.com/googleapis/google-cloud-go/pull/5034

If you see anything you want in the Node implementation that's missing from the Go one, please feel free to comment!

feywind avatar Mar 28 '22 19:03 feywind

Yeah, the changes I wanted to propose via PR would be heavily based on the Go implementation :)

Let me know, if you prefer me to wait, that's fine, happy to wait for a Google implementation if it's coming in short term

weyert avatar Mar 28 '22 20:03 weyert

Great news that this is being fixed! Just spent a day scratching my head as to why my traces were disconnected...

GraemeF avatar Aug 03 '22 17:08 GraemeF

Sorry for the radio silence on this topic, but it's still a major objective for us. None of this is official until it's official (please don't take it as an announcement/promise), but the current thought is that we're planning to shift it to use the more standard mechanism (w3c headers), and to get it piped all the way through. A preview of the Go version produces a really nice graph in Cloud Trace, showing the time spent in each phase of the library (batching, ack leasing, etc). There isn't currently anything that will be traced in the server, but the end to end trace should at least allow you to see how long it spends in the server before being sent out.

feywind avatar Aug 03 '22 18:08 feywind

Mentioned in the PR, but mentioning here too: there's a suuuuper early preview branch now if anyone wants to play around with what's done. I don't recommend using it in any kind of production scenario or assuming any of the interfaces or trace formats are final.

That PR is here: https://github.com/googleapis/nodejs-pubsub/pull/1659

And the branch in googleapis that anyone can make PRs against if you want: https://github.com/googleapis/nodejs-pubsub/tree/otel-2

I'll be out for a bit but plan to pick it up again in early 2023.

feywind avatar Dec 22 '22 20:12 feywind

@feywind Just wanted to check in and see how 2023 was treating this work, any updates yet?

0xdevalias avatar Feb 02 '23 05:02 0xdevalias

I've been posting updates on the PR, but I'm planning to make a preview version soon.

feywind avatar Jun 20 '23 18:06 feywind

Tiny sample size, I admit, but quick survey for those on this thread and wanting OTel support sooner rather than later: do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?

feywind avatar Jun 23 '23 16:06 feywind

@feywind for me it’s a human debug tool:) I am personally not planning to generate metrics from the traces via like otel-collector

weyert avatar Jun 23 '23 18:06 weyert

do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?

Mainly it is human debug tool, but there might be some alerting around it, I would expect that changes to span names or changes in amount of spans emitted to be mentioned in changelog/release notes. Mainly because there is some little cost attached to each single span.

ervin-pactum avatar Jun 24 '23 07:06 ervin-pactum

Pretty much entirely human debug tool, but mentioning any significant changes in the changelog would be helpful for updating any internal process documentation/etc.

0xdevalias avatar Jun 25 '23 23:06 0xdevalias

Same here as others have said - only used for some alerting and saved searches in grafana, "production code" doesn't care

SimenB avatar Jun 26 '23 12:06 SimenB

Same here.

danmichaelo avatar Jun 27 '23 06:06 danmichaelo

Thanks everyone, I appreciate the input. Our main goal for the moment is more of a human debug solution, too, so I think we're all aligned. :) A sort of "soft interface" is an interesting idea, not needing a whole new major as long as updates are mentioned consistently somewhere. That'll be good to take into account.

feywind avatar Jun 27 '23 14:06 feywind

That #1833 is the version rebased onto main (4.x). I'm planning to get that released and then go back to do any final backporting to 3.x.

feywind avatar Sep 20 '23 21:09 feywind

Quick update, there are some beta testing instructions over here if anyone is interested.

feywind avatar Apr 04 '24 19:04 feywind

Has anyone had a chance to try this out? There are a few things that need to be finished up to get this released, but it'll probably be pretty soon.

feywind avatar May 01 '24 17:05 feywind

I have tried it during local development and works pretty well for me!

weyert avatar May 01 '24 18:05 weyert