nodejs-pubsub
nodejs-pubsub copied to clipboard
OpenTelemetry integration misses the point: we need to propagate the span/context!
The code where the Subscriber reads the span from the attributes just creates the span and then throws it away: https://github.com/googleapis/nodejs-pubsub/blob/99107930bbed8e29834acb2347f65efc6a39958f/src/subscriber.ts#L517-L527
A good OpenTelemetry customer would then set the span to the current span in the on("message")
handler, so that any subsequent processing is attached to the remote span.
I'll also file a Google support case.
Environment details
-
@google-cloud/pubsub
version: master / v2.17.0
Steps to reproduce
- Enable OpenTelemetry in both publisher and subscriber using
{ enableOpenTelemetryTracing: true }
options on topic and subscription. - Perform some processing that creates spans, either manually or (for example) do a Datastore Query or Put.
It can be observed that actions are linked to the subscriber span, as can also be seen in this screenshot (spans take <1ms because nothing is done inside):
Relates to #1078.
@sethmaxwl was the scope of the initial OpenTelemetry PR to just link the publish and subscribe side together? I think it missed the point for any consumer application if you want to link the publish to more downstream actions inside the subscriber.
@hermanbanken Thanks for the comments! We are in the middle of trying to harmonize our treatment of OpenTelemetry for Pub/Sub across languages/libraries. The Node one was sort of a pilot project to see how it goes, and we're hoping to make that better. Feel free to add any comments you have (or anyone who wants to comment!) and we'll add it to the discussion.
One remark I have is that it should be documented a little bit more how it is working. For example, the (semantic) key name and the data format of the propagation key could be documented. I believe it was __googPubsubTrace
-something and I think it would be nice if that could be not some magic internal constant. Also, why not use the standard w3c trace header name and format?
Good point, I think Aaron suggested the same regarding W3C trace format. Existing propagator should be reusable. I struggled to get it all working with the recent Otel-js. If you want you can raise a PR or I can have a stab at it during my next tech debt day at work
I am guess we should make the span we created the active one? Got to look a bit more into the workings of subscriber class
@hermanbanken Did you hear back from Google Support about your case?
At September 1st they said that they would working on it "next quarter".
Public tracker: https://issuetracker.google.com/issues/183608476
That tracker seems to be inaccessible now... weird.
+1
Going to revisit it this week. Have to solve it for work :)
On Sun, Feb 20, 2022 at 3:43 PM naseemkullah @.***> wrote:
+1
— Reply to this email directly, view it on GitHub https://github.com/googleapis/nodejs-pubsub/issues/1389#issuecomment-1046265027, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABXCIK5RBIXPARJ4MCLTDU4ED2FANCNFSM5DCBVHCA . You are receiving this because you commented.Message ID: @.***>
I came up with this as a workaround (there is a comment in the event listener about what I think this library should provide):
import {Message, PubSub} from '@google-cloud/pubsub';
import {
ROOT_CONTEXT,
SpanContext,
SpanStatusCode,
trace,
} from '@opentelemetry/api';
const tracer = trace.getTracer('foo');
const pubsub = new PubSub({projectId: 'abc'});
const enableOpenTelemetryTracing = {
enableOpenTelemetryTracing: true,
};
export async function subcribe() {
const subscription = pubsub.subscription(
'testSubscription',
enableOpenTelemetryTracing
);
subscription
.on('message', (message: Message) => {
// it would be convenient if everything in this handler
// apart from `handleMessage(message)` logic was provided by the GCP pub/sub library
// when enableOpenTelemetryTracing is true
const spanContext = getSpanContext(message);
if (spanContext === undefined) {
handleMessage(message);
} else {
const context = trace.setSpanContext(ROOT_CONTEXT, spanContext);
tracer.startActiveSpan('handleMessage', {}, context, span => {
try {
handleMessage(message);
span.setStatus({code: SpanStatusCode.OK});
} catch (err) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: err instanceof Error ? err.message : undefined,
});
throw err;
} finally {
span.end();
}
});
}
})
.on('error', (error: Error) => {
console.error('Received error:', error);
});
}
function handleMessage(message: Message) {
console.log('Received message:', message.data.toString());
message.ack();
}
function getSpanContext(message: Message): SpanContext | undefined {
try {
return JSON.parse(
message.attributes['googclient_OpenTelemetrySpanContext']
);
} catch (error) {
return undefined;
}
}
worth noting, while the message acknowledgment and any processing leading up to it are part of the same trace, the /google.pubsub.v1.Subscriber/ModifyAckDeadline
span is still disconnected
edit: so apparently "span links" are the proper way to link a messages consumption to its production (as @weyert menitons below), above code snippet should probably not be used
Thank you, that looks useful. I think we might be able to use span links to connect modifyackdeadline and the other cools. But to me having it not linked is already a great improvement
The tracker page is still there for me, internally at least. Weird. Anyway, nothing has happened on this yet, but it's increasing in priority right now. Definitely not forgotten.
Yeah, I hope to have a stab at it soon. So far more important stuff at work than this.
You might not want to put too much into it, just because we've got a revamped design for OT that will be rolling out across all the languages, over time. Of note, it's going to propagate the span and context by way of the W3C spec, as well as doing better about "closing the loop" on the subscriber end. We've got a prototype implementation in Go, and that just needs to be expanded across all the languages.
https://github.com/googleapis/google-cloud-go/pull/5034
If you see anything you want in the Node implementation that's missing from the Go one, please feel free to comment!
Yeah, the changes I wanted to propose via PR would be heavily based on the Go implementation :)
Let me know, if you prefer me to wait, that's fine, happy to wait for a Google implementation if it's coming in short term
Great news that this is being fixed! Just spent a day scratching my head as to why my traces were disconnected...
Sorry for the radio silence on this topic, but it's still a major objective for us. None of this is official until it's official (please don't take it as an announcement/promise), but the current thought is that we're planning to shift it to use the more standard mechanism (w3c headers), and to get it piped all the way through. A preview of the Go version produces a really nice graph in Cloud Trace, showing the time spent in each phase of the library (batching, ack leasing, etc). There isn't currently anything that will be traced in the server, but the end to end trace should at least allow you to see how long it spends in the server before being sent out.
Mentioned in the PR, but mentioning here too: there's a suuuuper early preview branch now if anyone wants to play around with what's done. I don't recommend using it in any kind of production scenario or assuming any of the interfaces or trace formats are final.
That PR is here: https://github.com/googleapis/nodejs-pubsub/pull/1659
And the branch in googleapis that anyone can make PRs against if you want: https://github.com/googleapis/nodejs-pubsub/tree/otel-2
I'll be out for a bit but plan to pick it up again in early 2023.
@feywind Just wanted to check in and see how 2023 was treating this work, any updates yet?
I've been posting updates on the PR, but I'm planning to make a preview version soon.
Tiny sample size, I admit, but quick survey for those on this thread and wanting OTel support sooner rather than later: do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?
@feywind for me it’s a human debug tool:) I am personally not planning to generate metrics from the traces via like otel-collector
do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?
Mainly it is human debug tool, but there might be some alerting around it, I would expect that changes to span names or changes in amount of spans emitted to be mentioned in changelog/release notes. Mainly because there is some little cost attached to each single span.
Pretty much entirely human debug tool, but mentioning any significant changes in the changelog would be helpful for updating any internal process documentation/etc.
Same here as others have said - only used for some alerting and saved searches in grafana, "production code" doesn't care
Same here.
Thanks everyone, I appreciate the input. Our main goal for the moment is more of a human debug solution, too, so I think we're all aligned. :) A sort of "soft interface" is an interesting idea, not needing a whole new major as long as updates are mentioned consistently somewhere. That'll be good to take into account.
That #1833 is the version rebased onto main
(4.x). I'm planning to get that released and then go back to do any final backporting to 3.x.
Quick update, there are some beta testing instructions over here if anyone is interested.
Has anyone had a chance to try this out? There are a few things that need to be finished up to get this released, but it'll probably be pretty soon.
I have tried it during local development and works pretty well for me!