Duplicated traces in opentelemetry2
Bug description
Running https://github.com/apache/camel-quarkus-examples/tree/main/observability with Jaeger can produce a trace flow like this:
We can see there are duplicates (like camel-quarkus-observabilty POST).
It is probably similar to the issue the opentelemetry extension had, which was solved by https://github.com/apache/camel-quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/org/apache/camel/opentelemetry/CamelQuarkusOpenTelemetryTracer.java
Consider that in async components there can be spans that look similar but are different, as one is outgoing and the other incoming. Compare the EVENT_SENT and EVENT_RECEIVED to check whether they belong to different scopes or are really the same.
@squakez There is a regression from camel-quarkus-opentelemetry.
3.20 (6 traces, all with camel metadata):
3.26 (8 traces, including 2 from quarkus vert.x without camel metadata):
More than a regression, it is a change in the service used. I think that lately we're using camel-quarkus-opentelemetry2, which is different from camel-quarkus-opentelemetry. There could indeed be differences in the number of spans, and that is expected (unless there is some other bug). The vertx spans should correctly be part of the trace, as there is a newer context propagation mechanism that includes any 3rd party library involved in the process.
I'm talking about a regression in the camel-quarkus-observability-example, where 3.20 returned a different number of spans than 3.26 does. That is because 3.26 is missing a solution similar to https://github.com/apache/camel-quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/org/apache/camel/opentelemetry/CamelQuarkusOpenTelemetryTracer.java (excluding the default Quarkus vertx spans).
I.e. when I deleted the content of https://github.com/apache/camel-quarkus/blob/main/extensions/opentelemetry/runtime/src/main/java/org/apache/camel/opentelemetry/CamelQuarkusOpenTelemetryTracer.java locally, I could see 8 spans (the same as in 3.26). So even camel-quarkus-opentelemetry was able to consume a 3rd party library (quarkus vertx).
Thanks for clarifying. So, I understand that the older otel extension was patching the tracing behavior for some reason I don't know. However, with the newer implementation the context propagation is totally different and completely bound to Camel, so another kind of patch would be required here, as you are no longer able to peek at the otel Context. I am not sure what was the rationale to remove the Vertx spans, but, as a user, I'd be more keen to keep them and maintain consistently the entire trace according to how each third party process is treating it.
@squakez Thanks for the input.
Re: "I am not sure what was the rationale to remove the Vertx spans, but, as a user, I'd be more keen to keep them and maintain consistently the entire trace according to how each third party process is treating it."
I probably have a different opinion. Given that camel-quarkus-platform-http is tightly coupled to quarkus-vertx, we could/should control it to avoid duplication of spans. As a Camel (Quarkus) user I want to inspect the flow of Camel routes, and I don't see value in seeing two identical spans.
That's the point. IMO, they are not identical. One is generated by Camel, the other is generated by the dependency. The hint that they are different is even more evident with the new telemetry component: the Camel spans always carry the exchangeId, for instance, whilst the vertx one doesn't. Also, the vertx library exposes parameters that are specific to the http domain (client address, body size, ...), which may be interesting from an observability standpoint. If you suppress the spans of this dependency, you will end up needing the same approach for many more dependencies that add their spans to the trace as well.
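To make the distinction above concrete, here is a rough sketch in plain Java that separates Camel spans from third-party spans by checking for an exchange id attribute. The `SpanOrigin` class and the `exchangeId` attribute key are hypothetical placeholders (the real key exported by the component may differ), and spans are modelled as simple attribute maps rather than real OpenTelemetry objects:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Camel or OpenTelemetry.
final class SpanOrigin {
    // Assumed attribute key; check the exported spans for the actual name.
    static final String EXCHANGE_ID_KEY = "exchangeId";

    // A span is treated as Camel-generated when it carries an exchange id.
    static boolean isCamelSpan(Map<String, String> attributes) {
        return attributes.containsKey(EXCHANGE_ID_KEY);
    }

    // Keeps only the spans that look Camel-generated, preserving order.
    static List<Map<String, String>> camelOnly(List<Map<String, String>> spans) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> span : spans) {
            if (isCamelSpan(span)) {
                result.add(span);
            }
        }
        return result;
    }
}
```

A filter like this could be applied when reading a trace, which keeps the vertx spans in the exported data while still letting a user focus on the Camel route flow.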
@squakez Yes, they are not identical, but from the Camel point of view they should be. I use a Camel component (with the Quarkus runtime), so I want to see just Camel spans, as I don't consciously use any Quarkus components. If I see e.g. two spans for one incoming HTTP request, it is misleading and I may struggle to figure out why it is happening. For example, I cannot map the flow of spans to the route definition, as it doesn't match due to this issue.
Likely depends on https://github.com/quarkusio/quarkus/issues/50466
How do we propose to proceed here? The Quarkus issue is closed (maybe incorrectly?).
The bug is likely still there. I don't have any further time to dedicate to it. Feel free to do any patch on our extension, although honestly it does not sound like the best solution.
Re: "patch on our extension"
We probably need to do some work on the camel-opentelemetry2 component. I don't think this issue is unique to Camel Quarkus.
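E.g. on Spring Boot you could do:

    import org.apache.camel.ProducerTemplate;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class GreetingService {
        @Autowired
        ProducerTemplate producerTemplate;

        // Forwards the incoming HTTP request to the Camel direct route.
        @GetMapping("/greeting")
        public String getGreeting() {
            return producerTemplate.requestBody("direct:greet", null, String.class);
        }
    }
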
Similar to the scenarios above, you get disconnected spans. The HTTP request is separated from the tracing of the direct route.
It works ok with the older camel-opentelemetry component.
I think we have an e2e test covering the context propagation scenario [1]. The point is that, since we no longer rely on thread context propagation as the previous implementation did, any consumer should instead use the "traceparent" to link to the upstream trace. Next week I'll dedicate some time to taking the example provided above and analyzing this problem further.
[1] https://github.com/apache/camel/blob/main/components/camel-opentelemetry2/src/test/java/org/apache/camel/opentelemetry2/SpanPropagationTest.java
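For readers unfamiliar with the "traceparent" header mentioned above: in the W3C Trace Context format it has the shape `version-traceid-parentid-flags`. Below is a minimal sketch of parsing it in plain Java, with no Camel or OpenTelemetry dependency. The `TraceParent` class is a hypothetical name used for illustration, and the pattern only accepts the four-field layout of version 00:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Camel or OpenTelemetry.
final class TraceParent {
    // 2-hex version, 32-hex trace id, 16-hex parent (span) id, 2-hex flags.
    private static final Pattern FORMAT =
            Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final String ZERO_TRACE_ID = "00000000000000000000000000000000";
    private static final String ZERO_PARENT_ID = "0000000000000000";

    final String traceId;
    final String parentId;
    final boolean sampled;

    private TraceParent(String traceId, String parentId, boolean sampled) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    // Returns null when the header is missing or malformed, in which case
    // the receiver has nothing to link to and would start a new trace.
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches() || "ff".equals(m.group(1))) {
            return null; // version ff is invalid
        }
        String traceId = m.group(2);
        String parentId = m.group(3);
        if (ZERO_TRACE_ID.equals(traceId) || ZERO_PARENT_ID.equals(parentId)) {
            return null; // all-zero ids are invalid
        }
        // The lowest bit of the flags field is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(4), 16) & 0x01) == 1;
        return new TraceParent(traceId, parentId, sampled);
    }
}
```

When `parse` returns null there is no upstream trace to attach to, which is the situation that shows up as disconnected traces in this thread.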
Hi folks. I think we have several issues mixed here. The one under our control in core was reported in https://issues.apache.org/jira/browse/CAMEL-22648. In certain telemetry implementations we fail to propagate the traceparent when it does not already exist (i.e. it was not passed by the user). In those circumstances the traces are disconnected. It's a camel core problem, I'm fixing it right now.
Thanks for reporting, and let's see if the fix we can do in core is enough to also fix the issue reported here.
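The idea of propagating a traceparent even when the caller did not supply one can be sketched as follows. This is not the actual change made for CAMEL-22648; it is an illustrative sketch in plain Java where a `Map` stands in for message headers and `TraceParentPropagation` is a hypothetical name:

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration only; not the Camel core fix.
final class TraceParentPropagation {
    static final String HEADER = "traceparent";

    // Generates byteCount random bytes as lowercase hex, never all zeros.
    static String randomHex(int byteCount) {
        byte[] buf = new byte[byteCount];
        boolean allZero = true;
        while (allZero) {
            ThreadLocalRandom.current().nextBytes(buf);
            for (byte b : buf) {
                if (b != 0) {
                    allZero = false;
                    break;
                }
            }
        }
        StringBuilder sb = new StringBuilder(byteCount * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Keeps an existing traceparent; otherwise creates one so that
    // downstream consumers can link their spans to the same trace.
    static String ensureTraceParent(Map<String, String> headers) {
        String existing = headers.get(HEADER);
        if (existing != null && !existing.trim().isEmpty()) {
            return existing;
        }
        // version 00, 16-byte trace id, 8-byte span id, sampled flag set
        String created = "00-" + randomHex(16) + "-" + randomHex(8) + "-01";
        headers.put(HEADER, created);
        return created;
    }
}
```

With something along these lines, a producer that starts a trace itself still hands a usable traceparent to the next hop instead of leaving the header empty.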
Hello. I've done a deep analysis of the problem and have come to the conclusion that this is not a bug on our side. All details are in https://issues.apache.org/jira/browse/CAMEL-22648
Let me give some more explanation here. The older camel-opentelemetry implementation was connecting spans (although it was affected by inconsistency problems when running in async mode) because, since it was running on the same Thread, it was able to link the current context generated by the vertx library (via the otel agent) to the ones we were using in Camel.
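The same-thread behaviour described above can be illustrated with a plain `ThreadLocal`. This is only an analogy, assuming a thread-bound context; `ThreadBoundContextDemo` and `CURRENT_SPAN` are hypothetical names, not the OpenTelemetry or Camel API:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo for illustration only; stands in for a thread-bound tracing context.
final class ThreadBoundContextDemo {
    static final ThreadLocal<String> CURRENT_SPAN = new ThreadLocal<>();

    // Code running on the same thread sees the context set by the caller.
    static String readOnSameThread() {
        return CURRENT_SPAN.get();
    }

    // Code running on another thread does not see it unless something
    // copies the context across explicitly.
    static String readOnWorkerThread() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = CURRENT_SPAN::get;
            return pool.submit(task).get();
        } finally {
            pool.shutdown();
        }
    }
}
```

After `CURRENT_SPAN.set("http-span")`, `readOnSameThread()` returns the value while `readOnWorkerThread()` returns null, which mirrors why linking worked on a single thread but became inconsistent in async mode.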
The camel-opentelemetry2 component has adopted a different design in order to prevent those inconsistency issues, which were raised in several Jira tickets. We discussed this new design on the ML and in a PR. Unfortunately the otel agent is what is really responsible for this problem, as it is not doing the expected job (see more in the Jira issue) of propagating the trace downstream via the W3C trace context, which is how the component expects to receive the upstream trace (according to the W3C Trace Context specification).
You can decide if you want to do any patch, or whatever you think is convenient for the project, but I hope I gave enough arguments to accept that this is not a bug on our side (it's indeed a bug in the agent). FYI, I am already trying to fix this problem in the agent and will let you know how it goes, as in that case it should clear this issue.
Re: "fix this problem on the agent"
Unfortunately, otel integration with Quarkus is typically agentless. We discourage use of the agent:
https://camel.apache.org/camel-quarkus/3.27.x/reference/extensions/opentelemetry2.html#extensions-opentelemetry2-usage
So maybe as hinted at here we need to open an enhancement request.
Yes. From a Camel maintainer's perspective, we cannot do things differently from the standard specification. From my side, I'm trying to fix the agent so that this is available generally, regardless of the runtime (hence proving that the code in Camel is consistent and follows the specification, see [1]). I don't have enough knowledge of the Quarkus runtime. I did report the problem, so if the team wants to follow up with a fix or a new feature that follows the standard, it will eventually work with the Camel implementation.
[1] https://github.com/apache/camel/blob/main/proposals/tracing.adoc#context-propagation