Regression: Programmatic OpenTelemetry spans get ignored
Describe the bug
NOTE: This is a regression in Quarkus 2.13.0.CR1. In previous releases this has always worked as expected.
Background: In our application we store the OpenTelemetry context of incoming REST calls and later use it as the parent of spans created in calls triggered by a schedule. The scheduled method does this by programmatically creating a new span with the original span's Context as parent and then activating it. When a RESTEasy Reactive Client call is then made to some API within this span, the corresponding CLIENT span ignores the programmatically created span and in fact has no parent span at all. In previous releases the CLIENT span would reference the programmatically created span as its parent.
The programmatic logic to create a span looks something like this (see org.acme.MyScheduler#scheduled() in the attached test for details):
// create a new span with the stored request Context as parent and activate it
Span span = tracer.spanBuilder("scheduler").setParent(context).startSpan();
try (Scope ignore = context.makeCurrent(); Scope ignore1 = span.makeCurrent()) {
    // the CLIENT span created for this call is expected to have "scheduler" as its parent
    response = client.hello(addressee);
    span.end();
}
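For completeness, a minimal, hypothetical sketch of how the incoming request's Context could be captured for later use by the scheduler (the storedContext field and the resource shown here are purely illustrative; see the attached reproducer for the actual code):

import io.opentelemetry.context.Context;
import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("/hello")
public class GreetingResource {

    // Illustrative only: holds the Context of the last incoming request so the
    // scheduler can later use it as the parent of its programmatic span.
    static volatile Context storedContext;

    @GET
    public String hello() {
        storedContext = Context.current(); // capture the SERVER span's context
        return "Hello from RESTEasy Reactive";
    }
}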
Expected behavior
Programmatic creation of spans should work as it did in previous Quarkus releases.
Actual behavior
See description.
Here are the spans (serialized JSON) created by the attached example project:
{
"resource": {
"attributes": [
{
"key": "service.name",
"value": {
"stringValue": "code-with-quarkus"
}
},
{
"key": "service.version",
"value": {
"stringValue": "1.0.0-SNAPSHOT"
}
},
{
"key": "telemetry.sdk.language",
"value": {
"stringValue": "java"
}
},
{
"key": "telemetry.sdk.name",
"value": {
"stringValue": "opentelemetry"
}
},
{
"key": "telemetry.sdk.version",
"value": {
"stringValue": "1.17.0"
}
},
{
"key": "webengine.name",
"value": {
"stringValue": "Quarkus"
}
},
{
"key": "webengine.version",
"value": {
"stringValue": "2.13.0.CR1"
}
}
]
},
"scopeSpans": [
{
"scope": {
"name": "io.quarkus.opentelemetry"
},
"spans": [
{
"traceId": "62ffe5f3278ee2dbefabc499f50037e2",
"spanId": "5f78270f084c18c7",
"parentSpanId": "a49e8f9f5b8e4d77",
"name": "/hello",
"kind": "SPAN_KIND_SERVER",
"startTimeUnixNano": "1663341640147360000",
"endTimeUnixNano": "1663341640191058401",
"attributes": [
{
"key": "http.response_content_length",
"value": {
"intValue": "28"
}
},
{
"key": "http.route",
"value": {
"stringValue": "/hello"
}
},
{
"key": "http.user_agent",
"value": {
"stringValue": "Apache-HttpClient/4.5.13 (Java/17.0.4)"
}
},
{
"key": "http.flavor",
"value": {
"stringValue": "1.1"
}
},
{
"key": "http.target",
"value": {
"stringValue": "/hello"
}
},
{
"key": "http.status_code",
"value": {
"intValue": "200"
}
},
{
"key": "http.client_ip",
"value": {
"stringValue": "127.0.0.1"
}
},
{
"key": "http.scheme",
"value": {
"stringValue": "http"
}
},
{
"key": "http.host",
"value": {
"stringValue": "localhost:8081"
}
},
{
"key": "http.method",
"value": {
"stringValue": "GET"
}
}
],
"events": [],
"links": [],
"status": {}
},
{
"traceId": "3042707be1f4a888f1f52c1fd49f5520",
"spanId": "56cbdfd8da10b8b3",
"parentSpanId": "f2d8926f2e017182",
"name": "/hello/internal",
"kind": "SPAN_KIND_SERVER",
"startTimeUnixNano": "1663341641109469200",
"endTimeUnixNano": "1663341641115058800",
"attributes": [
{
"key": "http.response_content_length",
"value": {
"intValue": "11"
}
},
{
"key": "http.route",
"value": {
"stringValue": "/hello/internal"
}
},
{
"key": "http.user_agent",
"value": {
"stringValue": "Resteasy Reactive Client"
}
},
{
"key": "http.flavor",
"value": {
"stringValue": "1.1"
}
},
{
"key": "http.target",
"value": {
"stringValue": "/hello/internal"
}
},
{
"key": "http.status_code",
"value": {
"intValue": "200"
}
},
{
"key": "http.client_ip",
"value": {
"stringValue": "127.0.0.1"
}
},
{
"key": "http.scheme",
"value": {
"stringValue": "http"
}
},
{
"key": "http.host",
"value": {
"stringValue": "localhost:8081"
}
},
{
"key": "http.method",
"value": {
"stringValue": "GET"
}
},
{
"key": "http.request_content_length",
"value": {
"intValue": "5"
}
}
],
"events": [],
"links": [],
"status": {}
},
{
"traceId": "3042707be1f4a888f1f52c1fd49f5520",
"spanId": "f2d8926f2e017182",
"name": "HTTP GET",
"kind": "SPAN_KIND_CLIENT",
"startTimeUnixNano": "1663341641105501300",
"endTimeUnixNano": "1663341641119615600",
"attributes": [
{
"key": "http.response_content_length",
"value": {
"intValue": "11"
}
},
{
"key": "http.user_agent",
"value": {
"stringValue": "Resteasy Reactive Client"
}
},
{
"key": "http.status_code",
"value": {
"intValue": "200"
}
},
{
"key": "http.method",
"value": {
"stringValue": "GET"
}
},
{
"key": "http.url",
"value": {
"stringValue": "http://localhost:8081/hello/internal"
}
}
],
"events": [],
"links": [],
"status": {}
},
{
"traceId": "62ffe5f3278ee2dbefabc499f50037e2",
"spanId": "10611f1988f86ede",
"parentSpanId": "5f78270f084c18c7",
"name": "scheduler",
"kind": "SPAN_KIND_INTERNAL",
"startTimeUnixNano": "1663341641000663901",
"endTimeUnixNano": "1663341641123223601",
"attributes": [],
"events": [],
"links": [],
"status": {}
}
]
}
]
}
How to Reproduce?
Run mvn verify in the attached test application: code-with-quarkus.zip
Output of uname -a or ver
No response
Output of java -version
No response
GraalVM version (if different from Java)
No response
Quarkus version or git rev
2.13.0.CR1
Build tool (ie. output of mvnw --version or gradlew --version)
No response
Additional information
No response
/cc @brunobat, @radcortez
It would be cool if this could be investigated and fixed for the 2.13.0.Final release.
I agree, this seems like an annoying regression. @brunobat any chance you could take a look for 2.13.0.Final?
Will do some research and see if it's possible.
The 2.13.0.Final core artifacts will be built tomorrow (I can wait until noon but not much more), so either we have a fix today or this will have to wait for .1.
Wondering if this could be related to the context work.
Most likely related with https://github.com/quarkusio/quarkus/pull/27443
/cc @mkouba @cescoffier for your awareness.
The rest client may not be able to find the parent span if it's not added to the duplicated context. Because the call comes from a scheduled method, it most likely does not run on one.
Note that I doubt that the previous behavior was correct under load. I can see a lot of reasons (with the previous implementation) where things would have been messed up.
Anyway, if the hypothesis from above is correct, we would need to know where the OTel context is stored in this case (please do not tell me in a thread local of the event loop), and how the rest client can retrieve it. Then we would copy that context into the duplicated context of the client.
I'm not an expert in tracing and the rest client but the use case seems weird to me. A @Scheduled method is definitely not a good match (and definitely not an efficient solution) if you need to delay the execution of a client call. Also, @Scheduled methods are executed on a blocking thread by default. You might try to use a non-blocking scheduled method instead.
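For illustration, a minimal sketch of what such a non-blocking scheduled method could look like (the helloAsync method returning Uni<String> is a hypothetical async variant of the REST client interface, and the interval is arbitrary):

import io.quarkus.scheduler.Scheduled;
import io.smallrye.mutiny.Uni;

@Scheduled(every = "10s")
Uni<Void> scheduledNonBlocking() {
    // Returning a Uni makes the scheduler treat the method as non-blocking,
    // so it runs on the event loop rather than on a blocking worker thread.
    return client.helloAsync("world").replaceWithVoid();
}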
In the reproducer, if I do this:
io.vertx.core.Context c = VertxContext.getOrCreateDuplicatedContext(vertx.getOrCreateContext());
c.executeBlocking(p -> {
    // same logic as before, but now executed on a Vert.x duplicated context
    Span span = tracer.spanBuilder("scheduler").setParent(context).startSpan();
    try (Scope ignore = context.makeCurrent(); Scope ignore1 = span.makeCurrent()) {
        response = client.hello(addressee);
        span.end();
    }
    p.complete();
});
I now get 4 spans instead of 3. So, there might be a merge issue, too (@radcortez).
PS: ^ is not a proper fix, it's just something I did to investigate.
I now get 4 spans instead of 3. So, there might be a merge issue, too (@radcortez).
@cescoffier 4 spans is actually correct. I just double-checked with Quarkus 2.12.2. I should have verified that first. I will update the reproducer accordingly. Sorry about that!
I'm not an expert in tracing and the rest client but the use case seems weird to me. A @Scheduled method is definitely not a good match (and definitely not an efficient solution) if you need to delay the execution of a client call. Also, @Scheduled methods are executed on a blocking thread by default. You might try to use a non-blocking scheduled method instead.
Just as background: We have a Quarkus extension which implements a variation of the transactional outbox pattern: A transaction (typically in the context of a REST request) can call a method which inserts a row (with a reference to a Java class) into the database. The row is thus only persisted if the transaction completes successfully. The scheduled method then processes these rows and calls a method on the class referenced by each row. This method can do anything; in our case it will typically call some other REST service.
Thanks @knutwannheden !
Ok, so my hypothesis from above is correct. I need @brunobat's and @radcortez's brains to better understand where we could fix this case.
The workaround from above, even if ugly, would work.
Scheduled some time for later, just in case.
The workaround from above, even if ugly, would work.
Workarounds often have this characteristic :-) But thanks a lot for investigating and this workaround. We will certainly be able to wait for a proper solution in a patch release.
@knutwannheden We are going to implement a fix for the OTel context that starts on a scheduler thread. However, I would like to point out that storing the parent context of an already served request for later use as the parent of spans created by async tasks should not be done. I recommend that when you start the new spans on the scheduler you set a LINK to the original REST request instead. See: https://opentelemetry.io/docs/concepts/signals/traces/#span-links
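For illustration, a minimal sketch of creating such a linked span (assuming the original request's Context was stored somewhere, here called storedRequestContext; the names are illustrative):

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;

// Link to the original request's span instead of making it the parent:
SpanContext original = Span.fromContext(storedRequestContext).getSpanContext();
Span span = tracer.spanBuilder("scheduler")
        .addLink(original)
        .startSpan();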
@brunobat That is great news. Under normal circumstances the delayed operation should be picked up in a very timely manner, but we have indeed also considered creating linked spans. We will look into this again.
Further, I believe Kafka would by default do the same for incoming messages (i.e. create child spans), even though that could also happen a lot later. But that is of course another topic.
In fact I think we have an on-success hook (I forget what it is called right now) with which we immediately attempt to execute the registered operation. While we were still using OpenTracing we created a FOLLOWS_FROM span (part of the same trace). This concept doesn't exist in OpenTelemetry, but we should again look into linked spans.