
Regression: Programmatic OpenTelemetry spans get ignored

Open knutwannheden opened this issue 2 years ago • 14 comments

Describe the bug

NOTE: This is a regression in Quarkus 2.13.0.CR1. In previous releases this has always worked as expected.

Background: In our application we store the OpenTelemetry context of incoming REST calls and later use it as the parent of calls triggered by a scheduler. The scheduled method does this by programmatically creating a new span with the original span's Context as parent and then activating that span. When a RESTEasy Reactive Client call is subsequently made to some API within this span, the corresponding CLIENT span ignores the programmatically created span and in fact has no parent at all. In previous releases the CLIENT span referenced the programmatically created span as its parent.

The programmatic logic to create a span looks something like this (see org.acme.MyScheduler#scheduled() in the attached test application for details):

// "context" is the stored OpenTelemetry Context of the original incoming REST request
Span span = tracer.spanBuilder("scheduler").setParent(context).startSpan();
try (Scope ignore = context.makeCurrent(); Scope ignore1 = span.makeCurrent()) {
    // the CLIENT span created for this call is expected to have "scheduler" as its parent
    response = client.hello(addressee);
    span.end();
}
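
The capturing side is not shown above; per the description it amounts to storing Context.current() while the incoming REST request is being served. A minimal sketch, with illustrative names that are not taken from the attached reproducer:

import io.opentelemetry.context.Context;

import javax.ws.rs.GET;
import javax.ws.rs.Path;

@Path("/hello")
public class GreetingResource {

    // Simplistic storage of the incoming request's OTel context (illustrative only)
    static volatile Context lastRequestContext;

    @GET
    public String hello() {
        lastRequestContext = Context.current(); // context of the active SERVER span
        return "Hello from RESTEasy Reactive";
    }
}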

Expected behavior

Programmatic creation of spans should work as it did in previous Quarkus releases.

Actual behavior

See description.

Here are the spans (serialized JSON) created by the attached example project:

{
  "resource": {
    "attributes": [
      {
        "key": "service.name",
        "value": {
          "stringValue": "code-with-quarkus"
        }
      },
      {
        "key": "service.version",
        "value": {
          "stringValue": "1.0.0-SNAPSHOT"
        }
      },
      {
        "key": "telemetry.sdk.language",
        "value": {
          "stringValue": "java"
        }
      },
      {
        "key": "telemetry.sdk.name",
        "value": {
          "stringValue": "opentelemetry"
        }
      },
      {
        "key": "telemetry.sdk.version",
        "value": {
          "stringValue": "1.17.0"
        }
      },
      {
        "key": "webengine.name",
        "value": {
          "stringValue": "Quarkus"
        }
      },
      {
        "key": "webengine.version",
        "value": {
          "stringValue": "2.13.0.CR1"
        }
      }
    ]
  },
  "scopeSpans": [
    {
      "scope": {
        "name": "io.quarkus.opentelemetry"
      },
      "spans": [
        {
          "traceId": "62ffe5f3278ee2dbefabc499f50037e2",
          "spanId": "5f78270f084c18c7",
          "parentSpanId": "a49e8f9f5b8e4d77",
          "name": "/hello",
          "kind": "SPAN_KIND_SERVER",
          "startTimeUnixNano": "1663341640147360000",
          "endTimeUnixNano": "1663341640191058401",
          "attributes": [
            {
              "key": "http.response_content_length",
              "value": {
                "intValue": "28"
              }
            },
            {
              "key": "http.route",
              "value": {
                "stringValue": "/hello"
              }
            },
            {
              "key": "http.user_agent",
              "value": {
                "stringValue": "Apache-HttpClient/4.5.13 (Java/17.0.4)"
              }
            },
            {
              "key": "http.flavor",
              "value": {
                "stringValue": "1.1"
              }
            },
            {
              "key": "http.target",
              "value": {
                "stringValue": "/hello"
              }
            },
            {
              "key": "http.status_code",
              "value": {
                "intValue": "200"
              }
            },
            {
              "key": "http.client_ip",
              "value": {
                "stringValue": "127.0.0.1"
              }
            },
            {
              "key": "http.scheme",
              "value": {
                "stringValue": "http"
              }
            },
            {
              "key": "http.host",
              "value": {
                "stringValue": "localhost:8081"
              }
            },
            {
              "key": "http.method",
              "value": {
                "stringValue": "GET"
              }
            }
          ],
          "events": [],
          "links": [],
          "status": {}
        },
        {
          "traceId": "3042707be1f4a888f1f52c1fd49f5520",
          "spanId": "56cbdfd8da10b8b3",
          "parentSpanId": "f2d8926f2e017182",
          "name": "/hello/internal",
          "kind": "SPAN_KIND_SERVER",
          "startTimeUnixNano": "1663341641109469200",
          "endTimeUnixNano": "1663341641115058800",
          "attributes": [
            {
              "key": "http.response_content_length",
              "value": {
                "intValue": "11"
              }
            },
            {
              "key": "http.route",
              "value": {
                "stringValue": "/hello/internal"
              }
            },
            {
              "key": "http.user_agent",
              "value": {
                "stringValue": "Resteasy Reactive Client"
              }
            },
            {
              "key": "http.flavor",
              "value": {
                "stringValue": "1.1"
              }
            },
            {
              "key": "http.target",
              "value": {
                "stringValue": "/hello/internal"
              }
            },
            {
              "key": "http.status_code",
              "value": {
                "intValue": "200"
              }
            },
            {
              "key": "http.client_ip",
              "value": {
                "stringValue": "127.0.0.1"
              }
            },
            {
              "key": "http.scheme",
              "value": {
                "stringValue": "http"
              }
            },
            {
              "key": "http.host",
              "value": {
                "stringValue": "localhost:8081"
              }
            },
            {
              "key": "http.method",
              "value": {
                "stringValue": "GET"
              }
            },
            {
              "key": "http.request_content_length",
              "value": {
                "intValue": "5"
              }
            }
          ],
          "events": [],
          "links": [],
          "status": {}
        },
        {
          "traceId": "3042707be1f4a888f1f52c1fd49f5520",
          "spanId": "f2d8926f2e017182",
          "name": "HTTP GET",
          "kind": "SPAN_KIND_CLIENT",
          "startTimeUnixNano": "1663341641105501300",
          "endTimeUnixNano": "1663341641119615600",
          "attributes": [
            {
              "key": "http.response_content_length",
              "value": {
                "intValue": "11"
              }
            },
            {
              "key": "http.user_agent",
              "value": {
                "stringValue": "Resteasy Reactive Client"
              }
            },
            {
              "key": "http.status_code",
              "value": {
                "intValue": "200"
              }
            },
            {
              "key": "http.method",
              "value": {
                "stringValue": "GET"
              }
            },
            {
              "key": "http.url",
              "value": {
                "stringValue": "http://localhost:8081/hello/internal"
              }
            }
          ],
          "events": [],
          "links": [],
          "status": {}
        },
        {
          "traceId": "62ffe5f3278ee2dbefabc499f50037e2",
          "spanId": "10611f1988f86ede",
          "parentSpanId": "5f78270f084c18c7",
          "name": "scheduler",
          "kind": "SPAN_KIND_INTERNAL",
          "startTimeUnixNano": "1663341641000663901",
          "endTimeUnixNano": "1663341641123223601",
          "attributes": [],
          "events": [],
          "links": [],
          "status": {}
        }
      ]
    }
  ]
}

How to Reproduce?

Run mvn verify in the attached test application: code-with-quarkus.zip

Output of uname -a or ver

No response

Output of java -version

No response

GraalVM version (if different from Java)

No response

Quarkus version or git rev

2.13.0.CR1

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

knutwannheden avatar Sep 16 '22 15:09 knutwannheden

/cc @brunobat, @radcortez

quarkus-bot[bot] avatar Sep 16 '22 15:09 quarkus-bot[bot]

It would be cool if this could be investigated and fixed for the 2.13.0.Final release.

knutwannheden avatar Sep 16 '22 15:09 knutwannheden

I agree, this seems like an annoying regression. @brunobat any chance you could take a look for 2.13.0.Final?

geoand avatar Sep 20 '22 05:09 geoand

Will do some research and see if it's possible.

brunobat avatar Sep 20 '22 08:09 brunobat

The 2.13.0.Final core artifacts will be built tomorrow (I can wait until noon but not much more) so either we have a fix today or this will have to wait for .1.

Wondering if this could be related to the context work.

gsmet avatar Sep 20 '22 08:09 gsmet

Most likely related to https://github.com/quarkusio/quarkus/pull/27443

radcortez avatar Sep 20 '22 08:09 radcortez

/cc @mkouba @cescoffier for your awareness.

gsmet avatar Sep 20 '22 09:09 gsmet

The REST client may not be able to find the parent span if it is not added to the duplicated context. Because the call originates from a scheduled method, it most likely does not run on one.

Note that I doubt the previous behavior was correct under load. I can see plenty of ways (with the previous implementation) in which things could have gotten mixed up.

Anyway, if the hypothesis above is correct, we would need to know where the OTel context is stored in this case (please do not tell me it is in a thread local of the event loop) and how the REST client can retrieve it. Then we would copy that context into the duplicated context of the client.

cescoffier avatar Sep 20 '22 09:09 cescoffier

I'm not an expert in tracing and the REST client, but the use case seems weird to me. A @Scheduled method is definitely not a good match (and definitely not an efficient solution) if you need to delay the execution of a client call. Also, @Scheduled methods are executed on a blocking thread by default. You might try a non-blocking scheduled method instead.
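
For reference, a minimal sketch of a non-blocking scheduled method (illustrative names; a scheduled method returning Uni<Void> is treated as non-blocking, so it is not executed on a worker thread):

import io.quarkus.scheduler.Scheduled;
import io.smallrye.mutiny.Uni;

import javax.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class MyNonBlockingScheduler {

    // Returning Uni<Void> marks the scheduled method as non-blocking
    @Scheduled(every = "10s")
    Uni<Void> scheduled() {
        // ... trigger the reactive client call here ...
        return Uni.createFrom().voidItem();
    }
}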

mkouba avatar Sep 20 '22 09:09 mkouba

In the reproducer, if I do that:

io.vertx.core.Context c = VertxContext.getOrCreateDuplicatedContext(vertx.getOrCreateContext());
c.executeBlocking(p -> {
    Span span = tracer.spanBuilder("scheduler").setParent(context).startSpan();
    try (Scope ignore = context.makeCurrent(); Scope ignore1 = span.makeCurrent()) {
        response = client.hello(addressee);
        span.end();
    }
    p.complete();
});

I now get 4 spans instead of 3. So, there might be a merge issue, too (@radcortez ).

PS: ^ is not a proper fix, it's just something I did to investigate.

cescoffier avatar Sep 20 '22 09:09 cescoffier

I now get 4 spans instead of 3. So, there might be a merge issue, too (@radcortez ).

@cescoffier 4 spans is actually correct. I just double-checked with Quarkus 2.12.2. I should have verified that first. I will update the reproducer accordingly. Sorry about that!

I'm not an expert in tracing and the REST client, but the use case seems weird to me. A @Scheduled method is definitely not a good match (and definitely not an efficient solution) if you need to delay the execution of a client call. Also, @Scheduled methods are executed on a blocking thread by default. You might try a non-blocking scheduled method instead.

Just as background: we have a Quarkus extension which implements a variation of the transactional outbox pattern: a transaction (typically in the context of a REST request) can call a method which inserts a row (with a reference to a Java class) into the database, so the row is only persisted if the transaction completes successfully. The scheduled method then processes these rows and calls a method on the class referenced by each row. That method can do anything; in our case it typically calls some other REST service.

knutwannheden avatar Sep 20 '22 11:09 knutwannheden

Thanks @knutwannheden !

OK, so my hypothesis from above is correct. I need @brunobat's and @radcortez's brains to better understand where we could fix this case.

The workaround from above, even if ugly, would work.

cescoffier avatar Sep 20 '22 12:09 cescoffier

Scheduled some time for later, just in case.

brunobat avatar Sep 20 '22 12:09 brunobat

The workaround from above, even if ugly, would work.

Workarounds often have this characteristic :-) But thanks a lot for investigating and for the workaround. We will certainly be able to wait for a proper solution in a patch release.

knutwannheden avatar Sep 20 '22 12:09 knutwannheden

@knutwannheden We are going to implement a fix for the OTel context that starts on a scheduler thread. However, I would like to point out that storing the context of an already served request for later use as the parent of spans created by async tasks should not be done. I recommend that when you start the new spans on the scheduler you add a LINK to the original REST request instead. See: https://opentelemetry.io/docs/concepts/signals/traces/#span-links
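
A minimal sketch of that link-based approach (illustrative; it assumes the SpanContext of the original request span was stored, and reuses client and addressee from the snippet in the issue description):

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

String callWithLink(Tracer tracer, SpanContext storedRequestSpanContext) {
    // Start a new root span (its own trace) and link it to the original request
    // span instead of parenting it under a request that has already completed.
    Span span = tracer.spanBuilder("scheduler")
            .setNoParent()                      // new trace for the async work
            .setSpanKind(SpanKind.INTERNAL)
            .addLink(storedRequestSpanContext)  // link back to the REST request span
            .startSpan();
    try (Scope ignored = span.makeCurrent()) {
        return client.hello(addressee);
    } finally {
        span.end();
    }
}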

brunobat avatar Sep 23 '22 08:09 brunobat

@brunobat That is great news. Under normal circumstances the delayed operation should be picked up in a very timely manner, but we have indeed also considered creating linked spans. We will look into this again.

Further, I believe Kafka by default does the same for incoming messages (i.e. creates child spans), even though that could also happen a lot later. But that is of course another topic.

knutwannheden avatar Sep 23 '22 10:09 knutwannheden

In fact I think we have an on-success hook (I forget what it is called right now) with which we immediately attempt to execute the registered operation. While we were still using OpenTracing we created a FOLLOWS_FROM span (part of the same trace). That concept doesn't exist in OpenTelemetry, but we should look into linked spans again.

knutwannheden avatar Sep 23 '22 10:09 knutwannheden