spring-cloud-sleuth icon indicating copy to clipboard operation
spring-cloud-sleuth copied to clipboard

Spans opened by reactive WebClient .retry() are not nested correctly

Open archie-swif opened this issue 7 years ago • 15 comments

Preface I have two Spring WebFlux apps with configured Sleuth and Jaeger server. One app has an /error functional endpoint that returns 500 error. Other has a /test functional endpoint that sends a request to the /error. I'm using a Spring reactive WebClient to send a http request. With a .retry(...) added to the web client mono chain.

Description App ONE sends a request to app TWO /error endpoint, gets a 500 error, and makes 3 more retry requests.

The problem is that TWO's spans opened by retry are not nested under corresponding ONE's web-client individual request spans. By the timeline you can see which TWO belongs to which ONE span.

jeager

Resources Code: https://github.com/archie-swif/sleuth-webclient-retry Run two instances of the application, one with 'one' spring profile, another with 'two'.

ONE's log: https://gist.github.com/archie-swif/2dfbf28c0837830cd75ad68c069079d8 TWO's log: https://gist.github.com/archie-swif/c0103c6e695c01d8c6786a6940335ea5

Trace JSON from Jaeger: https://gist.github.com/archie-swif/d2453c556272371fd5961a4bc243cccb

archie-swif avatar Nov 01 '18 21:11 archie-swif

using jaeger is unnecessary for an example like this and also makes it harder to troubleshoot. sleuth emits spans in zipkin v2 format then jaeger converts to something else. can you please post the actual json sent? easiest way to do this is run your same code against zipkin. the problem should not be backend specific this is just to remove unnecessary distractions

On Fri, 2 Nov 2018, 06:12 Artem Ryabokon, [email protected] wrote:

Preface I have two Spring WebFlux apps with configured Sleuth and Jaeger server. One app has an /error functional endpoint that returns 500 error. Other has a /test functional endpoint that sends a request to the /error. I'm using a Spring reactive WebClient to send a http request. With a .retry(...) added to the web client mono chain.

Description App ONE sends a request to app TWO /error endpoint, gets a 500 error, and makes 3 more retry requests.

The problem is that TWO's spans opened by retry are not nested under ONE's web-client individual request spans. They are nested under TWO's first error span. First TWO's span and TWO's spans without retry are nested correctly.

[image: jeager] https://user-images.githubusercontent.com/2158707/47879564-28b18180-ddde-11e8-85b9-ec245b270974.png

Resources Code: https://github.com/archie-swif/sleuth-webclient-retry Run two instances of the application, one with 'one' spring profile, another with 'two'.

ONE's log: https://gist.github.com/archie-swif/2dfbf28c0837830cd75ad68c069079d8 TWO's log: https://gist.github.com/archie-swif/c0103c6e695c01d8c6786a6940335ea5

Trace JSON from Jaeger: https://gist.github.com/archie-swif/d2453c556272371fd5961a4bc243cccb

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/spring-cloud/spring-cloud-sleuth/issues/1121, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD619RgEEvMjN7F_nMn3-pMSkTGU1eCks5uq2PJgaJpZM4YJSjm .

codefromthecrypt avatar Nov 02 '18 01:11 codefromthecrypt

@adriancole Sure, no problem.

Trace JSON: https://gist.github.com/archie-swif/81ea16a55ae9dccf3f2505257f2beac4 Logs: https://gist.github.com/archie-swif/2e98e07fb667ff391ed36e5b2402a211

That's how it looks in Zipkin. There is only a single TWO span for some reason, for the exact scenario. screen shot 2018-11-01 at 7 14 02 pm

archie-swif avatar Nov 02 '18 02:11 archie-swif

In the JSON I see that there are 4 get records for app TWO, with the same Span Id ddc5003cea7888a7

  {
    "traceId": "bbf04ad17fce973c",
    "parentId": "bbf04ad17fce973c",
    "id": "ddc5003cea7888a7",
    "kind": "SERVER",
    "name": "get",
    "timestamp": 1541124521469143,
    "duration": 3361,
    "localEndpoint": {
      "serviceName": "two",
      "ipv4": "172.31.99.80"
    },
    "tags": {
      "error": "500",
      "http.method": "GET",
      "http.path": "/error",
      "http.status_code": "500"
    },
    "shared": true
  },

For the app ONE, there is one get record with the same Span Id ddc5003cea7888a7 And three more get records with unique Span Ids.

screen shot 2018-11-01 at 9 51 03 pm

archie-swif avatar Nov 02 '18 04:11 archie-swif

I would expect get TWO records to be nested under get ONE's Span, this way:

screen shot 2018-11-01 at 10 27 11 pm

Expected JSON: https://gist.github.com/archie-swif/5170bfd5118cf0b66e128607aea0727b

Imported in Zipkin: screen shot 2018-11-01 at 10 47 30 pm

archie-swif avatar Nov 02 '18 05:11 archie-swif

Can you try with latest 2.1.0 snapshots?

marcingrzejszczak avatar Nov 06 '18 15:11 marcingrzejszczak

It looks better with 2.1.0.BUILD-SNAPSHOT spring-boot-starter-parent. But still has some problems. screen shot 2018-11-06 at 11 14 18 am

Trace JSON: https://gist.github.com/archie-swif/513bcf9cc97a87938f2491bca0d950c8 Logs: https://gist.github.com/archie-swif/10282e2039c232b9b2889028e06ffd5f

screen shot 2018-11-06 at 10 58 58 am

archie-swif avatar Nov 06 '18 19:11 archie-swif

So, 2.1.0 snapshots did not help with for this issue

archie-swif avatar Nov 06 '18 19:11 archie-swif

However I see that Span IDs in TWO's logs are now unique Let me take a closer look

archie-swif avatar Nov 06 '18 19:11 archie-swif

With 2.1.0 Snapshot, ID assignment has changed. Now each WebClient GET call on app ONE opens a span with unique id. And app TWO code is executed in the span with those id.

Still, I would expect a new span opened when app TWO code is executed. And this span be nested under app ONE's WebClient GET span. Please correct me if I'm wrong with this. Thank you!

screen shot 2018-11-06 at 1 45 26 pm

archie-swif avatar Nov 06 '18 21:11 archie-swif

Maybe I don't understand sth but when I look at the graph right now, that seems to be ok. Each call to two has a unique ID

image

Can you sketch how it should look like in your opinion?

marcingrzejszczak avatar Jan 06 '19 16:01 marcingrzejszczak

There was an expected picture bit earlier, with two “get” bars under each “async”. https://github.com/spring-cloud/spring-cloud-sleuth/issues/1121#issuecomment-435275230

Now Id is unique for each call, but is duplicated for “get” in scope of each individual request. This way one of the “get” block is lost on the chart.

archie-swif avatar Jan 06 '19 23:01 archie-swif

Hey @archie-swif can you check if the latest versions work fine for that matter?

marcingrzejszczak avatar Aug 05 '20 17:08 marcingrzejszczak

Hi @marcingrzejszczak , thanks for checking up!

I have tried it with Spring Boot 2.3.2.RELEASE, 2.4.0-SNAPSHOT, and latest zipkin image 2.21.5. Here is how it looks by default: Screen Shot 2020-08-05 at 16 17 10

I see that more spans of the service ONE are present in json, but they are not visible in zipkin UI. It looks like the issue is in non unique "id" values in span objects. Zipkins takes the one span with an unique id and ignores others with the same id.

If exported trace json is updated so that all ids are unique: https://github.com/archie-swif/sleuth-webclient-retry/commit/853173bae76492ae74fbbd8aaba4d4cc2741617a

Then it looks this way, no spans are lost: zipkin-expected

Thank you!

archie-swif avatar Aug 06 '20 00:08 archie-swif

I've bumped boot to 2.4.0 and Cloud to 2020.0.0-SNAPSHOT and I got the following trace

image

I guess it looks better? Still I don't see the controllers from one but async which is interesting... I guess this async is a retry attempt

marcingrzejszczak avatar Aug 06 '20 12:08 marcingrzejszczak

In 2.4.0, additional "async" span is being tracked, but ONE's "get" span is still ignored, even if it's present in json. After id fix it looks this way: Screen Shot 2020-08-06 at 11 09 57

archie-swif avatar Aug 06 '20 18:08 archie-swif

Please upgrade to Micrometer Tracing. Spring Cloud Sleuth is feature complete and out of OSS support.

marcingrzejszczak avatar Feb 09 '24 13:02 marcingrzejszczak