opentelemetry-demo icon indicating copy to clipboard operation
opentelemetry-demo copied to clipboard

Change AccountService from Go to DotNet (auto)

Open RassK opened this issue 10 months ago • 17 comments

Changes

Fixes #589

Changes AccountingService from Go to .NET service. Uses OpenTelemetry .NET Automatic Instrumentation.

NOTE: It's a first preview of demo app that is instrumented with OpenTelemetry .NET Automatic instrumentation. Some parts are still TBD. Please review and add comments about improvements and missing parts / docs.

Merge Requirements

For new features contributions please make sure you have completed the following essential items:

  • [x] CHANGELOG.md updated to document new feature additions
  • [x] Appropriate documentation updates in the docs
  • [ ] Appropriate Helm chart updates in the helm-charts

Maintainers will not merge until the above have been completed. If you're unsure which docs need to be changed ping the @open-telemetry/demo-approvers.

RassK avatar Apr 19 '24 11:04 RassK

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Apr 30 '24 03:04 github-actions[bot]

Example Jaeger traces: image image

CC: @lachmatt @Kielek Should parentless Kafka consume requests be filtered? Seems without additional context these just trash the output.

RassK avatar Apr 30 '24 12:04 RassK

After removing EOF

image

RassK avatar Apr 30 '24 13:04 RassK

@RassK Can you add a Chanlog entry for this? Also, we will need to update docs and the Helm chart for this. If you need help to prep a PR in those areas, please let us know.

Screenshot 2024-05-13 at 10 53 00 AM

puckpuck avatar Apr 30 '24 15:04 puckpuck

@puckpuck thanks for helping to guide the last steps missing. I have updated the changelog. Would be glad to accept some help with other missing parts as well. It's my first PR to this repo.

RassK avatar May 06 '24 08:05 RassK

LGTM, I think this can be merged as the PR to opentlemetry.io is ready. As I understand, helm charts should not be affected. We are changing just one type of service to another.

Kielek avatar May 10 '24 16:05 Kielek

We will need to bump up the memory for the service in the Helm chart, but nothing beyond that. We can merge this before having to do anything in Helm, since Helm is bound to a release and not nightlies.

I want to keep the label on this PR so we can track all items that need to be updated when doing our next release.

puckpuck avatar May 13 '24 14:05 puckpuck

When I run this branch, I notice clock skews with the accounting service, which causes issues with the trace waterfall. I'm not sure what this service does differently than our other .NET service (cart), but we don't see the same clock skew there. @RassK can you take a look at this?

puckpuck avatar May 13 '24 14:05 puckpuck

@lachmatt could you recheck that kafka instrumentation, I think that is the source of this issue.

RassK avatar May 14 '24 08:05 RassK

@puckpuck So consulted with @lachmatt and we reached conclusion that it is not the issue with clock skew.

In the .NET the Kafka span's instrumentation start time is the start of the polling. Previously we had issues with the amount of spans, so every poll (even an empty one) created a span. Now as we by default filter such spans, I could lower the polling max timeout to default (100ms), so accounting service's Kafka consume spans could be now <100ms, which can be still huge compared to some other services reporting processing time around 1ms. The default setting now renders the Jaeger UI in more understandable way... but we cannot make the UI look more beautiful just because of the messaging behaviour, it still has to reflect the actual logic.

I see it more like Jaeger could improve theirs's UI by using more dynamic time axle and collapse big portions like this (10s wait time).

Here is the improved look > image

RassK avatar May 15 '24 09:05 RassK

@RassK wouldn't that be better handled in a linked span, instead of a child span?

We did that for fraudDetectionService.

julianocosta89 avatar May 15 '24 10:05 julianocosta89

@julianocosta89 that's unfortunately not configurable in auto instrumentation.

RassK avatar May 15 '24 10:05 RassK

Something is wrong with the .NET SDK here. No other SDKs start the span when you start polling.

puckpuck avatar May 15 '24 11:05 puckpuck

@puckpuck This is a behavior of Kafka client instrumentation which is part of OTel .NET Autoinstrumentation. Semantic conventions for messaging recommend to create receive spans for pull-based scenarios. Most of the existing instrumentations for Kafka client are creating process spans, but these were problematic, as explained in OTEP, and in PR.

lachmatt avatar May 15 '24 11:05 lachmatt

We should investigate doing this so the trace doesn't look like accounting service is starting before everything else, which looks like this should be done with a process consumer span. In its current state, this makes the consumption of the PlaceOrder trace very hard to do and will only raise questions from users about why the accounting service started before the request itself.

puckpuck avatar May 15 '24 16:05 puckpuck

@RassK wouldn't that be better handled in a linked span, instead of a child span?

Creation context is used as a parent, in addition to being added as a link. This scenario is described in OTEP and was discussed when working on the instrumentation.

We should investigate doing this so the trace doesn't look like accounting service is starting before everything else, which looks like this should be done with a process consumer span. In its current state, this makes the consumption of the PlaceOrder trace very hard to do and will only raise questions from users about why the accounting service started before the request itself.

@pyohannes What would be the recommendation from Messaging SIG side in such scenario?

lachmatt avatar May 16 '24 07:05 lachmatt

Creation context is used as a parent, in addition to being added as a link. This scenario is described in OTEP and was discussed when working on the https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/pull/3055#discussion_r1399074286.

This part is correct: you can use the creation context both as parent and link in "Receive" or "Process" operations.

The problem lies more in the duration of the "Receive" operation, which also includes the time spent in polling (which technically isn't part of this trace). I see you achieve this by starting a span only after a message was finally received, but by explicitly setting the start time of the span to the start of the polling operation.

I'll put this on the agenda for discussion in this week's messaging SIG meeting. While what is happening here doesn't really violate the semantic conventions, I think the "Receive" duration should only start after a message was received from the broker. This would result in more meaningful traces.

I submitted https://github.com/open-telemetry/semantic-conventions/issues/1050 to track related discussions.

pyohannes avatar May 20 '24 11:05 pyohannes

This was discussed in the messaging workgroup:

  • The duration of "Receive" spans should include the time spent for polling.
  • "Receive" spans by default should link to the creation context, but shouldn't use it as parent.

See a more detailed summary here.

pyohannes avatar May 27 '24 10:05 pyohannes

I'll update the solution when a newer version of otel auto is available.

RassK avatar Jun 03 '24 09:06 RassK

@RassK should we keep this PR as a Draft then?

julianocosta89 avatar Jun 04 '24 08:06 julianocosta89

@RassK should we keep this PR as a Draft then?

Sure, I have no preferences there.

RassK avatar Jun 04 '24 08:06 RassK

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions[bot] avatar Jun 12 '24 03:06 github-actions[bot]

Updated solution (OTel Auto 1.7.0) > image

RassK avatar Jun 19 '24 12:06 RassK

Hey @RassK it looks good, before approving would you mind explaining those log entries?

Accounting service started
[CORECLR_ENABLE_PROFILING, 1]
[CORECLR_PROFILER, {918728DD-259F-4A6A-AC2B-B85E1B658318}]
[CORECLR_PROFILER_PATH, /app/OpenTelemetry.AutoInstrumentation.Native.so]
[DOTNET_RUNNING_IN_CONTAINER, true]
[DOTNET_STARTUP_HOOKS, /app/OpenTelemetry.AutoInstrumentation.StartupHook.dll]

That's a quick startup diagnostics (by AccountingService.Helpers that filters relevant environment variables). Due automatic instrumentation is configured mostly by envs, then this could be helpful for a quick diagnosis.

Ideally a built in diagnostics (OTEL_LOG_LEVEL=debug) should do the similar but instead logs to the file.

RassK avatar Jun 19 '24 22:06 RassK

I've updated the tracetest to reflect the change, hope you don't mind. 😅

This should be good to go, I'll just wait to see if any other maintainer/approver has something to say.

Thanks for taking care of it 🤩

julianocosta89 avatar Jun 20 '24 08:06 julianocosta89

Just need to increase mem limit in the helm chart left? Can anyone help / explain where, I have had 0 connections to helm charts so far 😬

RassK avatar Jun 20 '24 09:06 RassK

the Helm chart will not be updated till next version release.

If you want to update it, you can do the memory limit here: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-demo/values.yaml#L165

Then you need to bump a minor version here: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-demo/Chart.yaml#L4

And last you need to run the: make generate-examples CHARTS=opentelemetry-demo.

You could also sync with @rogercoll and add it to his opened PR: https://github.com/open-telemetry/opentelemetry-helm-charts/pull/1232

Whenever you add it, you will still need to run the make generate command above.

julianocosta89 avatar Jun 27 '24 14:06 julianocosta89

@puckpuck, @mviitane are you good with this one being merged?

julianocosta89 avatar Jun 27 '24 14:06 julianocosta89

I tested again, and all looks fine using a linked span here instead. I will make another PR so the frauddetectionservice goes back to creating a child span instead of a linked span. This will allow us to showcase both examples with a message queue.

puckpuck avatar Jun 27 '24 20:06 puckpuck