opentelemetry-demo
Change AccountingService from Go to DotNet (auto)
Changes
Fixes #589
Changes the AccountingService from a Go service to a .NET service. Uses the OpenTelemetry .NET Automatic Instrumentation.
NOTE: This is a first preview of the demo app instrumented with the OpenTelemetry .NET Automatic Instrumentation. Some parts are still TBD. Please review and add comments about improvements and missing parts / docs.
Merge Requirements
For new feature contributions, please make sure you have completed the following essential items:
- [x] CHANGELOG.md updated to document new feature additions
- [x] Appropriate documentation updates in the docs
- [ ] Appropriate Helm chart updates in the helm-charts
Maintainers will not merge until the above have been completed. If you're unsure which docs need to be changed, ping @open-telemetry/demo-approvers.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Example Jaeger traces:
CC: @lachmatt @Kielek Should parentless Kafka consume requests be filtered? It seems that without additional context these just trash the output.
After removing EOF
@RassK Can you add a Changelog entry for this? Also, we will need to update the docs and the Helm chart for this. If you need help prepping a PR in those areas, please let us know.
@puckpuck thanks for helping guide the last missing steps. I have updated the changelog. I would be glad to accept some help with the other missing parts as well. It's my first PR to this repo.
LGTM, I think this can be merged as the PR to opentelemetry.io is ready. As I understand it, the Helm charts should not be affected; we are just changing one type of service to another.
We will need to bump up the memory for the service in the Helm chart, but nothing beyond that. We can merge this before having to do anything in Helm, since Helm is bound to a release and not nightlies.
I want to keep the label on this PR so we can track all items that need to be updated when doing our next release.
When I run this branch, I notice clock skew with the accounting service, which causes issues with the trace waterfall. I'm not sure what this service does differently from our other .NET service (cart), but we don't see the same clock skew there. @RassK can you take a look at this?
@lachmatt could you recheck that Kafka instrumentation? I think it is the source of this issue.
@puckpuck I consulted with @lachmatt and we reached the conclusion that this is not a clock skew issue.
In .NET, the Kafka consume span's start time is the start of the polling. Previously we had issues with the number of spans, because every poll (even an empty one) created a span. Now that such spans are filtered by default, I could lower the maximum poll timeout back to the default (100 ms), so the accounting service's Kafka consume spans can now be <100 ms. That is still large compared to some other services, which report processing times of around 1 ms. The default setting now renders the Jaeger UI in a more understandable way, but we cannot make the UI look prettier just because of the messaging behaviour; the trace still has to reflect the actual logic.
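For context, here is a minimal sketch of what such a consume loop with a 100 ms poll timeout can look like using Confluent.Kafka. The broker address, consumer group, topic name, and the ProcessOrder handler are illustrative assumptions, not the demo's actual configuration:

```csharp
// Minimal sketch of a poll loop; names below are illustrative only.
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "kafka:9092",            // assumed broker address
    GroupId = "accountingservice",              // assumed consumer group
    AutoOffsetReset = AutoOffsetReset.Earliest,
};

using var consumer = new ConsumerBuilder<string, byte[]>(config).Build();
consumer.Subscribe("orders");                   // assumed topic name

while (true)
{
    // The auto-instrumented consume span starts when polling starts, so with a
    // 100 ms poll timeout an "empty" wait adds at most ~100 ms to the span.
    var result = consumer.Consume(TimeSpan.FromMilliseconds(100));
    if (result is null)
    {
        continue;                               // empty poll: no message within the timeout
    }

    ProcessOrder(result.Message.Value);
}

static void ProcessOrder(byte[] payload)
{
    // hand the message off to the actual business logic (placeholder)
}
```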
I see it more as something Jaeger could improve in its UI, by using a more dynamic time axis and collapsing large portions like this (10 s wait time).
Here is the improved look >
@RassK wouldn't that be better handled in a linked span, instead of a child span?
We did that for the fraudDetectionService.
@julianocosta89 that's unfortunately not configurable in auto instrumentation.
Something is wrong with the .NET SDK here. No other SDKs start the span when you start polling.
@puckpuck
This is a behavior of the Kafka client instrumentation, which is part of the OTel .NET Automatic Instrumentation.
The semantic conventions for messaging recommend creating receive spans for pull-based scenarios.
Most of the existing Kafka client instrumentations create process spans, but these were problematic, as explained in the OTEP and in the PR.
We should investigate doing this so the trace doesn't look like the accounting service starts before everything else; it looks like this should be done with a process consumer span. In its current state, this makes the PlaceOrder trace very hard to read and will only raise questions from users about why the accounting service started before the request itself.
> @RassK wouldn't that be better handled in a linked span, instead of a child span?
The creation context is used as a parent, in addition to being added as a link. This scenario is described in the OTEP and was discussed when working on the instrumentation.
> We should investigate doing this so the trace doesn't look like the accounting service starts before everything else; it looks like this should be done with a process consumer span. In its current state, this makes the PlaceOrder trace very hard to read and will only raise questions from users about why the accounting service started before the request itself.
@pyohannes What would be the recommendation from the Messaging SIG side in such a scenario?
> The creation context is used as a parent, in addition to being added as a link. This scenario is described in the OTEP and was discussed when working on https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/pull/3055#discussion_r1399074286.
This part is correct: you can use the creation context both as parent and link in "Receive" or "Process" operations.
The problem lies more in the duration of the "Receive" operation, which also includes the time spent polling (which technically isn't part of this trace). I see you achieve this by starting a span only after a message was finally received, but then explicitly setting the start time of the span to the start of the polling operation.
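Schematically, the behavior described above looks something like the following sketch. This is not the auto-instrumentation's actual code; the method, source, and span names are assumptions used only to illustrate the backdated start time:

```csharp
// Illustration only: the span is created after a message finally arrives,
// but its start time is backdated to when polling began.
using System.Diagnostics;
using Confluent.Kafka;

static class BackdatedReceiveSketch
{
    public static Activity? Consume(
        ActivitySource source,
        IConsumer<string, byte[]> consumer,
        TimeSpan pollTimeout)
    {
        var pollStart = DateTimeOffset.UtcNow;      // polling begins here
        var result = consumer.Consume(pollTimeout); // may block up to pollTimeout
        if (result is null)
        {
            return null;                            // empty poll: no span at all
        }

        // The span's duration now covers the whole poll, not just the receive itself.
        return source.StartActivity(
            "orders receive",                       // span name (illustrative)
            ActivityKind.Consumer,
            default(ActivityContext),               // parent context
            null,                                   // tags
            null,                                   // links
            pollStart);                             // explicit start time = poll start
    }
}
```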
I'll put this on the agenda for discussion in this week's messaging SIG meeting. While what is happening here doesn't really violate the semantic conventions, I think the "Receive" duration should only start after a message was received from the broker. This would result in more meaningful traces.
I submitted https://github.com/open-telemetry/semantic-conventions/issues/1050 to track related discussions.
This was discussed in the messaging workgroup:
- The duration of "Receive" spans should include the time spent for polling.
- "Receive" spans by default should link to the creation context, but shouldn't use it as parent.
See a more detailed summary here.
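To make the second point concrete, here is a minimal C# sketch of starting a consumer span that links to the creation context instead of parenting on it. This is not the auto-instrumentation's code; the source name, span name, and method are illustrative assumptions:

```csharp
// Sketch: "Receive" span that links to the creation context but does not
// use it as a parent, per the messaging workgroup recommendation above.
using System.Diagnostics;

static class LinkedReceiveSketch
{
    private static readonly ActivitySource Source = new("AccountingService.Kafka"); // illustrative

    public static Activity? StartReceiveActivity(ActivityContext creationContext)
    {
        var links = new[] { new ActivityLink(creationContext) };

        return Source.StartActivity(
            "orders receive",                 // span name (illustrative)
            ActivityKind.Consumer,
            default(ActivityContext),         // no remote parent: starts a new local root
            null,                             // tags
            links);                           // producer's creation context attached as a link
    }
}
```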
I'll update the solution when a newer version of otel auto is available.
@RassK should we keep this PR as a Draft then?
> @RassK should we keep this PR as a Draft then?
Sure, I have no preferences there.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Updated solution (OTel Auto 1.7.0) >
Hey @RassK, it looks good. Before approving, would you mind explaining those log entries?
Accounting service started [CORECLR_ENABLE_PROFILING, 1] [CORECLR_PROFILER, {918728DD-259F-4A6A-AC2B-B85E1B658318}] [CORECLR_PROFILER_PATH, /app/OpenTelemetry.AutoInstrumentation.Native.so] [DOTNET_RUNNING_IN_CONTAINER, true] [DOTNET_STARTUP_HOOKS, /app/OpenTelemetry.AutoInstrumentation.StartupHook.dll]
That's quick startup diagnostics (by AccountingService.Helpers, which filters the relevant environment variables). Since automatic instrumentation is configured mostly through environment variables, this can be helpful for quick diagnosis.
Ideally the built-in diagnostics (OTEL_LOG_LEVEL=debug) should do something similar, but it logs to a file instead.
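For illustration, a helper along these lines could look like the sketch below. The class name, prefixes, and output format are assumptions based on the log entry above, not the actual AccountingService.Helpers code:

```csharp
// Sketch of a startup-diagnostics helper; names and format are illustrative.
using System;
using System.Collections;
using System.Linq;

static class StartupDiagnostics
{
    // Prefixes relevant to the .NET auto-instrumentation, based on the
    // environment variables visible in the log entry above.
    private static readonly string[] Prefixes = { "CORECLR_", "DOTNET_", "OTEL_" };

    public static string Describe()
    {
        var pairs = Environment.GetEnvironmentVariables()
            .Cast<DictionaryEntry>()
            .Where(e => Prefixes.Any(p => ((string)e.Key).StartsWith(p, StringComparison.Ordinal)))
            .OrderBy(e => (string)e.Key)
            .Select(e => $"[{e.Key}, {e.Value}]");

        return "Accounting service started " + string.Join(" ", pairs);
    }
}

// Usage, e.g. at the top of Program.cs:
// Console.WriteLine(StartupDiagnostics.Describe());
```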
I've updated the tracetest to reflect the change, hope you don't mind. 😅
This should be good to go, I'll just wait to see if any other maintainer/approver has something to say.
Thanks for taking care of it 🤩
Is increasing the memory limit in the Helm chart the only thing left? Can anyone help / explain where? I have had zero exposure to Helm charts so far 😬
The Helm chart will not be updated until the next version release.
If you want to update it, you can do the memory limit here: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-demo/values.yaml#L165
Then you need to bump a minor version here: https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-demo/Chart.yaml#L4
And last, you need to run: make generate-examples CHARTS=opentelemetry-demo.
You could also sync with @rogercoll and add it to his opened PR: https://github.com/open-telemetry/opentelemetry-helm-charts/pull/1232
Whenever you add it, you will still need to run the make generate command above.
@puckpuck, @mviitane are you good with this one being merged?
I tested again, and all looks fine using a linked span here instead. I will make another PR so the frauddetectionservice goes back to creating a child span instead of a linked span. This will allow us to showcase both examples with a message queue.