[Bug] Unable to run workflows with OpenTelemetry and ddtrace
What are you really trying to do?
I'm attempting to enable tracing on my worker using the TracingInterceptor and the tracer provided by ddtrace.
Describe the bug
An upstream change in the OpenTelemetry SDK seems to have caused an issue with Temporal's sandbox environment, producing an error stating that os.environ.get isn't allowed in a workflow.
The relevant change on the OpenTelemetry side is here; if I run with a version prior to 1.29, I no longer get the error.
Minimal Reproduction
A workflow implemented on a client with the TracingInterceptor attached and a valid OpenTelemetry SDK configured; in my case I'm using ddtrace with the DD_TRACE_OTEL_ENABLED environment variable set to true. From what I can tell, though, the call that isn't allowed in the sandbox comes directly from the OpenTelemetry SDK, not from the ddtrace patch.
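Roughly, the setup looks like this (a minimal, untested sketch; the workflow class, task queue, and server address are illustrative, and ddtrace is assumed to be providing the OTel tracer via DD_TRACE_OTEL_ENABLED=true):

```python
import asyncio

from temporalio import workflow
from temporalio.client import Client
from temporalio.contrib.opentelemetry import TracingInterceptor
from temporalio.worker import Worker


@workflow.defn
class HelloWorkflow:
    @workflow.run
    async def run(self, name: str) -> str:
        return f"Hello, {name}!"


async def main() -> None:
    # TracingInterceptor uses the globally configured OTel tracer,
    # which ddtrace supplies when DD_TRACE_OTEL_ENABLED=true.
    client = await Client.connect("localhost:7233", interceptors=[TracingInterceptor()])
    worker = Worker(client, task_queue="example-task-queue", workflows=[HelloWorkflow])
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())
```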
Environment/Versions
- OS and processor: macOS Apple Silicon
- Temporal Version: SDK 1.9.0
- Running in Docker arm64
Additional context
You can see the stack trace and error displayed in the Temporal UI here: https://gist.github.com/connected-bkiiskila/d11592cb86271b5f343a4d387097d418
Thanks! Hrmmm...
Looking at your stack trace, it appears that at https://github.com/DataDog/dd-trace-py/blob/eddd51464e8c32db019e239c106317228b65667c/ddtrace/internal/opentelemetry/context.py#L24-L26 opentelemetry.baggage and other modules are unfortunately imported at runtime, which means you may be inadvertently re-importing them for every workflow run. You should be able to add import opentelemetry.baggage somewhere outside of your workflow (e.g. where you start the worker) to run the module initialization code that ddtrace is running lazily. This will also improve your performance. ddtrace (reasonably) assumes modules are cached and that imports are not rerun; in other environments that's true, but it's not the case inside workflows.
Though I am a bit surprised our import opentelemetry.baggage.propagation doesn't imply this. You may also have to configure the sandbox restrictions to treat these OTel imports as pass-through (https://github.com/temporalio/sdk-python?tab=readme-ov-file#passthrough-modules). It will require testing...
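A rough, untested sketch of that suggestion, placed in the module that starts the worker (not in workflow code):

```python
# Import the modules ddtrace would otherwise import lazily inside the workflow,
# so their module-level initialization runs once, up front, in the worker process.
import opentelemetry.baggage
import opentelemetry.baggage.propagation

# ...then build and run the Worker as usual in this same process.
```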
I'll give that extra import a shot, see if that works, and report back.
Hmmm, that doesn't seem to work either; the only thing I'm having luck with is the downgrade to <1.29.
For that passthrough documentation, would I just be flat-out allowing os.environ.get, or is there a way to specifically allow the call within the opentelemetry SDK?
The passthrough would be for passing through the imports, unrelated to the restrictions happening on os.environ.get. What's happening now, from your stack trace, is that from opentelemetry.baggage import get_all is being run at runtime in the workflow. We want that to have already been cached. You can probably just add this to the top of your workflow file:
```python
with workflow.unsafe.imports_passed_through():
    import opentelemetry.baggage
```
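In context, that would sit at the top of the workflow file, something like this (untested; the workflow class is illustrative):

```python
from temporalio import workflow

# Passed-through imports are loaded once outside the sandbox and shared,
# instead of being re-imported for every workflow run.
with workflow.unsafe.imports_passed_through():
    import opentelemetry.baggage


@workflow.defn
class MyWorkflow:
    @workflow.run
    async def run(self) -> None:
        ...
```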
But this is untested. Even if we do allow os.environ.get through the sandbox, the bigger problem is that you are re-importing OTel libraries for every workflow run, which you don't want to do.
Seems like that <1.29 fix was a fluke and only works some of the time for whatever reason, so I'm likely going to have to allow the import of os.environ.
We may also be able to look into it, though it seems to be ddtrace-specific in that they are importing at runtime. I do think it is worth doing your best to prevent this import from being treated as a new import for every workflow run (since that means it runs all the OTel bootstrap code every run and takes up extra memory).
Yeah, I'd like to avoid it if possible. Would you suggest creating a ticket on the ddtrace side and referencing this?
ddtrace is not really doing anything wrong as far as a normal Python library goes. https://github.com/DataDog/dd-trace-py/blob/eddd51464e8c32db019e239c106317228b65667c/ddtrace/internal/opentelemetry/context.py#L24-L26 is just a problem in Temporal workflows. But we need to get those imports preloaded and passed through. You may be able to do something like:
```python
import opentelemetry.baggage
import opentelemetry.trace

from temporalio.worker import Worker
from temporalio.worker.workflow_sandbox import SandboxedWorkflowRunner, SandboxRestrictions

my_worker = Worker(
    ...,
    workflow_runner=SandboxedWorkflowRunner(
        restrictions=SandboxRestrictions.default.with_passthrough_modules("opentelemetry", "ddtrace")
    ),
)
```
when you create your worker.
I have been seeing similar errors in a deployment of mine where Datadog sidecars are enabled.
Is this something we would need to patch into all our application workers, i.e.

```python
with workflow.unsafe.imports_passed_through():
    ...
```

or is there something that could/should be done at the python-sdk level to solve this for all users?
But we need to get those imports preloaded
@cretz is this something you guys plan on implementing? If you give some clear guidance on what needs to be done, I could possibly draft a PR for it.
I don't think we have plans to support ddtrace specifically at this time. I do think you need to pass through the import of ddtrace if it is used in workflow code, including in workflow interceptors, which are just workflow code. There may be other sandbox-avoidance work needed depending on the details of what ddtrace tries to do.
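An untested sketch of that guidance, in any workflow or workflow-interceptor module that imports ddtrace:

```python
from temporalio import workflow

# Pass these through so they are not re-imported (and re-initialized)
# inside the sandbox for every workflow run.
with workflow.unsafe.imports_passed_through():
    import ddtrace
    import opentelemetry.baggage
```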