opentelemetry-python
opentelemetry-python copied to clipboard
Timeout and retry of exporters is broken
Describe your environment
OS: Ubuntu 24.04 Python version: Python 3.12 SDK version: 1.25.1 API version: 1.25.1
What happened?
I am piloting the opentelemetry Python implementation in order to integrate it in several Python applications. During my experiments I observed major issues when a non reachable OTLP target is configured. During shutdown of the instrumented application, it seems to hang forever. Even if one might see this as edge case, to me this is really a blocker to start using the library as it affects the instrumented application in a quite negative way.
I tried to fix this locally, then researched here and came across many other reports (e.g #2663, #2402, #3309) and PRs (e.g. #3387, #3764). I read into the code, bugs and PRS and, tried to understand everything, including the history.
I am now writing to open the discussion how to approach this. I'll start with a rough conceptual proposal:
- The exporter uses
self._timeout(based onOTEL_EXPORTER_OTLP_TIMEOUT, defaults to 10 sec) as maximum time to spend on an export. - The exponential backoff retry is kept, but stays within the boundaries of
self._timeout.
On a more technical level:
- At the beginning of
OTLPExporterMixin._exporta deadline is calculated which is then the base for the further execution of this function. - The time needed for each
self._client.Exportcall is subtracted from the deadline. - As long as the deadline is not reached, we do retries with growing waits in between.
- In case
shutdownis called, the signal is stored inself._shutdown. - Now, once the shutdown signal is set, we can either continue the export until it's deadline, incl. additional retries, and then shutdown or interrupt the export the next time
_exportgets control over it (when it returns fromself._client.Exportor we interrupt the retry sleep). I tend to the 2nd option. If there is no agreement on how to proceed with this, we could also make it configurable to give users more control depending on their use case.
This way, we would have a clear maximum time to shutdown. What do you think?
I also propose to change the semantic of shutdown. Instead of waiting for getting acceptance to set the shutdown signal and then setting the shutdown signal, I'd rather simplify it to just "set the shutdown signal". This would return immediately and not interrupt the export. It's totally up to _export to pick up the signal as needed. The caller of the shutdown function would have to join the thread anyways to be sure the export thread correctly shut down. Do you think this is acceptable?
Steps to Reproduce
- Configure OTLP target
- Export a span
- Try to stop the application
Expected Result
The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.
Actual Result
A hanging application which would not shut down in a few seconds as before.
Additional context
No response
Would you like to implement a fix?
Yes
We failed at reviewing https://github.com/open-telemetry/opentelemetry-python/pull/3764 because it's huge, if you think the code is fine cutting it in more smaller pieces could be a path to move forward. If not still please still try to work in smaller batches.
This change looks like the most promising of the ones I've seen so far. Of course, this is far too big a chunk for readers to understand easily. I also haven't had a closer look at the commit yet to understand if it fits my suggestion above. I will do that later.
Regardless of the PR, does my approach above sound like a viable path or are there any concerns?
@LarsMichelsen I think that assuring the retries should happen inside the OTEL_EXPORTER_OTLP_TIMEOUT is correct so I agree with the conceptual proposal.
I took a deep look into #3764, tried to separate the individual changes and bring them into a meaningful order. If you are interested, you can find the current state of my try in https://github.com/LarsMichelsen/opentelemetry-python/tree/exporter_timeout .
While I think the approach of the PR in general is good, I changed some aspects of the change. For example, I don't agree with the change picking the lowest of the given timeouts (see here).
Having the change split into multiple steps, it's still a big undertaking. So I guess a reviewer will have to take some time to get into it. I hope that someone will be found who feels up to it.
I'd also be fine applying the changes separately to get the set of pending changes smaller step by step. Don't know how you generally deal with such changes. Just let me know.
Overall, I must say that I had a hard time with several tests. I think it would help to get rid of mocks and patches and make the code more easily testable, for example by using more dependency injection. However, I preferred to hold back at this point, as the changes are already extensive enough.
Specifically I am currently facing issues on Windows test runs (https://github.com/LarsMichelsen/opentelemetry-python/actions/runs/10451069489). To me it is not clear whether this a known general weakness or my changes broke something. Any guidance on this?
In my organization this issue raises questions whether the library is production ready. So I see this as something that in the some or the other way needs to be solved sooner than later. If we can not solve it, it may even prevent adoption in our stack. I wonder if this is not that relevant to other environments and how others deal with the current behavior?
@LarsMichelsen
Thanks for looking into this.
I'd also be fine applying the changes separately to get the set of pending changes smaller step by step.
If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic so anything that can make the PR more readable and easier to comprehend would be most optimal. Also, if you have any design/architectural decisions you would like to run by the community to get more eyes on, feel free to join the weekly Python SIG (https://calendar.google.com/calendar/u/0/embed?src=c_2bf73e3b6b530da4babd444e72b76a6ad893a5c3f43cf40467abc7a9a897f977@group.calendar.google.com) and bring up this topic. Feel free to add your topic to the agenda. We meet every Thursday 9am PST.
If you are willing to take a stab at this, it would be greatly appreciated. We have had a lack of contributors/approvers who are knowledgeable enough around this topic so anything that can make the PR more readable and easier to comprehend would be most optimal.
Have you had a look at the first link I provided? It points to a branch where I split the changes. Here is the list of curent changes: https://github.com/open-telemetry/opentelemetry-python/compare/main...LarsMichelsen:opentelemetry-python:exporter_timeout
In principle, is this in a form which could be acceptable or do you see any obvious issues?
If not, I'd try to iron out the windows related issues reported by the CI and send a PR.
I'd try to iron out the windows related issues reported by the CI and send a PR.
This would probably be preferred and would get more eyes on.
Unfortunately I only get the error reported by the Windows jobs executed on the Github CI. Having no Windows system at hand right now, I am having a hard time figuring out whats going on. I'll have to get my hands on some Windows system to track this down. Impossible to do it via the CI, the turnaround is simply too long.
Got my hands on a Windows system and debugged my way through it. Seems I had too tight timeouts set so that it could not finish successfully in all environments.
I just created the pull request #4183. Now I hope that someone has the time and patience to review these changes.
What is the current state of this issue and the corresponding PR? As far as I can see, a maintainer is required to review the PR?
I tried my best, took over a monolithic PR of some other contributor, tried to break it down in smaller chunks, which should make the changes easier to understand. I did this hoping that we could get some reviewer on board to get the changes merged step by step. Iirc I tried to get some guidance on how to get this merged efficiently, but didn't get any substantial support. Maybe I tried it the wrong way.
I don't have the impression that this problem is being taken seriously. I don't really understand why, because such behavior has a direct impact on the reliability of the instrumented services. I also wonder why more users don't stumble across this problem.
Last time I had a good moment to look into this the code base already changed quite a bit. This means the whole commit chain would have to be updated again. Or the changes need to be uploaded for review in even smaller chunks. I don't know. It's a bit frustrating.
For our services I can say: It is quite possible that we will have to remove the library from our service again. It would be a great pity, as the traces are very useful. We will see.
To recap the context from OP:
I'll start with a rough conceptual proposal:
- The exporter uses
self._timeout(based onOTEL_EXPORTER_OTLP_TIMEOUT, defaults to 10 sec) as maximum time to spend on an export.- The exponential backoff retry is kept, but stays within the boundaries of
self._timeout.
Expected Result
The shutdown of my processes should not be influenced in an unexpected way. As a user of the library, ideally I get a clear understanding of the mechanics and control over the timeouts to make my own decisions depending on my use case.
Actual Result
A hanging application which would not shut down in a few seconds as before.
@DylanRussell has made some great improvements here. Dylan, do you think this issue is sufficiently addressed or is there anything outstanding from this issue that's not tracked elsewhere?
Yeah I think it can be closed, we basically implemented what was proposed here. Timeout is now inclusive of the retries for example
Closing this now. See https://github.com/open-telemetry/opentelemetry-python/pull/4638 and https://github.com/open-telemetry/opentelemetry-python/pull/4564 for details
Thanks a lot for working on this. I gave it a quick first try today using the current release and it looks quite promising.