opencensus-python
Python script randomly gets stuck after it finishes executing main program
Describe your environment. Python version 3.7.11 on Linux, opencensus-ext-azure v1.1.1, AKS v1.22.6.
Steps to reproduce. I have a Python script which uses the opencensus-ext-azure package to export logs to Application Insights. It runs on AKS as a job. From time to time, the pod finishes executing the main program but gets stuck and doesn't exit. I ran the script with "python -m trace --trace ..." and saw that it gets stuck in code coming from the opencensus-ext-azure package. I cannot reproduce this on demand, but I attached the Python execution trace.
What is the expected behavior? Script exits after it finishes executing the main program.
What is the actual behavior? Script hangs after it finishes executing the main program.
Would you happen to be using the metrics exporter?
I am using AzureLogHandler but maybe that is calling the metrics exporter at some stage?
Opencensus-ext-azure collects network-related metrics automatically in the background, which is why you see metric signals being sent out in your stack trace. Usually the metrics exporter running in the background is not intrusive at all, but it seems it may be causing your application to get stuck. What specifically is being output/shown when you refer to the application "hanging"? Can you also paste a snippet of how you are using the opencensus-ext-azure related components and how the main program is being executed?
So the Python script is running on AKS as a job. I can see in the logs that the main program executed the last instruction, because the last instruction is a logging call: base_logger.info("Job finished"). So normally the Kubernetes pod state should change from "Running" to "Completed", but sometimes the pod gets stuck in the running state. If I exec into the container and run ps -T -p <python_script_pid>, I can see that there are still running threads, which means the main thread exited but some background threads keep running forever.
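For reference, the same check can also be done from inside Python just before the script ends. This is only a minimal sketch using the standard threading module, not something that is in the actual job:

import threading

def dump_live_threads():
    # Print every thread that is still alive; any non-daemon thread other
    # than MainThread is what keeps the interpreter (and the pod) running.
    for t in threading.enumerate():
        print(f"{t.name}: daemon={t.daemon}, alive={t.is_alive()}")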
This is how I am using the opencensus-ext-azure related components:
import os
from logging import getLevelName, getLogger, StreamHandler, Formatter, Logger

from opencensus.ext.azure.log_exporter import AzureLogHandler

CONNECT_STRING = os.getenv('insights_key')

def create_and_add_handlers(logger: Logger, level: int):
    logger.setLevel(level)
    # TimezoneFormatter is our own Formatter subclass, defined elsewhere
    formatter = TimezoneFormatter('%(asctime)s - dataCollectors - %(name)s - %(levelname)s - %(message)s')
    # Only attach the AzureLogHandler outside local runs
    handlers = [
        AzureLogHandler(connection_string=CONNECT_STRING),
        StreamHandler()
    ] if os.getenv('env') != 'local' else [
        StreamHandler()
    ]
    for handler in handlers:
        handler.setLevel(level)
        handler.setFormatter(formatter)
        if len(logger.handlers) < len(handlers):
            logger.addHandler(handler)
    return logger

if __name__ == '__main__':
    # log_level is set elsewhere in the script
    base_logger = create_and_add_handlers(getLogger('Some Name'), log_level)
    ....
    base_logger.info("Job finished")
The main program is executed in a Docker container with the command python3 python_script.py.
Ah, also: from time to time I see this message at the end of the logs: warning:opencensus.ext.azure.common.transport:request time out. ingestion may be backed up. retrying.
But I don't necessarily get it whenever the script cannot exit, so I am not sure if it is related.
@mosneji It is possible that the background metrics exporter thread is hanging while trying to flush on exit. This will be fixed in the next release so that this specific exporter will not flush upon exit (it simply attempts to close).
As for the error messages in logs, it's possible the ingestion endpoint is being overloaded, which is usually when this error message is returned. Your telemetry won't be dropped in this case and will be stored for retry at a later time.
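In the meantime, if the job absolutely must not stay in "Running", a blunt stop-gap (just a sketch, not the proper fix that will ship in the release; finish() is a hypothetical wrapper around the last log line) is to flush logging explicitly and then force the interpreter to exit as the very last step of the job:

import logging
import os

def finish(logger):
    logger.info("Job finished")
    logging.shutdown()  # flushes and closes all handlers, including AzureLogHandler
    os._exit(0)         # hard exit; lingering background threads are not joined

os._exit skips any remaining cleanup, so it should really be the final statement of the script.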
Thank you for looking into this @lzchen. Do you have an estimated release date of the new version? Thanks.
@mosneji Will be looking to release by the end of this month.
@lzchen many thanks.
I have an issue that is maybe related? It happens when the Python program finishes and wants to close.
It seems to come from a pull request from 2 weeks ago. This started happening just recently: it works in version 1.1.3, but not in 1.1.4.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "C:\temp\Apps\Anaconda64\lib\logging\__init__.py", line 2141, in shutdown
h.flush()
File "...\Anaconda64\lib\site-packages\opencensus\ext\azure\log_exporter\__init__.py", line 109, in flush
if self._queue.is_empty():
AttributeError: 'Queue' object has no attribute 'is_empty'
self._queue._queue.empty() works on my end 😮
KR Niklas
@nosterlu Just to clarify: in version 1.1.4 you are seeing that error message from time to time, but not in 1.1.3?
My code fails every time with 1.1.4.
In version 1.1.3 this check for is_empty does not exist (log_exporter/__init__.py change for 1.1.4).
Digging deeper on my end, it seems I have a different version of opencensus common that is not compatible with 1.1.4? Because .is_empty is not a member of Queue in my common/schedule/__init__.py.
I have version 0.8.0 of opencensus, and my schedule/__init__.py looks like this.
Hmm, when pip installing 1.1.4, is it maybe not forcing opencensus to move to > 0.8.0?
This is from pip
Requirement already satisfied: opencensus-context==0.1.2 in /home/azureuser/actions-runner-analytics/_work/_tool/Python/3.9.1/x64/lib/python3.9/site-packages (from opencensus<1.0.0,>=0.8.0->opencensus-ext-azure==1.1.4
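A quick way to check which schedule Queue is actually installed (just a sketch, using the opencensus.common.schedule path from the traceback above):

from opencensus.common.schedule import Queue

# False on opencensus 0.8.0 (my case), True on the version 1.1.4 expects
print(hasattr(Queue, "is_empty"))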
KR Niklas
Thanks for the investigation. Yes, 1.1.4 should have bumped the dependency requirement to opencensus >= 0.9.0. I will be doing this as part of the next release.
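Until that release is out, a possible stop-gap (just a sketch, with the version numbers taken from this thread) is to pin the core package explicitly when installing:

pip install "opencensus>=0.9.0" opencensus-ext-azure==1.1.4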
@lzchen I am now using version 1.1.5 but the issue I mentioned is still happening. Any advice? Thanks.
@mosneji Could you try upgrading to opencensus-ext-azure 1.1.6?
Version 1.1.6 seems to fix the problem. Thanks.
Is it fixed? I have a similar problem with the latest version.