opencensus-python icon indicating copy to clipboard operation
opencensus-python copied to clipboard

Python script randomly gets stuck after it finishes executing main program

Open mosneji opened this issue 2 years ago • 15 comments

Describe your environment. Python version 3.7.11 on Linux, opencensus-ext-azure v1.1.1, AKS v1.22.6.

Steps to reproduce. I have a python scripts which is using opencensus-ext-azure package to export logs to App Insight. It is running on AKS as a job. From time to time, the pod finishes executing the main program but it gets stuck and doesn't exits. I ran the script with "Python -m trace --trace ... " and I saw that it gets stuck in some code coming from the opencensus-ext-azure package. I cannot reproduce this on demand but I attached Python execution tracing.

What is the expected behavior? Script exits after it finishes executing the main program.

What is the actual behavior? Script hangs after it finishes executing the main program.

python_execution_traces.txt

mosneji avatar Jun 07 '22 12:06 mosneji

Would you happen to be using the metrics exporter?

lzchen avatar Jun 07 '22 16:06 lzchen

Would you happen to be using the metrics exporter?

I am using AzureLogHandler but maybe that is calling the metrics exporter at some stage?

mosneji avatar Jun 08 '22 07:06 mosneji

Opencensus-ext-azure collects network related metrics automatically in the background, which is why in your stack track you see metric signals being sent out. Usually the metrics exporter running in the backgroudn is not intrusive at all but it seems that it may be causing your application to get stuck. What specifically is being outputted/shown when you are referring to the application "hanging"? Can you also paste a snippet of how you are using opencensus-ext-azure related components and how the main program is being executed?

lzchen avatar Jun 08 '22 19:06 lzchen

So the Python script is running on AKS as a job, I can see in the logs that the main program executed the last instruction because the last instruction is a logging instruction "base_logger.info("Job finished")". So normally, the Kubernetes pod state should change from "Running" to "Completed" but sometimes, the pod gets stuck in running state. If I exec into the container and do ps -T -p <python_script_pid> I can see that there are still running threads which means that the main thread exited but some background threads keep running forever.

This is how I am using the opencensus-ext-azure related components:

from opencensus.ext.azure.log_exporter import AzureLogHandler
from logging import getLevelName, StreamHandler, Formatter, Logger

CONNECT_STRING = os.getenv('insights_key')

def create_and_add_handlers(logger: Logger, level: int):
  logger.setLevel(level)
  formatter = TimezoneFormatter('%(asctime)s - dataCollectors - %(name)s - %(levelname)s - %(message)s')
  
  handlers = [
      AzureLogHandler(connection_string=CONNECT_STRING),
      StreamHandler()
  ] if os.getenv('env') != 'local' else [
      StreamHandler()
  ]
  
  for handler in handlers:
      handler.setLevel(level)
      handler.setFormatter(formatter)
      if len(logger.handlers) < len(handlers):
          logger.addHandler(handler)

   return logger

if __name__ == '__main__':
    base_logger = create_and_add_handlers(getLogger('Some Name'), log_level)
    ....
    base_logger.info("Job finished")

The main program is being executed as a docker container with the command python3 python_script.py

mosneji avatar Jun 08 '22 20:06 mosneji

Ah also, from time to time, I see this message in end of the logs warning:opencensus.ext.azure.common.transport:request time out. ingestion may be backed up. retrying. but I don't necessarily get it whenever the script cannot exit so I am not sure if it is related.

mosneji avatar Jun 08 '22 20:06 mosneji

@mosneji It is possible that the background metrics exporter thread is hanging due to trying to flush on exit. This will be fixed in the next release so this specific exporter will not flush upon exit (simply attempts to close).

As for the error messages in logs, it's possible the ingestion endpoint is being overloaded, which is usually when this error message is returned. Your telemetry won't be dropped in this case and will be stored for retry at a later time.

lzchen avatar Jun 10 '22 17:06 lzchen

Thank you for looking into this @lzchen. Do you have an estimated release date of the new version? Thanks.

mosneji avatar Jun 13 '22 06:06 mosneji

@mosneji Will be looking to release by the end of this month.

lzchen avatar Jun 14 '22 17:06 lzchen

@lzchen many thanks.

mosneji avatar Jun 15 '22 07:06 mosneji

I have an issue that is maybe related? This happens when the python program finishes and wants to close.

From a pull request 2 weeks ago. This started happening just recently. It works in version 1.1.3, but not in 1.1.4

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\temp\Apps\Anaconda64\lib\logging\__init__.py", line 2141, in shutdown
    h.flush()
  File "...\Anaconda64\lib\site-packages\opencensus\ext\azure\log_exporter\__init__.py", line 109, in flush
    if self._queue.is_empty():
AttributeError: 'Queue' object has no attribute 'is_empty'

self._queue._queue.empty() works on my end 😮

KR Niklas

nosterlu avatar Jun 21 '22 16:06 nosterlu

@nosterlu Just to clarify, in version 1.1.4 you are seeing that error message from time to time but not 1.1.3?

lzchen avatar Jun 21 '22 22:06 lzchen

My code fails every time with 1.1.4.

In version 1.1.3 this check for is_empty does not exist log_exporter_init_.py change for 1.1.4

Digging deeper on my end, it seems I have a different version of opencensus common, that is not compatible with 1.1.4? Because .is_empty is not a member of my in Queue in common.schedule.__init__.py.

I have version 0.8.0 of opencensus, and my scheulde.init.py looks like so.

Hmm, when pip installing 1.1.4, is this maybe not forcing opencensus to move to > 0.8.0?

This is from pip Requirement already satisfied: opencensus-context==0.1.2 in /home/azureuser/actions-runner-analytics/_work/_tool/Python/3.9.1/x64/lib/python3.9/site-packages (from opencensus<1.0.0,>=0.8.0->opencensus-ext-azure==1.1.4

KR Niklas

nosterlu avatar Jun 22 '22 08:06 nosterlu

Thanks for the investigation. Yes 1.1.4 should have bumped the dependency requirement to opencensus >= 0.9.0. I will be doing this as part of next release.

lzchen avatar Jun 22 '22 16:06 lzchen

@lzchen I am now using version 1.1.5 but the issue I mentioned is still happening. Any advice? Thanks.

mosneji avatar Aug 05 '22 07:08 mosneji

@mosneji Could you try upgrading to opencensus-ext-azure 1.1.6?

lzchen avatar Aug 15 '22 17:08 lzchen

Version 1.1.6 seems to fix the problem. Thanks

mosneji avatar Oct 30 '22 08:10 mosneji

Is it fixed ? I have similar problem with the latest version.

waldimen avatar Nov 29 '23 16:11 waldimen