
Avocado's next runner hanging indefinitely at some tests

Open pevogam opened this issue 4 years ago • 24 comments

This is a report of an issue I have been experiencing for some time but did not share until I had removed the LXC container spawners and other functionality that could have been the cause. Right now I am using plain process spawners (so I cannot run in parallel, but it helps rule out causes) and still hit the problem, so it is time to report it. I will also gradually update this issue with more information as I obtain it from further diagnostics.

The issue is unfortunately quite fatal for test runs: a run can hang indefinitely at a given (VT) test, without any timeout or other failure kicking in that would at least let the run go on and finish with fewer casualties. The hanging child process looks like this:

 206621 pts/4    Sl+    3:19              \_ /usr/bin/python3 /usr/bin/avocado-runner-avocado-vt task-run -i 411-mis1-normal.nongui.backup[...] -j 4d18092ffffcad098f48e6d76044d856d3fb8b5e -k avocado-vt -u normal.nongui.backup[...] -c {} name=normal.nongui.backup[...] shortname=normal.nongui.backup[...] [other-params] -s 127.0.0.1:8888

and a simple strace of it reveals a problem with a lock:

# strace -p 206621
strace: Process 206621 attached
futex(0x7fec6d7b5000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY

Some gdb-based debugging shows we are stuck on a logging message (disclaimer: this output is extracted from another occurrence where I had saved it):

(gdb) py-list
  90        def _make_methods(self):
  91            self.acquire = self._semlock.acquire
  92            self.release = self._semlock.release
  93    
  94        def __enter__(self):
 >95            return self._semlock.__enter__()
  96    
  97        def __exit__(self, *args):
  98            return self._semlock.__exit__(*args)
  99    
 100        def __getstate__(self):
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    with self._wlock:
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 158, in emit
    self.queue.put(self.message.get(msg))
  File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
    self.emit(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1845, in _log
    
  File "/usr/lib64/python3.9/logging/__init__.py", line 1434, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python3.9/site-packages/Pyro4/core.py", line 400, in _pyroRelease
    log.debug("connection released")
  File "/usr/lib/python3.9/site-packages/Pyro4/core.py", line 266, in __del__
    self._pyroRelease()
  Garbage-collecting
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 378, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 176, in write
    self.queue.put(self.message.get(buf))
  File "/usr/lib64/python3.9/logging/__init__.py", line 1086, in emit
    stream.write(msg + self.terminator)
  File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
    self.emit(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1845, in _log
    
  File "/usr/lib64/python3.9/logging/__init__.py", line 1512, in log
    self._log(level, msg, args, **kwargs)
  File "/usr/lib/python3.9/site-packages/pyi2ncommon/log_helpers.py", line 310, in _log_no_test
    self._log.log(level, message_formatted, **kwargs)
  File "/usr/lib/python3.9/site-packages/pyi2ncommon/log_helpers.py", line 264, in info
    self._log_no_test(INFO, message, *args, **kwargs)
  File "/mnt/local/intra2net-qa/tp_i2n/tests/backup_restore.py", line 1676, in run
  File "/usr/lib/python3.9/site-packages/avocado_vt/plugins/vt_runner.py", line 900, in run_test
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 571, in _bootstrap
  File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 71, in _launch
    code = process_obj._bootstrap(parent_sentinel=child_r)
  File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib64/python3.9/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib64/python3.9/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 633, in start
  File "/usr/lib/python3.9/site-packages/avocado_vt/plugins/vt_runner.py", line 217, in run
    process.start()
  File "/root/avocado-runner", line 839, in run
    for status in runner.run():
  File "/root/avocado-runner", line 1316, in command_task_run
  File "/root/avocado-runner", line 971, in run
    return kallable(args)
  File "/root/avocado-runner", line 1084, in main
    app.run()
  File "/root/avocado-runner", line 1088, in <module>
    main()

Has anyone experienced such logging hangs with the next runner before? My problem is that I have had cases where such tests were left for days and the test run never progressed.

This issue is fatal enough to block us from using the next runner. It could be related to asyncio, but I will investigate more and let you know if I locate a cause. If anyone has seen this before or has a quick clue, please chime in.

pevogam avatar Sep 03 '21 13:09 pevogam

--Type <RET> for more, q to quit, c to continue without paging--

This caught my attention. Do you recognize this?

beraldoleal avatar Sep 03 '21 13:09 beraldoleal

Sorry, this is a remnant from gdb, it shouldn't be related to the stack trace in any way. I thought I cleaned up these lines when copying but will do so properly now.

pevogam avatar Sep 03 '21 13:09 pevogam

To add some more background:

This issue can happen on different tests and independently of hardware load. It is thus not 100% reproducible, but it is highly likely in runs involving 1000+ tests. Once it does happen, the run will not move on until the hanging test is interrupted, in which case the test is registered as SKIP and the run continues as normal (no errors in follow-up tests). If there are thousands more tests, the hang can happen again in one of them before the run finishes.

pevogam avatar Sep 03 '21 13:09 pevogam

Another stack trace I obtained just now:

  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    with self._wlock:
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 176, in write
    self.queue.put(self.message.get(buf))
  File "/usr/lib64/python3.9/warnings.py", line 30, in _showwarnmsg_impl
    file.write(text)
  File "/usr/lib64/python3.9/warnings.py", line 368, in _showwarnmsg
    
  Garbage-collecting
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 378, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 158, in emit
    self.queue.put(self.message.get(msg))
  File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
    self.emit(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1845, in _log
    
  File "/usr/lib64/python3.9/logging/__init__.py", line 1446, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 649, in _init_subprocess
    log.info("Running '%s'", self.cmd)
  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 952, in run
    self._init_subprocess()
  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 1335, in run
    text = cmd_result.stdout_text

The test was running a command on the host machine with Avocado's process.run() and got stuck here (this stack trace is from a more recent Avocado version than the previous one, but ultimately the lock is in the same location):

  90        def _make_methods(self):
  91            self.acquire = self._semlock.acquire
  92            self.release = self._semlock.release
  93    
  94        def __enter__(self):
 >95            return self._semlock.__enter__()
  96    
  97        def __exit__(self, *args):
  98            return self._semlock.__exit__(*args)
  99    
 100        def __getstate__(self):

It seems that once process.run() was called and wanted to log the command to run

  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 649, in _init_subprocess
    log.info("Running '%s'", self.cmd)

this reached the same deadlock situation in the logging, just like in the stack trace included earlier.

pevogam avatar Sep 17 '21 09:09 pevogam

I think what I have hit here is a very special circumstance, described well here:

https://codewithoutrules.com/2017/08/16/concurrency-python/

The crux of the problem lies in the mix of queues, garbage collection, and the logging currently used by Avocado and many standard applications. If garbage collection happens while we are already inside a queue put, the result is reentrancy that deadlocks the entire application. Note that the author mentions this is a common problem with logging, and all the cases above have both logging and garbage collection happening at some point. This also explains very well why I can get the hang at various seemingly random locations.
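
To make the mechanism concrete outside of Avocado, here is a condensed standalone sketch of my own (the QueueForwardingHandler and Helper names and the drain() consumer are illustrative, not Avocado code) following the same pattern as the traces above. Like the real issue it is probabilistic, so the loop may need many iterations before a GC pass lands inside put():

import logging
import multiprocessing

msg_queue = multiprocessing.SimpleQueue()

class QueueForwardingHandler(logging.Handler):
    # Forwards formatted records over a multiprocessing queue, like the runner does.
    def emit(self, record):
        # put() pickles the payload and, while holding the queue's internal write
        # lock, writes it to a pipe -- the code path seen in the tracebacks above.
        msg_queue.put(self.format(record))

class Helper:
    def __init__(self):
        # Reference cycle: the object is freed only by a cyclic GC pass, so its
        # __del__ runs at an unpredictable moment.
        self.circular = self

    def __del__(self):
        # Logging from a destructor re-enters emit() -> put(); if the GC pass that
        # runs this __del__ fired while another put() already held the write lock,
        # the nested put() blocks on that lock forever.
        logging.getLogger("demo").info("GC time")

def drain(q):
    # Consumer side, standing in for the runner draining task messages.
    while True:
        q.get()

if __name__ == "__main__":
    multiprocessing.Process(target=drain, args=(msg_queue,), daemon=True).start()
    log = logging.getLogger("demo")
    log.addHandler(QueueForwardingHandler())
    log.setLevel(logging.INFO)
    for i in range(100000):
        Helper()
        log.info("iteration %d", i)  # may deadlock when a GC pass interrupts put()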

pevogam avatar Sep 17 '21 16:09 pevogam

Here are some of the stack traces I have collected from different runs with explanation for what happened in each:

Traceback (most recent call first):
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    with self._wlock:
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 176, in write
    self.queue.put(self.message.get(buf))
  File "/usr/lib64/python3.9/warnings.py", line 30, in _showwarnmsg_impl
    file.write(text)
  File "/usr/lib64/python3.9/warnings.py", line 368, in _showwarnmsg
    
  Garbage-collecting
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 378, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 158, in emit
    self.queue.put(self.message.get(msg))
  File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
    self.emit(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1845, in _log
    
  File "/usr/lib64/python3.9/logging/__init__.py", line 1446, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 649, in _init_subprocess
    log.info("Running '%s'", self.cmd)
  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 952, in run
    self._init_subprocess()
  File "/usr/lib/python3.9/site-packages/avocado/utils/process.py", line 1335, in run
    text = cmd_result.stdout_text

Here you can look at the parts before and after the garbage collection. What happened on the Avocado side is that we ran process.run() with some command, which happened to log a "Running ..." message and emit it to the queue. However, right after obtaining the lock for the queue, Python interrupted the process for garbage collection. Once this happened (the "Garbage-collecting" marker in the trace above), some garbage-collected object emitted a Python warning to the logs, which in turn requested the lock for the already locked queue. And as you can guess at this stage, this is a deadlock.

Traceback (most recent call first):
  File "/usr/lib64/python3.9/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 377, in put
    with self._wlock:
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 158, in emit
    self.queue.put(self.message.get(msg))
  File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
    self.emit(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1845, in _log
    
  File "/usr/lib64/python3.9/logging/__init__.py", line 1434, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python3.9/site-packages/Pyro4/core.py", line 400, in _pyroRelease
    log.debug("connection released")
  File "/usr/lib/python3.9/site-packages/Pyro4/core.py", line 266, in __del__
    self._pyroRelease()
  Garbage-collecting
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 378, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.9/site-packages/avocado/core/runners/utils/messages.py", line 158, in emit
    self.queue.put(self.message.get(msg))
  File "/usr/lib64/python3.9/logging/__init__.py", line 952, in handle
    self.emit(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1661, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1599, in handle
    self.callHandlers(record)
  File "/usr/lib64/python3.9/logging/__init__.py", line 1845, in _log
    
  File "/usr/lib64/python3.9/logging/__init__.py", line 1434, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib64/python3.9/logging/__init__.py", line 2107, in debug
    root.debug(msg, *args, **kwargs)
  File "/mnt/local/intra2net-qa/tp_i2n/tests/email_autodelete.py", line 301, in find_mail
    logging.debug('Found {} mails, skipped {}'.format(len(result), n_skipped))
  File "/mnt/local/intra2net-qa/tp_i2n/tests/email_autodelete.py", line 4822, in run

Here we were logging a regular message from an email autodeletion test, but once we entered the queue code, and in particular send_bytes, the process was interrupted for garbage collection of a connection whose socket was being closed. During the closing, a "connection released" logging message was emitted, which needed access to the queue and started waiting for its lock to be released. However, since the pre-GC code is already inside the queue and waiting for the garbage collection to complete, it too waits forever. So, another deadlock.

I have more cases, but I think you get the point by now: any garbage collection that emits a message needing the (supposedly thread-safe) queue will end up in a deadlock. For runs of thousands of tests I usually get at least 2-3 cases like these, making the test runs rather unusable at their current stage, since this is a fatal error the runs cannot recover from, rather than just a few wasted tests.

pevogam avatar Sep 17 '21 17:09 pevogam

@clebergnu @richtja I have noticed you are using SimpleQueue for this on both the Avocado (instrumented tests) and Avocado VT sides. This reentrancy problem is supposed to be fixed as of Python 3.7, and the above stack traces are from Python 3.9, but the issue looks suspiciously similar.

pevogam avatar Sep 17 '21 17:09 pevogam

Here is the CPython discussion about this issue:

https://bugs.python.org/issue14976

And the proposed fix:

https://github.com/python/cpython/commit/94e1696d04c65e19ea52e5c8664079c9d9aa0e54

To make matters more confusing, it seems that you are using multiprocessing.SimpleQueue() as in https://docs.python.org/3/library/multiprocessing.html?highlight=simplequeue#multiprocessing.SimpleQueue and the fix is in queue.SimpleQueue() as in https://docs.python.org/3/library/queue.html?highlight=simplequeue#queue.SimpleQueue:

>>> from multiprocessing import SimpleQueue as s1
>>> from queue import SimpleQueue as s2
>>> s1 == s2
False
>>> s1
<bound method BaseContext.SimpleQueue of <multiprocessing.context.DefaultContext object at 0x7fbe2dd5e040>>
>>> s2
<class '_queue.SimpleQueue'>

I can understand the reasoning behind using a multiprocessing queue rather than a multithreading one, but I am not sure the former is covered by the fix.
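
Just to illustrate the structural difference (my own snippet peeking at private attributes, not Avocado code): the multiprocessing variant serializes objects over a pipe guarded by the _wlock seen in the tracebacks, while the queue variant is an in-process C structure with no Python-level lock to re-enter:

import multiprocessing
import queue

mp_q = multiprocessing.SimpleQueue()  # pipe + multiprocessing.Lock, crosses processes
th_q = queue.SimpleQueue()            # C-level queue, single process only

# The pipe ends and the write lock that the hung processes above are blocked on:
print(mp_q._reader, mp_q._writer, mp_q._wlock)
# The thread-only queue has no such lock to deadlock on:
print(th_q, hasattr(th_q, "_wlock"))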

pevogam avatar Sep 17 '21 18:09 pevogam

Hi @clebergnu by adding this to #93 I assume it won't be fixed or thought about in the 92.0 LTS? It is a fairly dangerous issue as I have tried to underline above but if I understand you right we don't have the resources to investigate now. Is this correct?

pevogam avatar Oct 04 '21 17:10 pevogam

Hi @clebergnu by adding this to #93 I assume it won't be fixed or thought about in the 92.0 LTS? It is a fairly dangerous issue as I have tried to underline above but if I understand you right we don't have the resources to investigate now. Is this correct?

Hi @pevogam, the reason this is not on sprint 92 is that we failed to come up with a reproducer that could, in turn, become an issue marked as a bug to be worked on for 92.0.

But, given that 92.x is going to be LTS, it can still accept bug fixes which will make it into minor releases. So, we can still work on this, fix the issue on the master branch, backport it to the 92lts branch (to be created) and then release 92.1.

Does that make sense?

clebergnu avatar Oct 07 '21 15:10 clebergnu

Hi @clebergnu, as we already have a fix on our side (disabling some destructor logging), this is fine with us. My general point was that this could still affect various people and really catch them by surprise; besides, a test run hanging forever is not a form of graceful failure. But I only mean this for other people: as we have figured out at least part of the problem, we also have a complete solution on our side. Hopefully others stumble upon this report if they experience indefinitely hanging test runs.

Regarding your reproducer comment, I think it should be pretty easy to come up with one considering the findings above: simply adding a tiny extlib with a class Helper that calls logging.info("GC time") during its destruction and a test that makes intensive use of its instance should be enough.

pevogam avatar Oct 07 '21 15:10 pevogam

Regarding your reproducer comment, I think it should be pretty easy to come up with one considering the findings above: simply adding a tiny extlib with a class Helper that calls logging.info("GC time") during its destruction and a test that makes intensive use of its instance should be enough.

Could you please elaborate here? If you could give us the exact code that is causing the issue, that would be great.

I tried with the simple test below, however I could not reproduce it so far:

# test.py
from avocado import Test
from extlib import Helper

class T(Test):
    def test_foo(self):
        for i in range(1, 100000):
            h = Helper()
            self.assertTrue(h.true())
# extlib.py
import logging

class Helper:
    def true(self):
        return True

    def __del__(self):
        logging.info("GC time")

beraldoleal avatar Oct 07 '21 16:10 beraldoleal

Hmmm ok, I will add the reproducer here even though as I already said before it is fairly straightforward and already well explained in a lot of detail in https://codewithoutrules.com/2017/08/16/concurrency-python/. The main reason I did the research above was because the issue is very tough to realize at first but easy to comprehend once you read the findings and ultimately very intuitive to think about if you have been through all the sources and details I provided.

First things first, the logging in the example is not wired up at the moment: the above won't print anything into the test's debug.log and thus never reaches the queue. So we need the current logging fix in both modules:

import logging as log
logging = log.getLogger('avocado.test')

Then, once this is done, we also have to cause the logging collision described above, which requires queue reentrancy. This means we have to emit another logging message (here the "Iteration" one) outside of the destructor:

from avocado import Test
from extlib import Helper

import logging as log
logging = log.getLogger('avocado.test')

class T(Test):
    def test_foo(self):
        for i in range(1, 100000):
            h = Helper()
            self.assertTrue(h.true())
            logging.info(f"Iteration {i}")

Finally, we need a self-reference (here "circular") to postpone garbage collection of the object (the reasoning is again described in much detail in the link):

import logging as log
logging = log.getLogger('avocado.test')

class Helper:
    def __init__(self):
        self.circular = self

    def true(self):
        return True

    def __del__(self):
        logging.info("GC time")

The resulting hang in the run happened fairly quickly:

[stdlog] 2021-10-07 20:46:08,534 test             L0312 INFO | INIT 1-T.test_foo
[stdlog] 2021-10-07 20:46:08,536 parameters       L0142 DEBUG| PARAMS (key=timeout, path=*, default=None) => None
[stdlog] 2021-10-07 20:46:08,536 test             L0340 DEBUG| Test metadata:
[stdlog] 2021-10-07 20:46:08,537 test             L0342 DEBUG|   filename: /root/test.py
[stdlog] 2021-10-07 20:46:08,537 test             L0348 DEBUG|   teststmpdir: /var/tmp/avocado_wt87iyhu
[stdlog] 2021-10-07 20:46:08,538 test             L0538 INFO | START 1-T.test_foo
[stdlog] 2021-10-07 20:46:08,538 test             L0207 DEBUG| DATA (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
[stdlog] 2021-10-07 20:46:08,539 test             L0012 INFO | Iteration 1
[stdlog] 2021-10-07 20:46:08,539 test             L0012 INFO | Iteration 2
[stdlog] 2021-10-07 20:46:08,540 test             L0012 INFO | Iteration 3
[stdlog] 2021-10-07 20:46:08,540 test             L0012 INFO | Iteration 4
[stdlog] 2021-10-07 20:46:08,540 test             L0012 INFO | Iteration 5
[stdlog] 2021-10-07 20:46:08,540 test             L0012 INFO | Iteration 6
[stdlog] 2021-10-07 20:46:08,540 test             L0012 INFO | Iteration 7
[stdlog] 2021-10-07 20:46:08,541 test             L0012 INFO | Iteration 8
[stdlog] 2021-10-07 20:46:08,541 test             L0012 INFO | Iteration 9
[stdlog] 2021-10-07 20:46:08,541 test             L0012 INFO | Iteration 10
[stdlog] 2021-10-07 20:46:08,541 test             L0012 INFO | Iteration 11
[stdlog] 2021-10-07 20:46:08,541 test             L0012 INFO | Iteration 12
[stdlog] 2021-10-07 20:46:08,541 test             L0012 INFO | Iteration 13
[stdlog] 2021-10-07 20:46:08,542 test             L0012 INFO | Iteration 14
[stdlog] 2021-10-07 20:46:08,542 test             L0012 INFO | Iteration 15
[stdlog] 2021-10-07 20:46:08,542 test             L0012 INFO | Iteration 16
[stdlog] 2021-10-07 20:46:08,542 test             L0012 INFO | Iteration 17
[stdlog] 2021-10-07 20:46:08,542 test             L0012 INFO | Iteration 18
[stdlog] 2021-10-07 20:46:08,543 test             L0012 INFO | Iteration 19
[stdlog] 2021-10-07 20:46:08,543 test             L0012 INFO | Iteration 20
[stdlog] 2021-10-07 20:46:08,543 test             L0012 INFO | Iteration 21
[stdlog] 2021-10-07 20:46:08,543 test             L0012 INFO | Iteration 22
[stdlog] 2021-10-07 20:46:08,543 test             L0012 INFO | Iteration 23
[stdlog] 2021-10-07 20:46:08,544 test             L0012 INFO | Iteration 24

and the run never continued beyond this iteration. A quick strace confirmation shows the same signature as above, namely a deadlock:

# strace -p 14428
strace: Process 14428 attached
futex(0x7ff69031c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
futex(0x7ff69031c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
futex(0x7ff69031c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 14428 detached

For more details and history on the issue please see above where I have added various sources and explanations, including a confirmation from the stack traces.

pevogam avatar Oct 07 '21 18:10 pevogam

Hmmm ok, I will add the reproducer here even though as I already said before it is fairly straightforward and already well explained in a lot of detail in https://codewithoutrules.com/2017/08/16/concurrency-python/. The main reason I did the research above was because the issue is very tough to realize at first but easy to comprehend once you read the findings and ultimately very intuitive to think about if you have been through all the sources and details I provided.

Hi @pevogam,

Thanks for going through the extensive steps to document and debug this problem. But let me put those in perspective. You contributed with:

  • An initial description, where your example suggests the problem only happens with "avocado-vt" which itself has its quirks for some.
  • Suggestions that the problem could be with asyncio, logging message, and at the same time not 100% reproducible.
  • More debugging, you conclude that this could be a "very special circumstance" and could be related to a mix of queues, garbage collection, and the current Avocado's logging system.
  • Then you posted here a few external links (including issues, discussions, patches, and an article).

Only after all that work, you conclude that it was pretty easy to come up with a reproducible, when asked for one, but you didn't mention the circular reference and other details, assuming all contributors on this thread would be on the same page as you. The “turning point” was really the reproducer, complete with the logging information at the garbage collection time.

Contributors willing to take your problem will need to go one extra mile to come up with a reproducible and have the same conclusion as you had. Also, it is a good practice that we find everything needed to fix an issue inside the same issue, this will speed up the resolution process. That was the main reason we added an issue template in our repository asking for a reproducible, whenever possible. So that the community can work better together. And IMO a reproducer has an important role here.

So, it is ok to not bring to us a reproducible, but please keep in mind your issues might take a longer time to be solved.

That being said, I finally understood the problem and was able to reproduce it. We could replace our multiprocessing.SimpleQueue with queue.SimpleQueue. Having a quick look, they look very similar in terms of API, but I would need more time to double-check all our cases.
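
As a first rough check on the API similarity (just a dir() comparison on my side, nothing authoritative):

import multiprocessing
import queue

mp_api = {n for n in dir(multiprocessing.SimpleQueue()) if not n.startswith("_")}
th_api = {n for n in dir(queue.SimpleQueue()) if not n.startswith("_")}

print(sorted(mp_api))           # close, empty, get, put (Python 3.9)
print(sorted(th_api))           # empty, get, get_nowait, put, put_nowait, qsize
print(sorted(mp_api & th_api))  # empty, get, put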

Again, thanks for your persistence here.

beraldoleal avatar Oct 08 '21 20:10 beraldoleal

Thanks for going through the extensive steps to document and debug this problem. But let me put those in perspective. You contributed with:

* An initial description, where your example suggests the problem only happens with "avocado-vt" which itself has its quirks for some.

* Suggestions that the problem could be with asyncio, logging message, and at the same time not 100% reproducible.

* More debugging, you conclude that this could be a "very special circumstance" and could be related to a mix of queues, garbage collection, and the current Avocado's logging system.

* Then you posted here a few external links (including issues, discussions, patches, and an article).

Indeed, there were quite many steps but this is because I went through the investigation and posted my findings along the way. It might not be perfect but I guess it is better than not posting any results and leaving all the effort to the maintainers.

Only after all that work, you conclude that it was pretty easy to come up with a reproducible, when asked for one, but you didn't mention the circular reference and other details, assuming all contributors on this thread would be on the same page as you. The “turning point” was really the reproducer, complete with the logging information at the garbage collection time.

From your paragraphs here it seems to me that you never followed the links which are crucial details. And in complex issues like these I doubt I can paste the entire article and its findings here in order to follow the practice of having everything within the issue page. In some cases such behaviors are simply not feasible if not impractical.

Contributors willing to take your problem will need to go one extra mile to come up with a reproducible and have the same conclusion as you had. Also, it is a good practice that we find everything needed to fix an issue inside the same issue, this will speed up the resolution process. That was the main reason we added an issue template in our repository asking for a reproducible, whenever possible. So that the community can work better together. And IMO a reproducer has an important role here.

So, it is ok to not bring to us a reproducible, but please keep in mind your issues might take a longer time to be solved.

Please bear with me here - as I put time and effort to explain everything in as much detail and references as I could I think it is natural to expect reciprocal amount of attention to the details I provided (especially considering the issue is not straightforward to debug) and everybody could feel somewhat unhappy otherwise. The reproducer code is something I keep hearing about only in the last weeks/months and it could be made a standard requirement but it remains new nevertheless. As maintainers however please be careful not to brush through an issue with sufficient details (incl. easy reproducer code in the references) just because your minds can grow accustomed to not diving into issues with only an explanation.

We already concluded that this issue probably won't make it in the LTS release and on our side we already have a working solution so I am fine with the issue taking longer to think about, read through carefully, and potentially be fixed. I hope there won't be many people out there becoming victims of such indefinite hanging even though as the main reference article says, many people using excessive logging might. Thus, please take your time and if you have any later spare cycles we can look into this issue in more detail whenever there is less pressure.

That being said, I finally understood the problem and was able to reproduce it. We could replace our multiprocessing.SimpleQueue with queue.SimpleQueue. Having a quick look, they look very similar in terms of API, but I would need more time to double-check all our cases.

I am not sure this will help since I vaguely recall I tried something like it but there were additional differences involved.

pevogam avatar Oct 08 '21 20:10 pevogam

Thanks for going through the extensive steps to document and debug this problem. But let me put those in perspective. You contributed with:

* An initial description, where your example suggests the problem only happens with "avocado-vt" which itself has its quirks for some.

* Suggestions that the problem could be with asyncio, logging message, and at the same time not 100% reproducible.

* More debugging, you conclude that this could be a "very special circumstance" and could be related to a mix of queues, garbage collection, and the current Avocado's logging system.

* Then you posted here a few external links (including issues, discussions, patches, and an article).

Indeed, there were quite many steps but this is because I went through the investigation and posted my findings along the way. It might not be perfect but I guess it is better than not posting any results and leaving all the effort to the maintainers.

As I started my reply "Thanks for going through the extensive steps to document and debug this problem". I'm not questioning the value here.

Only after all that work, you conclude that it was pretty easy to come up with a reproducible, when asked for one, but you didn't mention the circular reference and other details, assuming all contributors on this thread would be on the same page as you. The “turning point” was really the reproducer, complete with the logging information at the garbage collection time.

From your paragraphs here it seems to me that you never followed the links which are crucial details. And in complex issues like these I doubt I can paste the entire article and its findings here in order to follow the practice of having everything within the issue page. In some cases such behaviors are simply not feasible if not impractical.

My reply was meant to help you get more attention to the issues you report. Others might not have time for doing this, and it is natural that the interest in solving an issue is inversely proportional to the number of steps needed to reproduce it (or the number of links you have to follow). Of course, any links and references are helpful. We also understand that in some cases we can't provide details. To account for that, we added the section "Additional information" to our template and we kept the traditional "blank issue" (without a template), so we can cover both cases.

Contributors willing to take your problem will need to go one extra mile to come up with a reproducible and have the same conclusion as you had. Also, it is a good practice that we find everything needed to fix an issue inside the same issue, this will speed up the resolution process. That was the main reason we added an issue template in our repository asking for a reproducible, whenever possible. So that the community can work better together. And IMO a reproducer has an important role here. So, it is ok to not bring to us a reproducible, but please keep in mind your issues might take a longer time to be solved.

Please bear with me here - as I put time and effort to explain everything in as much detail and references as I could I think it is natural to expect reciprocal amount of attention to the details I provided (especially considering the issue is not straightforward to debug) and everybody could feel somewhat unhappy otherwise. The reproducer code is something I keep hearing about only in the last weeks/months and it could be made a standard requirement but it remains new nevertheless. As maintainers however please be careful not to brush through an issue with sufficient details (incl. easy reproducer code in the references) just because your minds can grow accustomed to not diving into issues with only an explanation.

You are drawing conclusions on your own. We have been using these 'explanation issues' for a long time (and probably we are going to keep using them for some cases). But as the project grows we quickly realized that this does not scale, especially with a small team. We also understand that this is something new, and that is why we are trying it and pushing for it whenever possible, so we can prove its value. But keep in mind that in this case we were not pushing you to follow the template. Here, we just asked for the reproducible when you stated that "it was easy". My final suggestion here is: always assume good intentions; sometimes we are just busy solving other issues.

beraldoleal avatar Oct 11 '21 11:10 beraldoleal

Only after all that work, you conclude that it was pretty easy to come up with a reproducible, when asked for one, but you didn't mention the circular reference and other details, assuming all contributors on this thread would be on the same page as you. The “turning point” was really the reproducer, complete with the logging information at the garbage collection time.

From your paragraphs here it seems to me that you never followed the links which are crucial details. And in complex issues like these I doubt I can paste the entire article and its findings here in order to follow the practice of having everything within the issue page. In some cases such behaviors are simply not feasible if not impractical.

My reply was meant to help you get more attention to the issues you report. Others might not have time for doing this, and it is natural that the interest in solving an issue is inversely proportional to the number of steps needed to reproduce it (or the number of links you have to follow). Of course, any links and references are helpful. We also understand that in some cases we can't provide details. To account for that, we added the section "Additional information" to our template and we kept the traditional "blank issue" (without a template), so we can cover both cases.

Yes, the inversely proportional interest makes sense. This is why I tried to compensate for it by underlining the criticality of the bug for the general user.

I am having trouble finding this template. Is it suggested automatically by GitHub itself when reporting an issue? All the issues I have reported so far did not have it, so I can't comment much on it.

Contributors willing to take your problem will need to go one extra mile to come up with a reproducible and have the same conclusion as you had. Also, it is a good practice that we find everything needed to fix an issue inside the same issue, this will speed up the resolution process. That was the main reason we added an issue template in our repository asking for a reproducible, whenever possible. So that the community can work better together. And IMO a reproducer has an important role here. So, it is ok to not bring to us a reproducible, but please keep in mind your issues might take a longer time to be solved.

Please bear with me here - as I put time and effort to explain everything in as much detail and references as I could I think it is natural to expect reciprocal amount of attention to the details I provided (especially considering the issue is not straightforward to debug) and everybody could feel somewhat unhappy otherwise. The reproducer code is something I keep hearing about only in the last weeks/months and it could be made a standard requirement but it remains new nevertheless. As maintainers however please be careful not to brush through an issue with sufficient details (incl. easy reproducer code in the references) just because your minds can grow accustomed to not diving into issues with only an explanation.

You are drawing conclusions on your own. We have been using these 'explanation issues' for a long time (and probably we are going to keep using them for some cases). But as the project grows we quickly realized that this does not scale, especially with a small team. We also understand that this is something new, and that is why we are trying it and pushing for it whenever possible, so we can prove its value. But keep in mind that in this case we were not pushing you to follow the template.

Actually, if I had had a template I think I would have preferred that. I assume there are plans to either document the template (and announce it; I apologize if I simply do not know about it) or to simply make it the default for reporting new issues.

Here, we just asked for the reproducible when you stated that "it was easy". My final suggestion here is: always assume good intentions; sometimes we are just busy solving other issues.

I think it is possible that my insisting on attention to this issue because of its criticality created unnecessary pressure, leading to the findings above being examined less carefully in the hope of a quicker solution like the reproducer code above. Sorry if that was the case. In any case, we are on the same page now, so we can focus on resolving the actual issue later on.

pevogam avatar Oct 11 '21 14:10 pevogam

This feels very similar to this: https://github.com/avocado-framework/avocado/pull/4648/commits/200c343c967f4bc2e971cac54de4e4897b5d2a34

clebergnu avatar Oct 27 '21 14:10 clebergnu

@pevogam #5112 might be a fix. I could not reproduce this after this small change. Can you please double-check?

However, two tests on optional_plugins are failing locally, but I could not find a relationship with this change. I will let the CI run to see if it is something with my env.

check-1021-optional_plugins/varianter_yaml_to_mux/tests/test_functional.py:MultiplexTests.test_run_mplex_passtest FAIL
check-1028-optional_plugins/varianter_yaml_to_mux/tests/test_functional.py:MultiplexTests.test_mux_inject FAIL

beraldoleal avatar Nov 09 '21 21:11 beraldoleal

For some reason, I'm triggering this easily on my setup while attempting to package this for GNU Guix. It hangs while running its own test suite. The deadlock happens both in the containerized build environment and outside of it. On a powerful 24-core machine it hangs after all tests have run, while on a second, older 4-core desktop it seems to hang consistently on the 456th test:

 (unit-493/656) selftests/unit/utils/test_iso9660.py:Capabilities.test_capabilities_pycdlib: STARTED
 (unit-492/656) selftests/unit/utils/test_genio.py:TestGenio.test_are_files_equal: STARTED
 (unit-492/656) selftests/unit/utils/test_genio.py:TestGenio.test_are_files_equal:  PASS (0.56 s)
 (unit-494/656) selftests/unit/utils/test_iso9660.py:Capabilities.test_capabilities_nobackend: STARTED
 (unit-495/656) selftests/unit/utils/test_iso9660.py:Capabilities.test_non_existing_capabilities: STARTED
 (unit-496/656) selftests/unit/utils/test_iso9660.py:IsoInfo.test_basic_workflow: STARTED

The tail of the strace log seems to indicate the problem is the same as encountered by the original poster:

17383 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
17374 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
17374 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
17381 <... pselect6 resumed>)           = 0 (Timeout)
17381 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
17381 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
17389 <... pselect6 resumed>)           = 0 (Timeout)
17383 <... pselect6 resumed>)           = 0 (Timeout)
17389 poll([{fd=3, events=POLLIN}], 1, 0) = 0 (Timeout)
17383 sendto(6, "{\"status\": \"running\", \"time\": 58"..., 201, 0, NULL, 0 <unfinished ...>
17374 <... pselect6 resumed>)           = 0 (Timeout)
17389 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
17383 <... sendto resumed>)             = 201
11779 <... epoll_wait resumed>[{events=EPOLLIN, data={u32=13, u64=140733193388045}}], 6, 49) = 1
17374 poll([{fd=3, events=POLLIN}], 1, 0 <unfinished ...>
17383 poll([{fd=3, events=POLLIN}], 1, 0 <unfinished ...>
17374 <... poll resumed>)               = 0 (Timeout)
17383 <... poll resumed>)               = 0 (Timeout)
17374 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
11779 recvfrom(13,  <unfinished ...>
17383 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
11779 <... recvfrom resumed>"{\"status\": \"running\", \"time\": 58"..., 262144, 0, NULL, NULL) = 201
11779 epoll_wait(7, [], 6, 0)           = 0
11779 epoll_wait(7, 

The version of Python used is 3.9.9 and I'm using the latest avocado release, 96.0.

apteryks avatar May 04 '22 21:05 apteryks

I suspect multiprocessing.SimpleQueue never received the fix mentioned above that was made against queue.SimpleQueue, looking at CPython's source.

apteryks avatar May 05 '22 03:05 apteryks

Yes, I am not sure the issue was fixed either; on our side we have simply disabled some logging and other behaviors in order to adapt to this.
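
For reference, the kind of change we made is roughly like the sketch below (an illustration of the approach only, not our actual patch; the Helper class here is hypothetical): log at an explicit close step instead of from __del__, so a GC pass can never re-enter the message queue.

import logging

log = logging.getLogger("avocado.test")

class Helper:
    def close(self):
        # Emit the message at a well-defined point in the test flow instead.
        log.debug("connection released")

    def __del__(self):
        # Intentionally no logging here: destructors can run during garbage
        # collection while the runner's queue lock is already held.
        pass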

pevogam avatar May 05 '22 04:05 pevogam

@pevogam I see :-/. The test suite also hangs the same using the latest commit, 32626243d969ad4bc62a26e006e9e0e43e738411.

apteryks avatar May 05 '22 11:05 apteryks

I suspect multiprocessing.SimpleQueue never received the fix mentioned above that was made against queue.SimpleQueue, looking at CPython's source.

I can confirm that; on my side I was thinking of testing a quick diff like

-            queue = multiprocessing.SimpleQueue()
+            import queue
+            queue = queue.SimpleQueue()

in the multiple locations where multiprocessing.SimpleQueue was used, while the Python fix was actually for queue.SimpleQueue.
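
One of the "additional differences" I had in mind is that queue.SimpleQueue never crosses the process boundary, which the runner relies on. A small sketch of my own (assuming the default fork start method on Linux; none of this is Avocado code):

import multiprocessing
import queue

def child(q):
    q.put("hello from the child")

if __name__ == "__main__":
    thread_q = queue.SimpleQueue()
    p = multiprocessing.Process(target=child, args=(thread_q,))
    p.start()
    p.join()
    # With fork, the child only filled its own copy of the in-process queue.
    print("thread queue empty in parent:", thread_q.empty())  # True

    proc_q = multiprocessing.SimpleQueue()
    p = multiprocessing.Process(target=child, args=(proc_q,))
    p.start()
    p.join()
    # The multiprocessing queue actually carries the message over a pipe.
    print("process queue received:", proc_q.get())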

pevogam avatar Jun 07 '22 03:06 pevogam