rubicon-objc icon indicating copy to clipboard operation
rubicon-objc copied to clipboard

Async tests sometimes randomly cause errors

Open dgelessus opened this issue 5 years ago • 3 comments

The async tests (tests/test_async.py) sometimes cause errors for no apparent reason. For example, this error message appears every now and then for me locally and on Travis:

test_call_at (tests.test_async.AsyncCallTests) ... Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 234, in 'calling callback function'
  File ".../rubicon-objc/rubicon/objc/eventloop.py", line 197, in _cf_socket_callback
    callback, args = self._writer
TypeError: 'NoneType' object is not iterable

This is an exception that was raised from inside a ctypes callback function, which means that it cannot propagate normally. When this happens, ctypes prints out the exception, but it does not terminate Python like uncaught exceptions normally do, and the test in question does not count as failed. Because of this, the error has been more or less ignored so far (or perhaps not even noticed when it appeared).

Another error that I've just come across is this:

FAIL: test_run_forever (tests.test_async.AsyncRunTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/travis/build/beeware/rubicon-objc/tests/test_async.py", line 56, in test_run_forever
    self.assertEqual(results2, [0, 1, 2, 3])
AssertionError: Lists differ: [0, 1, 2] != [0, 1, 2, 3]
Second list contains 1 additional elements.
First extra element 3:
3
- [0, 1, 2]
+ [0, 1, 2, 3]
?         +++

Unlike the first error, this is an actual test failure.

I'm really not familiar with the async-related parts of Rubicon, so I don't know what could be going wrong here. I'm not sure if these two errors are even related, but I'm reporting them together, because they are both related to some sort of timing problem or race condition that is not noticeable very often.

dgelessus avatar May 25 '19 00:05 dgelessus

@dgelessus I'm not 100% convinced these two are the same problem. The second one could be explained by task scheduling; test_run_forever is predicated on the idea that the "do_stuff" will start immediately, and the amount of work scheduled for the task should take less than 0.5 seconds - but if, for some reason, the longer do_stuff task isn't scheduled immediately, it might end up taking longer (in wall clock time) to execute, and exceed the 0.5 second delay that stop_loop() is enforcing.

That might be caused by some manifestation of the first problem - but I think it's just as likely as some random occasional slowdown in signal processing (and thus async task allocation). If we were to increase the timeout to something more than 0.5 seconds, I'd suspect we'd see that test failure disappear.

As for the first failure - I think you're right that it's a race condition, I'd guess something around the cancellation of socket handles and/or the disabling of the socket write operation. The naïve fix would be to check to see if self._writer is None before unpacking; if it is, then callback can be set to None, which means the callback will be a no-op. I would be interested to see if this fix starts exposing a reliable test failure, rather than 'silent failure with exception' that we're currently seeing.

In both cases - What sort of frequency do you see these failures? Do you have any reliably way of increasing the frequency of the failure? e.g., is it more frequent if your CPU or network is under load? Or if the disk is thrashing? I've tried reproducing them locally by running the async tests over and over again, but I haven't seen any manifestation of the problem...

freakboy3742 avatar May 25 '19 04:05 freakboy3742

The first error occurs quite often for me, it usually only takes one or two full Tox runs for it to appear for at least one Python version. (It's not specific to any particular Python version - testing all of them simply makes it more likely to appear.) My machine isn't under any unusual load (the only major CPU load is the test suite, disk activity is not an issue since I have an SSD, and there's no noticeable network activity). The error also appears on Travis sometimes, which should also have pretty average system load.

The second error is new to me and I haven't been able to reproduce it properly. I saw it in one of the Travis builds for #136, but the original log is now gone since I restarted the build in question.

dgelessus avatar May 25 '19 11:05 dgelessus

FYI - the first of these two errors ('NoneType' object is not iterable) is corrected by #238. I'm fairly certain the second problem is entirely down to task scheduling, and as such, probably isn't fixable. On that basis, I'm not sure if we should close this bug "as fixed as it's going to get", or leave it open in case anyone sees the behaviour in future.

freakboy3742 avatar Jan 20 '23 02:01 freakboy3742