ola icon indicating copy to clipboard operation
ola copied to clipboard

Python ClientWrapperTest fails occasionally

Open danielskeenan opened this issue 4 years ago • 9 comments
trafficstars

This has apparently been a known issue for a while, but I've narrowed down the circumstances it can happen in while working on OpenLightingProject/website#14. Warning - it's a bit involved.

When protobuf is installed from pip (i.e. pip3 install protobuf), ClientWrapperTest fails. When protobuf is installed from the system package manager (i.e. apt-get install python3-protobuf), the test passes. Checked with the latest commit in master (a95a6f7 currently) with Ubuntu 20.10 and Ubuntu 20.04. I haven't properly checked Fedora yet (only a cursory test while writing docs), but I suspect it will have the same issue for reasons detailed below.

Some speculation:

My first thought was an issue between Python2 packages and Python3 packages. These test systems only have Python3, so that's not it.

Further research reveals that GNOME distros (or, at least, Ubuntu and Fedora) come with gnome-online-accounts, which itself depends on python3-protobuf. This means that GNOME users will likely have protobuf for python already installed on their systems. If they install deps from pip (e.g. with pip3 install protobuf numpy), pip won't install another protobuf as it sees the system version already. This also explains why the issue occurs on headless systems (i.e. Travis), as those systems won't have the preinstalled python protobuf and will install it from pip if you ask.

But what's the difference between the two sources? My instinct is that something changed in newer Protobuf releases (I know this is a regular problem here). The latest release on pip is 3.14.0. Ubuntu 20.10 will get you 3.12.3. Ubuntu 20.04 provides 3.6.1. Fedora 33 has 3.12.4. Fedora 32 has 3.11.2.

To confirm this theory, I installed protobuf 3.12.4 from pip and ran tests, which all passed. protobuf 3.13.0 also passes. This leads me to conclude something about protobuf 3.14 is causing problems. Caveat - I havn't tested any of this with Python2.

Test log:

==================================
   OLA 0.10.8: ./test-suite.log
==================================

# TOTAL: 95
# PASS:  94
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: python/ola/ClientWrapperTest.sh
=====================================

...F/usr/lib/python3.8/unittest/case.py:704: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0>
  outcome.errors.clear()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
======================================================================
FAIL: testEventLoop (__main__.ClientWrapperTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "./python/ola/ClientWrapperTest.py", line 195, in testEventLoop
    self.assertTrue(results.d_called - self.start >=
AssertionError: False is not true

----------------------------------------------------------------------
Ran 5 tests in 0.039s

FAILED (failures=1)
FAIL python/ola/ClientWrapperTest.sh (exit status: 1)

danielskeenan avatar Feb 18 '21 18:02 danielskeenan

Well investigated. However, it might be related to something else. On my machine (Gentoo Linux, protobuf-3.14.0 installed via system package manager), the test passes:

PASS: python/ola/ClientWrapperTest.sh

Protobuf (and its Python bindings) are already "stable" on Gentoo Linux (at least for amd64): https://packages.gentoo.org/packages/dev-libs/protobuf and https://packages.gentoo.org/packages/dev-python/protobuf-python. I do have pip installed but I actually only use it if I have to. Python versions installed are:

2.7.18_p7(2.7)(09:40:44 02/17/21)(bluetooth examples gdbm ipv6 ncurses readline sqlite ssl threads tk wide-unicode xml -berkdb -build -hardened -libressl -verify-sig -wininst ELIBC="-uclibc")
3.8.8_rc1(3.8)^t(14:23:18 02/18/21)(bluetooth examples gdbm ipv6 ncurses readline sqlite ssl tk xml -build -hardened -libressl -test -verify-sig -wininst)
3.9.2_rc1(3.9)^t(14:26:52 02/18/21)(bluetooth examples gdbm ipv6 ncurses readline sqlite ssl tk xml -build -hardened -libressl -test -verify-sig -wininst)

But basically no package (including protobuf) install bindings for Python 2 anymore.

Configuration of ola-0.10.8 goes like that:

./configure --prefix=/usr --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share --sysconfdir=/etc --localstatedir=/var/lib --disable-dependency-tracking --disable-silent-rules --docdir=/usr/share/doc/ola-0.10.8 --htmldir=/usr/share/doc/ola-0.10.8/html --with-sysroot=/ --libdir=/usr/lib64 --disable-fatal-warnings --disable-osc --enable-examples --enable-libftdi --enable-ftdidmx --enable-http --enable-python-libs --enable-libusb
...
checking whether /usr/bin/python3.8 version is >= 2.6... yes
checking for /usr/bin/python3.8 version... 3.8
checking for /usr/bin/python3.8 platform... linux
checking for /usr/bin/python3.8 script directory... ${prefix}/lib/python3.8/site-packages
checking for /usr/bin/python3.8 extension module directory... ${exec_prefix}/lib/python3.8/site-packages
checking for python3.8 module: google.protobuf... yes
checking for flake8... no
checking for libprotobuf... yes
checking for protoc... /usr/bin/protoc
checking protoc version... 3.14.0
...

Gentoo Linux is a bit exotic for people not used to it. However, if you want I can provide you a Dockerfile so you can reproduce and/or continue investigating with that if you'd like to.

kripton avatar Feb 18 '21 18:02 kripton

Just tested also with the latest master (commit a95a6f7d43ac02b1b0103e99f24cfd8d7eec28fb) and the tests are also all PASSing. However, my system runs fewer tests than yours:

# TOTAL: 94
# PASS:  94

kripton avatar Feb 18 '21 19:02 kripton

That matches my experience on Fedora with protobuf 3.14, so it appears the issue is confined to Ubuntu. Going to start digging through changelogs.

danielskeenan avatar Feb 18 '21 19:02 danielskeenan

I tweaked the failing test a bit to get better info:

diff --git a/python/ola/ClientWrapperTest.py b/python/ola/ClientWrapperTest.py
index cd1e5cf2d..a24041d15 100644
--- a/python/ola/ClientWrapperTest.py
+++ b/python/ola/ClientWrapperTest.py
@@ -186,13 +186,13 @@ class ClientWrapperTest(unittest.TestCase):
     self.assertIsNotNone(results.c_called)
     self.assertIsNotNone(results.d_called)
 
-    self.assertTrue(results.a_called - self.start <
+    self.assertLess(results.a_called - self.start,
                     datetime.timedelta(milliseconds=5))
-    self.assertTrue(results.b_called - self.start >=
+    self.assertGreaterEqual(results.b_called - self.start,
                     datetime.timedelta(milliseconds=5))
-    self.assertTrue(results.c_called - self.start >=
+    self.assertGreaterEqual(results.c_called - self.start,
                     datetime.timedelta(milliseconds=10))
-    self.assertTrue(results.d_called - self.start >=
+    self.assertGreaterEqual(results.d_called - self.start,
                     datetime.timedelta(milliseconds=15))
 
     sockets[0].close()

With the new test that says what the difference is, it fails with:

AssertionError: datetime.timedelta(microseconds=14612) not greater than or equal to datetime.timedelta(microseconds=15000)

However, I ran the same test a couple more times, without changing anything, and it passed. Are we just dealing with system scheduler variances?

danielskeenan avatar Feb 18 '21 20:02 danielskeenan

Glad you're investigating. Though I don't know what protobuf could have to do with either of those failures.

On Thu, Feb 18, 2021 at 1:09 PM Dan Keenan [email protected] wrote:

FAIL: python/ola/ClientWrapperTest.sh

...F/usr/lib/python3.8/unittest/case.py:704: ResourceWarning: unclosed <socket.socket fd=4, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0> outcome.errors.clear() ResourceWarning: Enable tracemalloc to get the object allocation traceback ..

Glancing at the tests, the last test (testSend) doesn't have the sockets[0].close() sockets[1].close() at the end. That looks like my mistake.

====================================================================== FAIL: testEventLoop (main.ClientWrapperTest)

Traceback (most recent call last): File "./python/ola/ClientWrapperTest.py", line 195, in testEventLoop self.assertTrue(results.d_called - self.start >= AssertionError: False is not true

Only way I see that happening would be a strange rounding error, but I'd be really surprised it's repeatable. Curious what d_called and start are. I considered putting in checks that were a bit less than the requested delays (so 14ms here) but never saw it fail.

Bruce

brucelowekamp avatar Feb 18 '21 20:02 brucelowekamp

(messages overlapped...)

I don't see how, as the check for when to call the task is explicit in the code (it's not just a sleep): return self._run_at < now but I wonder if there's some issue with timer precision underlying the python implementation giving different results in different places.

I would just subtract 1 from each of the comparisons and leave it at that.

Bruce

On Thu, Feb 18, 2021 at 3:26 PM Dan Keenan [email protected] wrote:

I tweaked the failing test a bit to get better info:

diff --git a/python/ola/ClientWrapperTest.py b/python/ola/ClientWrapperTest.py index cd1e5cf2d..a24041d15 100644--- a/python/ola/ClientWrapperTest.py+++ b/python/ola/ClientWrapperTest.py@@ -186,13 +186,13 @@ class ClientWrapperTest(unittest.TestCase): self.assertIsNotNone(results.c_called) self.assertIsNotNone(results.d_called)

  • self.assertTrue(results.a_called - self.start <+ self.assertLess(results.a_called - self.start, datetime.timedelta(milliseconds=5))- self.assertTrue(results.b_called - self.start >=+ self.assertGreaterEqual(results.b_called - self.start, datetime.timedelta(milliseconds=5))- self.assertTrue(results.c_called - self.start >=+ self.assertGreaterEqual(results.c_called - self.start, datetime.timedelta(milliseconds=10))- self.assertTrue(results.d_called - self.start >=+ self.assertGreaterEqual(results.d_called - self.start, datetime.timedelta(milliseconds=15))
 sockets[0].close()

With the new test that says what the difference is, it fails with:

AssertionError: datetime.timedelta(microseconds=14612) not greater than or equal to datetime.timedelta(microseconds=15000)

However, I ran the same test a couple more times, without changing anything, and it passed. Are we just dealing with system scheduler variances?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OpenLightingProject/ola/issues/1714#issuecomment-781613819, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADNHN5NWKT2RYBDWBEQSH5TS7VZXNANCNFSM4X2YUA5A .

brucelowekamp avatar Feb 18 '21 20:02 brucelowekamp

I've got a VM rebuilding the project on a single core to see what happens (it was difficult to get it to fail locally on my 12-core i7). Will work on a PR that relaxes the timing slightly.

danielskeenan avatar Feb 18 '21 20:02 danielskeenan

Oops, I thought I'd raised this elsewhere and tagged @brucelowekamp for some input but clearly not.

I hit the same thing on the Arm builds on Travis, but only confusingly at some point later the commits to get them merged worked fine, so I guess after some system change or similar, or package upgrade. I get (from https://travis-ci.org/github/OpenLightingProject/ola/jobs/755992758 which is a build of https://github.com/OpenLightingProject/ola/pull/1687):


FAIL: testEventLoop (__main__.ClientWrapperTest)

----------------------------------------------------------------------

Traceback (most recent call last):

  File "../../python/ola/ClientWrapperTest.py", line 193, in testEventLoop

    datetime.timedelta(milliseconds=5))

AssertionError: datetime.timedelta(0, 0, 4923) not greater than or equal to datetime.timedelta(0, 0, 5000)

peternewman avatar Feb 19 '21 01:02 peternewman

As @peternewman on https://github.com/OpenLightingProject/ola/pull/1716#issuecomment-782517417, tests are still failing occasionally (at least on ARM) even after #1716 was merged.

Given what DMX timings can be in popular products, it's probably safe to raise the allowed timing delta from 500 µs to 1000 µs (1 ms). I suppose the real question is what resolution we want to guarantee callbacks will run with? Trying to avoid changing tests when perhaps those tests are saying there's a problem.

danielskeenan avatar Feb 20 '21 14:02 danielskeenan