broken `test_ipc_max_dgram_size` test needs to be reviewed

Open jnpkrn opened this issue 8 years ago • 14 comments

https://travis-ci.org/ClusterLabs/libqb/jobs/178242766#L2722

../../tests/check_ipc.c:1498:F:ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==331264, try==425728

...triggered intermittently, only with clang (3.4), upon an unrelated change.

jnpkrn avatar Nov 23 '16 16:11 jnpkrn

Build image provisioning date and time
Thu Feb  5 15:09:33 UTC 2015

Operating System Details

Distributor ID:	Ubuntu
Description:	Ubuntu 12.04.5 LTS
Release:	12.04
Codename:	precise

Linux Version
3.13.0-29-generic

Cookbooks Version
a68419e https://github.com/travis-ci/travis-cookbooks/tree/a68419e

jnpkrn avatar Nov 23 '16 16:11 jnpkrn

Verified that this is indeed intermittent: the above link now points to a restarted run, which passed. Note that the respective line is now offset by 3; unfortunately I did not grab those 3 extra lines from the failed run while it was still possible. Supposing they were related error messages, they might have shed more light on this.

jnpkrn avatar Nov 23 '16 17:11 jnpkrn

One possibility that is hard to rule out is that parallel matrix builds (e.g., for multiple compilers) share the same /dev/shm path (perhaps the containers are set up like that?), which does not play well in some rare circumstances when similar pseudorandom paths are being accessed...

jnpkrn avatar Nov 23 '16 17:11 jnpkrn

Very odd. I'm not going to worry about it short-term, though it would be useful to know how the test systems are set up. Can we reproduce it with clang ourselves?

chrissie-c avatar Nov 24 '16 11:11 chrissie-c

No cycles to spend on trying to reproduce that, though we are now aware of this tendency in Travis CI, so we'll have at least some clues when/if this recurs.

jnpkrn avatar Nov 24 '16 12:11 jnpkrn

Some archeology:

  • there was once a magic issue 155 of travis-ci/travis-cookbooks, not available anymore (Issues got disabled for that project)
    • Angus once added this as a workaround: https://github.com/ClusterLabs/libqb/commit/ffdc2d519ab7f30e7d16490b37fe2fa08be00c3d#diff-354f30a63fb0907d4ad57269548329e3R6
    • which we dropped once we migrated from the "legacy infrastructure" of Travis CI: https://github.com/ClusterLabs/libqb/commit/7cd90f11dbb24aea513d78148ca5ac4eca888438#diff-354f30a63fb0907d4ad57269548329e3L4
    • there are not many references to this issue 155, but this one is of particular interest: https://github.com/ogrisel/spylearn/blob/b39d5f383dce1b0303860403fb1bd218360e7e21/.travis.yml#L14
      Workaround for Travis issue with POSIX semaphores
  • currently, there is something similar to the "workaround" for the Python environment: https://github.com/travis-ci/travis-cookbooks/blob/64ff883360f3d265b87c072a07f78e9ef0a874fb/cookbooks/travis_python/recipes/devshm.rb#L24 (refers to the cookbook used in the affected run)
    • ...which dates back to this commit: https://github.com/travis-ci/travis-cookbooks/commit/06be5a5139ae9f39c7e5831b6bad9a38d8bd5844#diff-c06d560f7d314b365d34ead2be8824daR23 which may correlate, in timing, with the magic issue 155 at hand

jnpkrn avatar Nov 24 '16 14:11 jnpkrn

One more relevant hit: http://lists.corosync.org/pipermail/discuss/2013-May/002573.html

One quick thing to check is the location of your shared memory
I use travis ci for libqb and travis uses ubuntu vm's and I
know I had to do a workaround for the shared memory location
being moved from /dev/shm to /run/shm.

See: https://github.com/asalkeld/libqb/blob/master/.travis.yml

I'd suggest have a look at the output of:
mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)

df -h | grep shm
tmpfs                    3.9G  2.9M  3.9G   1% /dev/shm

and see if you need to run that workaround. (libqb tries /dev/shm
first).

jnpkrn avatar Nov 24 '16 14:11 jnpkrn

Regarding the relevance to Python implied by the cookbook references above, http://stackoverflow.com/a/30175343 seems to suggest it was meant to solve some kind of issue with the multiprocessing module in Python's standard library.

jnpkrn avatar Nov 24 '16 15:11 jnpkrn

(see also #238)

jnpkrn avatar Nov 28 '16 14:11 jnpkrn

Diagnostic enhancement from #238 shed some more light here:

../../tests/check_ipc.c:1506:F:ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==0x50e00, try==0x67f00, i=28, errno=90

where errno of 90 means EMSGSIZE (Message too long).
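
For reference, a minimal sketch of how such a numeric errno can be turned into readable text in a diagnostic message. The helper name `describe_errno` is hypothetical and is not part of libqb or its test suite; note also that the numeric value of EMSGSIZE is platform-specific (90 is what Linux uses on most architectures), so comparing against the symbolic constant is safer than against the number.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from libqb): format "errno=<n> (<text>)"
 * into buf and return buf, so that it can be used inline in a
 * diagnostic message. */
const char *describe_errno(int err, char *buf, size_t len)
{
    snprintf(buf, len, "errno=%d (%s)", err, strerror(err));
    return buf;
}
```

Calling `describe_errno(EMSGSIZE, buf, sizeof(buf))` on a glibc-based Linux system would be expected to yield the "Message too long" text quoted above.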

One of the possibilities is that some assumption that has held so far (per the previous successful test runs) is actually unreliable in practice, and some factors of the Travis environment just make it easier to expose.

jnpkrn avatar Nov 29 '16 17:11 jnpkrn

Another hit:

init==0x50e00, try==0x67f00, i=40, errno=90

From the diagnostics added so far, it seems that /dev/shm, mounted as tmpfs, is quite small, just 64 MB, which could be the culprit.
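
A check like this can also be done programmatically rather than via `mount` or `df`. The following is a rough sketch using POSIX `statvfs`; the function names `fs_total_bytes` and `fs_avail_bytes` are made up for illustration and this is not the diagnostic code that was actually added to the libqb tests.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper: total size in bytes of the filesystem that
 * contains path, or 0 if statvfs() fails. */
uint64_t fs_total_bytes(const char *path)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return 0;
    return (uint64_t)st.f_frsize * (uint64_t)st.f_blocks;
}

/* Hypothetical helper: bytes available to unprivileged users on the
 * filesystem that contains path, or 0 if statvfs() fails. */
uint64_t fs_avail_bytes(const char *path)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return 0;
    return (uint64_t)st.f_frsize * (uint64_t)st.f_bavail;
}
```

With a 64 MB tmpfs, `fs_total_bytes("/dev/shm")` would be expected to report roughly 64 * 1024 * 1024.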

jnpkrn avatar Dec 12 '16 11:12 jnpkrn

... PR #242 might help regarding this hypothesis.

jnpkrn avatar Dec 12 '16 12:12 jnpkrn

Just got a report of this issue occurring on virtualized s390x:

ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==331264, try==331776

A mere 495M was allocated to /dev/shm.

jnpkrn avatar Apr 06 '17 14:04 jnpkrn

It's testing socket buffers rather than SHM arenas, so it might be a ulimit issue. Odd that it failed there, though, because that's comparing the reported maximum with the size actually allocated!
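
To illustrate what "testing socket buffers" can look like in isolation, here is a rough, standalone sketch that probes the largest datagram an AF_UNIX SOCK_DGRAM socket pair accepts. It is not libqb's implementation: the names `dgram_fits` and `probe_max_dgram` are hypothetical, and the binary search assumes that acceptance is monotonic in the datagram size, which may itself be one of the assumptions that does not always hold in practice.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try to pass one datagram of `size` bytes from
 * tx to rx. Returns 1 if the kernel accepted it, 0 otherwise (errno is
 * left as set by send(), e.g. EMSGSIZE). */
int dgram_fits(int tx, int rx, size_t size)
{
    char sink[1];
    char *buf = calloc(1, size);
    ssize_t sent;
    int saved_errno;

    if (buf == NULL)
        return 0;
    sent = send(tx, buf, size, MSG_DONTWAIT);
    saved_errno = errno;
    free(buf);
    if (sent != (ssize_t)size) {
        errno = saved_errno;
        return 0;
    }
    /* Drain the queued datagram so that the next attempt starts with an
     * empty queue; bytes beyond the tiny buffer are discarded. */
    (void)recv(rx, sink, sizeof(sink), MSG_DONTWAIT);
    return 1;
}

/* Hypothetical helper: largest datagram size in [1, limit] accepted by
 * a fresh socket pair, or 0 if none was accepted or setup failed. */
size_t probe_max_dgram(size_t limit)
{
    int fds[2];
    size_t lo = 0, hi = limit;

    if (limit == 0 || socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0)
        return 0;
    if (dgram_fits(fds[0], fds[1], hi)) {
        lo = hi;
    } else {
        /* Invariant: lo is accepted (or is 0), hi is rejected. */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;

            if (dgram_fits(fds[0], fds[1], mid))
                lo = mid;
            else
                hi = mid;
        }
    }
    close(fds[0]);
    close(fds[1]);
    return lo;
}
```

The probed value could then be compared with what `getsockopt()` reports for `SO_SNDBUF` on the same socket; a mismatch between the two would correspond to the "reported versus actual" comparison mentioned above.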

chrissie-c avatar Apr 11 '17 09:04 chrissie-c