test_stack fail on linux/arm64 (clang)
Build: https://travis-ci.org/github/ivmai/libatomic_ops/jobs/670012675 Source: master (https://github.com/ivmai/libatomic_ops/commit/0cb237483150405296e94ac0fffb85b39f19035b) Compiler: clang version 3.8.0-2ubuntu4 Target: Linux/arm64 (aarch64-unknown-linux-gnu) How to reproduce: configure --enable-assertions && make -j check CFLAGS_EXTRA="-O3" CC=clang Reproduction rate: ~1/20
Not reproduced locally with clang-12 (RPi4, Ubuntu 21.04/arm64) on latest master (1668e46c2), ~100 runs.
Reproduced on Travis CI (latest master, commit 1668e46c2): https://app.travis-ci.com/github/ivmai/libatomic_ops/jobs/554045328 (1st test run)
Compiler: clang version 3.8.0-2ubuntu4 Config: CONF_OPTIONS="--enable-assertions"
Output (test_stack): Missing list element <...>
Not reproduced w/o assertions or w/o (-O2 or -O3). Not reproduced with gcc. Not reproduced if AO_STACK_IS_LOCK_FREE.
Reproduced on latest master (401f7b773) Compiler: clang version 3.8.1-24 (tags/RELEASE_381/final) Target: Linux/arm64 (aarch64-unknown-linux-gnu) Host: gcc117 (farm) Occurrence: ~1/45
Build: https://app.travis-ci.com/github/ivmai/libatomic_ops/jobs/559848368 Source: master (6be0bc83c) Target: aarch64-unknown-linux-gnu Compiler: clang version 3.8.0-2ubuntu4 Config: CFLAGS_EXTRA="-O3 -D N_EXPERIMENTS=10" CONF_OPTIONS="--enable-assertions" Output (test_stack): Failed - nothing to pop Use almost-lock-free implementation
Reproduced on gcc117 (farm). Source: master (6be0bc83c20389e56c679e1f20927ee9d6da420d) Compiler: clang version 3.8.1-24 (tags/RELEASE_381/final) Target: Linux/arm64 (aarch64-unknown-linux-gnu) How to: ./configure --enable-assertions && make -j check CFLAGS_EXTRA="-D N_EXPERIMENTS=2" CC=clang Occurrence: ~1/10
Also reproduced with: clang -O3 -D N_EXPERIMENTS=2 -I src tests/test_stack.c src/*.c -lpthread (if make -j check is running repeatedly in parallel)
Output (test_stack): Use almost-lock-free implementation Missing list element <...>
Also, maybe the existing fix is not correct: 7ce7da4fe
I've reviewed the code in test_stack.c - I don't see any bug with the test code.
After preprocessing of atomic_ops_stack.c, the code looks like:
void AO_stack_push_explicit_aux_release(volatile AO_t *list, AO_t *x, AO_stack_aux *a)
{
AO_t x_bits = (AO_t)x;
AO_t next;
retry:
{
AO_t entry1 = AO_load(&a->AO_stack_bl[0]);
AO_t entry2 = AO_load(&a->AO_stack_bl[1]);
if (entry1 == x_bits || entry2 == x_bits) {
++x_bits;
if ((x_bits & 7) == 0)
x_bits = (AO_t)x;
goto retry;
}
}
do {
next = AO_load(list);
*x = next;
} while (!AO_compare_and_swap_release(list, next, x_bits));
}
AO_t *AO_stack_pop_explicit_aux_acquire(volatile AO_t *list, AO_stack_aux *a)
{
unsigned i;
int j = 0;
AO_t first;
AO_t *first_ptr;
AO_t next;
retry:
first = AO_load(list);
if (0 == first) return 0;
for (i = 0;;) {
if (AO_compare_and_swap_acquire(a->AO_stack_bl+i, 0, first))
break;
if (++i >= 2) {
i = 0;
AO_pause(++j);
}
}
assert(a->AO_stack_bl[i] == first);
if (first != AO_load(list)) { /* or AO_load_acquire as WA for powerpc */
AO_store_release(a->AO_stack_bl+i, 0);
goto retry;
}
first_ptr = (AO_t *)(first & ~7);
next = AO_load(first_ptr);
if (!AO_compare_and_swap_release(list, first, next)) {
AO_store_release(a->AO_stack_bl+i, 0);
goto retry;
}
assert(*list != first);
AO_store_release(a->AO_stack_bl+i, 0);
return first_ptr;
}
Hello @hboehm, Could you please give some hints regarding this test failure? It relates to almost-non-blocking implementation. I don't see any issue in test_stack.c code itself.
Some questions (based on the preprocessed code above):
- stack_push: AO_t entry1 = AO_load(&a->AO_stack_bl[0]): is no acq barrier OK?
- stack_push: while (!AO_compare_and_swap_release(list, next, x_bits)): isn't it better to goto retry on CAS fail?
- stack_push: do { ... : what would happen if a long pause occurs between ensuring entry
!= x_bits and next = AO_load(list)? - stack_pop: first = AO_load(list): is no acq barrier OK?
- Once I saw violation of assert(*list != first): is this possible provided the AO primitives are correct?
PS. I haven't looked into the generated code yet (unfortunately I managed to reproduced it only with clang 3.8).
if (AO_EXPECT_FALSE(first != AO_load_acquire(list))) /* Workaround test failure on AIX, at least, by */
Removing this W/A (AO_load_acquire -> AO_load) in AO_stack_pop_explicit_aux_acquire leads to almost 100% violation in assert(*list != first).
No test failure after W/A commit c664a61 But is there any better fix?