zstd
"./zstreamtest --newapi" test case fails on Windows
In the Meson WrapDB integration CI, we try to run the build && test on all 3 major OSes. On Windows, one of the tests fails. Logs: https://github.com/mesonbuild/wrapdb/runs/6086663524?check_suite_focus=true
Relevant bits:
3/8 test-zstream-3 FAIL 21.14s (exit status 3221225477 or signal 3221225349 SIGinvalid)
>>> MALLOC_PERTURB_=124 D:\a\wrapdb\wrapdb\_build\subprojects\zstd-1.5.2\build/meson/tests\zstreamtest.exe --newapi -t1 -T90s --no-big-tests
------------------------------------- 8< -------------------------------------
stderr:
Starting zstream tester (64-bits, 1.5.2)
Seed = 1643
1/ 1
34
37
41
45
47
60
78
82
95
102
132
133
134
143
145
150
159
171
196
201
202
209
216
225
249
257
263
267
272
275
287
297
310
313
318
339
348
356
362
374
381
413
414
433
458
484
491
------------------------------------------------------------------------------
It is dying of a strange signal apparently. I've grepped around the source tree, and --newapi is only exercised by the Makefile and by the meson.build, and neither one is getting run on Windows-based CI jobs.
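For reference, the exit status in the log can be decoded. The sketch below (the helper name `describe_exit_status` and the small lookup table are made up for illustration) shows that 3221225477 is 0xC0000005, which Windows defines as STATUS_ACCESS_VIOLATION, so the "strange signal" is most likely an ordinary crash on an invalid memory access. The second number in the log is exactly the exit status minus 128; I am assuming that is how the test runner derives a candidate signal number, and it carries no meaning on Windows.

```python
# Small illustrative subset of NTSTATUS codes; not a complete table.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exit_status(status: int) -> str:
    """Format a Windows process exit status as hex, with a name if known."""
    status &= 0xFFFFFFFF  # normalize signed values to 32-bit unsigned
    name = NTSTATUS_NAMES.get(status, "unknown")
    return f"0x{status:08X} ({name})"


print(describe_exit_status(3221225477))  # 0xC0000005 (STATUS_ACCESS_VIOLATION)
print(3221225477 - 3221225349)           # 128
```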
@eli-schwartz are you able to reproduce the error consistently?
My only windows test environment is GitHub actions. In that environment I could consistently produce these results while doing debug iteration. (This is the same debug iteration in which I discovered the valgrind issue I was facing etc. in my other PR.)
It was also reproducible at the end of January (again in the WrapDB github actions CI), as I mentioned this exact problem here: https://github.com/facebook/zstd/pull/3039#issuecomment-1025331166
I've tried to reproduce it.
On macOS or Linux, I was unable to reproduce the issue.
On Windows + mingw64, I could finally reproduce it. The problem is that it's not deterministic: a failure will happen at some point, but never in the same place. Traces are not helpful for debugging the issue, and there is no way to use something like a sanitizer in this environment.
I currently suspect an issue with the multi-threading code, as the MT code on Windows uses a shim translation layer rather than pthread directly.
To check this hypothesis, I compiled and ran the same test under Windows + msys2.
The difference is fairly small, as it is almost the same system as mingw64, but msys2 uses a POSIX compatibility layer, so programs compiled in this mode can invoke pthread directly, without going through zstd's Windows translation layer.
And sure enough, this test worked flawlessly, several times.
So that's a potential investigation direction.
Also worth noting:
under Windows + mingw64, the test fails only when MALLOC_PERTURB_ is set. Otherwise, it completes successfully.
This seems to point at a problem with the combination of MT, Windows and malloc.
Note that in the preamble to my CI logs I linked above, it prints messages pointing out that the detected / used compiler is cl.exe (msvc 19.31.31105) and the linker is link.exe -- and of course there too it is not using the msys2 posix compatibility layer.
Thanks for confirming the issue. :) The fact that it happens due to MALLOC_PERTURB_ is interesting... Meson sets this as described at https://mesonbuild.com/Unit-tests.html#malloc_perturb_ and it can be disabled if needed, though since the point of the feature is to catch malloc problems, it does seem to have been useful in exposing an issue here?
Yes, I think MALLOC_PERTURB_ is useful for tests, and it's worthwhile fixing the issues it finds: #3121.
"Also worth noting: under Windows + mingw64, the test fails only when MALLOC_PERTURB_ is set. Otherwise, it completes successfully."
I retract that statement.
Even without MALLOC_PERTURB_, there are still failures. They just seem to take longer to generate, but they still happen.
This seems to point at a more subtle issue in the Windows pthread translation layer.
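Since the failure is intermittent and takes longer to show up without MALLOC_PERTURB_, one way to compare configurations is to run the test many times and tally the exit statuses. This is only a rough sketch: `run_repeatedly` is a hypothetical helper, and the command shown is a placeholder that would need to be replaced with the actual zstreamtest invocation from the log.

```python
import subprocess
import sys
from collections import Counter


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times and count how often each exit status occurs."""
    counts = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        counts[result.returncode] += 1
    return counts


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real test binary
    # and its arguments here.
    cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(run_repeatedly(cmd, runs=5))
```

A tally with more than one key (for example 0 alongside a crash status) would confirm that the failure is non-deterministic under that configuration, and the ratio gives a rough idea of how often it occurs.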
This was improperly closed. #3288 doesn't fix #3119.
This is properly fixed by #3364 (not yet merged, but passes this test case in CI now).
Fix has been merged. Thanks!