zstd icon indicating copy to clipboard operation
zstd copied to clipboard

"./zstreamtest --newapi" test case fails on Windows

Open eli-schwartz opened this issue 3 years ago • 8 comments

In the Meson WrapDB integration CI, we try to run the build && test on all 3 major OSes. On Windows, one of the tests fails. Logs: https://github.com/mesonbuild/wrapdb/runs/6086663524?check_suite_focus=true

Relevant bits:

3/8 test-zstream-3           FAIL            21.14s   (exit status 3221225477 or signal 3221225349 SIGinvalid)
>>> MALLOC_PERTURB_=124 D:\a\wrapdb\wrapdb\_build\subprojects\zstd-1.5.2\build/meson/tests\zstreamtest.exe --newapi -t1 -T90s --no-big-tests
------------------------------------- 8< -------------------------------------
stderr:
Starting zstream tester (64-bits, 1.5.2)

Seed = 1643


     1/     1    
    34          
    37          
    41          
    45          
    47          
    60          
    78          
    82          
    95          
   102          
   132          
   133          
   134          
   143          
   145          
   150          
   159          
   171          
   196          
   201          
   202          
   209          
   216          
   225          
   249          
   257          
   263          
   267          
   272          
   275          
   287          
   297          
   310          
   313          
   318          
   339          
   348          
   356          
   362          
   374          
   381          
   413          
   414          
   433          
   458          
   484          
   491          
------------------------------------------------------------------------------

It is dying of a strange signal apparently. I've grepped around the source tree, and --newapi is only exercised by the Makefile and by the meson.build, and neither one is getting run on Windows-based CI jobs.

eli-schwartz avatar Apr 27 '22 00:04 eli-schwartz

@eli-schwartz are you able to reproduce the error consistently?

terrelln avatar Apr 27 '22 17:04 terrelln

My only windows test environment is GitHub actions. In that environment I could consistently produce these results while doing debug iteration. (This is the same debug iteration in which I discovered the valgrind issue I was facing etc. in my other PR.)

eli-schwartz avatar Apr 27 '22 18:04 eli-schwartz

It was also reproducible at the end of January (again in the WrapDB github actions CI), as I mentioned this exact problem here: https://github.com/facebook/zstd/pull/3039#issuecomment-1025331166

eli-schwartz avatar Apr 27 '22 18:04 eli-schwartz

I've tried to reproduce it. On macos or Linux, I was unable to reproduce the same issue.

On Windows + Mingw64, I could finally reproduce it. Problem is, it's not reproducible : a problem will happen, at some point, but it's never the same place. Traces are not helpful to debug the issue. There is no way to use something like a sanitizer in this environment.

Suspecting an issue with the multi-threading code currently, as the MT code on windows uses a shim translation layer, not pthread directly. To check this hypothesis, I compile and run the same test, but under Windows + Msys2. The difference is fairly small, almost the same system as mingw64, but msys2 uses a posix compatibility layer, so programs compiled in this mode can invoke pthread directly, without employing zstd's windows translation layer. And sure enough, this test worked flawlessly, several times.

So that's a potential investigation direction.

Also worth noting : under Windows + mingw64, the test fails only when MALLOC_PERTURB_ is set. Otherwise, it completes successfully.

This seems to point at MT + Windows + malloc combination problem.

Cyan4973 avatar Apr 28 '22 03:04 Cyan4973

Note that in the preamble to my CI logs I linked above, it prints messages pointing out that the detected / used compiler is cl.exe (msvc 19.31.31105) and the linker is link.exe -- and of course there too it is not using the msys2 posix compatibility layer.

Thanks for confirming the issue. :) The fact that it happens due to MALLOC_PERTURB_ is interesting... Meson sets this as described at https://mesonbuild.com/Unit-tests.html#malloc_perturb_ and it can be disabled if needed, though catching malloc problems does seem like that feature was useful in exposing an issue?

eli-schwartz avatar Apr 28 '22 03:04 eli-schwartz

Yes, I think MALLOC_PERTURB_ is useful for tests, and it's worthwhile fixing the issues it finds: #3121.

Cyan4973 avatar Apr 28 '22 04:04 Cyan4973

Also worth noting : under Windows + mingw64, the test fails only when MALLOC_PERTURB_ is set. Otherwise, it completes successfully.

I retract that statement. Even without MALLOC_PERTURB_, there are still failures. They just seem to take longer to generate, but they still happen. This seems to point at a more subtle issue in the Windows pthread translation layer.

Cyan4973 avatar Apr 28 '22 16:04 Cyan4973

This was improperly closed. #3288 doesn't fix #3119.

Cyan4973 avatar Oct 15 '22 08:10 Cyan4973

This is properly fixed by #3364 (not yet merged, but passes this test case in CI now).

eli-schwartz avatar Dec 17 '22 23:12 eli-schwartz

Fix has been merged. Thanks!

eli-schwartz avatar Dec 20 '22 00:12 eli-schwartz