zstd icon indicating copy to clipboard operation
zstd copied to clipboard

[Non-repro][Windows 10]:Crash in ZSTD_pthread_cond_broadcast/WakeAllConditionVariable on freeing multi-threaded compression context

Open vnair81 opened this issue 5 years ago • 8 comments

Our application uses multi-threaded compression with default compression level (3) and sets nbWorkers to number of logical cores on the machine. We create the compression context on first use and reuse it throughout the session and free it on termination. All our compression calls are issued from the main thread only.

We're getting crash reports from the field that some users are encountering a crash in the main thread during this cleanup inside the function POOL_join, in the second call to ZSTD_pthread_cond_broadcast (WakeAllConditionVariable on Windows). We haven't been able to find steps to reproduce this problem.

Crash stack from the dump looks like the following:

ntdll.dll!00007ffcd1b6cc37()	Unknown
[Inline Frame] AppName.exe!POOL_join(POOL_ctx_s *) Line 168	C
AppName.exe!POOL_free(POOL_ctx_s * ctx) Line 178	C
AppName.exe!ZSTDMT_freeCCtx(ZSTDMT_CCtx_s * mtctx) Line 966	C
AppName.exe!ZSTD_freeCCtxContent(ZSTD_CCtx_s * cctx) Line 138	C
AppName.exe!ZSTD_freeCCtx(ZSTD_CCtx_s * cctx) Line 149	C

We can't share the dump files due to legal and privacy concerns but following are the debugger views of the source and disassembly of the main and pool thread.

Main Thread: MainThread

Main Thread Disassembly: MainThreadAssembly

Pool Thread: PoolThread

Pool Thread Disassembly: PoolThreadAssembly

We are on zstd release 1.4.4 and using Visual Studio 2017 (15.9) on Windows and building for x86_64.

An observation that has been true for all dumps we've examined so far, is that some of the pool threads seem to have exited by the time the crash happens. For example, in the above dump, nbWorkers was set as 4 but at the time of the crash only one pool thread was still active. Rest 3 had exited.

Any insights on how to resolve this would be highly appreciated. Thanks.

vnair81 avatar Aug 10 '20 13:08 vnair81

I suspect this is a Windows specific issue, though it's not obvious what is happening.

terrelln avatar Aug 10 '20 17:08 terrelln

One thing to try is to compile zstd with asserts enabled: by defining DEBUGLEVEL to 1 (or higher to get progressively more logging).

terrelln avatar Aug 10 '20 17:08 terrelln

@terrelln if we could reproduce this issue at our end, increasing logging would have helped. However we're only getting this from the field. We do have DEBUGLEVEL set as 1 for our debug builds, but have never encountered this in-house.

Also in this case number of crashes is almost equal to number of users affected, so hardly ever does the same user encounters (or at least reports) this twice.

And yes this does appear to be a Windows specific issue. We have not encountered any similar report on Mac.

vnair81 avatar Aug 10 '20 18:08 vnair81

@vnair81 do you know what Visual Studios version is being used to compile, and what versions of Windows this error is occurring on (and any other potential variables)?

terrelln avatar Aug 10 '20 18:08 terrelln

Visual Studio 2017 (version 15.9.15) was used for building this. OS on which crash occurred was Windows 10 Pro (version 10.0.18363).

I have the full memory dump available, so we can examine any variable which is not optimized away (/O2 optimization was used).

ctx from POOL_join is as follows: MainThreadAutos

@terrelln does this help?

vnair81 avatar Aug 11 '20 05:08 vnair81

Not sure if it is relevant, but we previously found an issue (#1830) where the mutex and condition variable were being overwritten by a memset call.

Though that doesn't look to be the case here.

vnair81 avatar Aug 11 '20 06:08 vnair81

Was this fixed by #3051?

terrelln avatar Dec 16 '22 01:12 terrelln

Was this fixed by #3051?

I don't think so, as this looks like a memory access violation and not resource starvation. More likely it's the bug(s) that's being fixed in #3364. We won't be able to say for sure unless we have a reproducer.

yoniko avatar Dec 17 '22 03:12 yoniko

It seems like this could be fixed by our PR @yoniko, but it is hard to say.

@vnair81 We think we may have fixed the issue, please let us know if this is still happening on the latest dev branch.

terrelln avatar Dec 22 '22 02:12 terrelln

@terrelln @yoniko unfortunately we were not able to reproduce this issue in-house.

We haven't heard about or seen evidence of this issue's occurrence at customer's end in recent times, but I am not sure we can conclude that the issue is fixed based on that.

We're on v1.5.2 and would update with any newer release. BTW is a new release planned soon? It's almost been a year since the last one.

vnair81 avatar Dec 22 '22 04:12 vnair81

We are preparing a new release. No date set yet.

Cyan4973 avatar Dec 22 '22 17:12 Cyan4973

Thank you for reporting @vnair81. Closing the issue, please reopen if there are any further questions.

yoniko avatar Jan 10 '23 05:01 yoniko