ZstdNet
Shared Compressor fails in multithreaded code
I compiled sources of ZstdNet for FW 4.8. Plain usage:
var compressor = new Compressor();
data = File.ReadAllBytes(f.Filename);
ComprData = compressor.Wrap(data);
...and immediately fails:
System.AccessViolationException HResult=0x80004003
Message=Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Source=<Cannot evaluate the exception source>
StackTrace:
ZStdLibFW.dll!ZstdNet.Compressor.Wrap(System.ReadOnlySpan<byte> src, System.Span<byte> dst) Line 90 C#
ZStdLibFW.dll!ZstdNet.Compressor.Wrap(System.ReadOnlySpan<byte> src) Line 56 C#
ZStdLibFW.dll!ZstdNet.Compressor.Wrap(byte[] src) Line 43 C#
It happens in Compressor.cs at line 90:
var dstSize = Options.AdvancedParams != null
    ? ExternMethods.ZSTD_compress2(cctx, dst, (size_t)dst.Length, src, (size_t)src.Length)
    : Options.Cdict == IntPtr.Zero
        ? ExternMethods.ZSTD_compressCCtx(cctx, dst, (size_t)dst.Length, src, (size_t)src.Length, Options.CompressionLevel)
        : ExternMethods.ZSTD_compress_usingCDict(cctx, dst, (size_t)dst.Length, src, (size_t)src.Length, Options.Cdict);
Any idea why memory management fails?
@LexaGV Could you please provide a more complete minimal example that reproduces the problem? Is the problem still reproducible with the version from NuGet? Is it still reproducible with another file (e.g. an empty one)?
Also the following information may be useful:
- Sources commit revision
- What changes are made
- Build Target
- Debug/Release
- x86/x64
- Libzstd version
In general, the AccessViolationException may indicate that the Compressor/CompressionOptions was disposed before use (but that doesn't seem to be the case for the snippet above)
Nope, it's not the NuGet package - I compiled my own .NET Framework 4.8 version. Seems I found what the problem is. Full example:
var compressor = new Compressor();
Parallel.ForEach(Disk.EnumFiles(@"C:\DEV\", ".vs", ".git", ".hg", "bin", "obj", "bin_Debug", "bin_Release"), ff => {
if (!ff.EndsWith(".cs")) return;
var data = File.ReadAllBytes(ff);
var comprData = compressor.Wrap(data);
Log.Debug("Compr len=" + comprData.Length);
});
THIS example fails with the memory problem described above. But if you move var compressor = new Compressor(); inside the loop, just before var comprData = compressor.Wrap(data);, everything works smoothly. Which makes me think Compressor cannot be shared across threads and is not re-entrant. I know it's not a huge problem to create a compressor every time you need one, but I thought that on a HUGE list of files a shared Compressor could improve performance.
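For reference, a minimal sketch of the working variant described above, with a fresh Compressor created (and disposed) per file; Disk.EnumFiles and Log.Debug are helpers from the original snippet:

```csharp
// Per-file Compressor: each iteration owns its own instance,
// so no unmanaged compression context is shared across threads.
Parallel.ForEach(Disk.EnumFiles(@"C:\DEV\", ".vs", ".git", ".hg", "bin", "obj", "bin_Debug", "bin_Release"), ff =>
{
    if (!ff.EndsWith(".cs")) return;
    var data = File.ReadAllBytes(ff);
    using (var compressor = new Compressor())   // fresh instance per file
    {
        var comprData = compressor.Wrap(data);
        Log.Debug("Compr len=" + comprData.Length);
    }
});
```

The trade-off is exactly the one raised here: the code is safe, but every iteration pays for allocating and freeing an unmanaged context.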
Instances of the Compressor class are not thread safe (as stated in the readme). If you want to optimize bulk processing performance you can use, for example:
- A single Compressor instance with lock
- ThreadStatic or ThreadLocal<T> instances
- A ConcurrentBag of instances with SemaphoreSlim
- Object pooling
- etc.
Here is an example with ObjectPool<T>
// Read the docs: https://learn.microsoft.com/en-us/aspnet/core/performance/objectpool?view=aspnetcore-8.0
int parallelism = 4;
var pool = new DefaultObjectPool<Compressor>(new DefaultPooledObjectPolicy<Compressor>(), parallelism);
Parallel.ForEach(Directory.GetFiles(AppDomain.CurrentDomain.BaseDirectory, "*.cs"), new ParallelOptions {MaxDegreeOfParallelism = parallelism}, file =>
{
var data = File.ReadAllBytes(file);
var compressor = pool.Get();
try
{
var compressed = compressor.Wrap(data);
Console.WriteLine(compressed.Length);
}
finally
{
pool.Return(compressor);
}
});
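Of the options listed above, ThreadLocal<T> is another workable shape: one long-lived Compressor per worker thread, no locking and no pool bookkeeping. A minimal sketch (assuming the same ZstdNet API as in the pool example; trackAllValues lets us dispose every per-thread instance at the end):

```csharp
// One Compressor per thread, created lazily on first use by that thread.
var local = new ThreadLocal<Compressor>(() => new Compressor(), trackAllValues: true);

Parallel.ForEach(Directory.GetFiles(AppDomain.CurrentDomain.BaseDirectory, "*.cs"), file =>
{
    var data = File.ReadAllBytes(file);
    var compressed = local.Value.Wrap(data);    // this thread's private instance
    Console.WriteLine(compressed.Length);
});

// Dispose every per-thread instance once the parallel loop is done.
foreach (var c in local.Values) c.Dispose();
local.Dispose();
```

Compared to the pool, this never shares an instance between threads at all, at the cost of keeping one Compressor alive per thread for the whole run.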
Dmitry, thanks for such a descriptive and helpful answer! The one thing I don't like is "dancing around" the class. Is it possible to make Compressor itself reentrant? I'm not sure what state you need to keep once you compress data. All necessary variables could be declared locally, so Compressor would be a static class and all you'd need is to call "SqueezeMyData()"! :)
Adding another simplified API is a possible option. But it would not be as efficient, because it would require creating and freeing an unmanaged context on each Wrap call. Instances of the Compressor class allow reusing an unmanaged context. See also the native zstd docs:
When compressing many times, it is recommended to allocate a context just once, and reuse it for each successive compression operation. This will make workload friendlier for system's memory.
Note : re-using context is just a speed / resource optimization, it doesn't change the compression ratio, which remains identical.
Note 2 : In multi-threaded environments, use one different context per thread for parallel execution.
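A static "SqueezeMyData"-style entry point like the one suggested above could still reuse contexts internally by hiding a pool behind it. A hypothetical sketch (Zstd and Squeeze are made-up names, not part of ZstdNet):

```csharp
// Hypothetical static wrapper: exposes the "just call one method" API,
// while reusing Compressor instances (and thus unmanaged contexts)
// through a thread-safe bag under the hood.
public static class Zstd
{
    private static readonly ConcurrentBag<Compressor> Pool = new ConcurrentBag<Compressor>();

    public static byte[] Squeeze(byte[] src)
    {
        if (!Pool.TryTake(out var compressor))
            compressor = new Compressor();      // grow the pool on demand
        try
        {
            return compressor.Wrap(src);
        }
        finally
        {
            Pool.Add(compressor);               // return the instance for reuse
        }
    }
}
```

This keeps the per-thread-context advice from the zstd docs intact: a Compressor taken from the bag is used by exactly one thread until it is returned.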
Dmitriy, thanks for the explanation! I'm glad to chat with such a qualified engineer with such deep knowledge. Then your solution is ideal, which is rare for FOSS. :) For my peace of mind I ran a few tests - compressing the *.cs sources of FW. Sources were on a RAM drive (PF = Parallel.ForEach):
Default compression:
Pool+PF: Spent=00:00:00.627 on 18296 files, Avg ratio = 0.1898
PF: Spent=00:00:01.425 on 18296 files, Avg ratio = 0.1898
foreach: Spent=00:00:02.755 on 18296 files, Avg ratio = 0.1898
Max compression:
Pool+PF: Spent=00:00:30.078 on 18296 files, Avg ratio = 0.159
PF: Spent=00:00:30.275 on 18295 files, Avg ratio = 0.159
foreach: Spent=00:02:12.764 on 18296 files, Avg ratio = 0.159
(Looks like the pool gains nothing when each thread works long enough. And max compression is almost identical to the default one, but takes much more resources!)
Perfect result! The issue can be closed. And I highly recommend including the Pool sample in the README - it's a huge help for novices! Thanks for all your work and time!