
ZipFileProvider is not thread-safe

Open madnik7 opened this issue 3 years ago • 6 comments

Describe the bug Web servers usually receive multiple concurrent requests, and it looks like ZipFileProvider doesn't support that, which makes it unusable: it randomly fails to open some entries.

I checked the code:

public Stream OpenFile(string providerPath)
            => _zipArchive.GetEntry(providerPath)?.Open() ?? throw new FileNotFoundException($"\"{providerPath}\" cannot be found in Zip archive.");

ZipArchive is not thread-safe, and we cannot keep two entry streams open at the same time. I recommend locking and reading the entire entry into a memory stream. Unfortunately this method consumes a lot of resources, especially if there is a big entry in the zip file.
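For reference, a minimal sketch of that lock-and-copy approach (the class and field names are illustrative, not EmbedIO's actual implementation):

```csharp
using System.IO;
using System.IO.Compression;

public class SafeZipFileProvider
{
    private readonly ZipArchive _zipArchive;
    private readonly object _syncRoot = new object();

    public SafeZipFileProvider(ZipArchive zipArchive) => _zipArchive = zipArchive;

    public Stream OpenFile(string providerPath)
    {
        // Serialize all access to the (non-thread-safe) ZipArchive and hand
        // the caller an independent MemoryStream copy of the entry's data.
        lock (_syncRoot)
        {
            var entry = _zipArchive.GetEntry(providerPath)
                ?? throw new FileNotFoundException($"\"{providerPath}\" cannot be found in Zip archive.");
            var buffer = new MemoryStream();
            using (var entryStream = entry.Open())
                entryStream.CopyTo(buffer);
            buffer.Position = 0;
            return buffer;
        }
    }
}
```

The copy costs memory proportional to the entry size, which is exactly the downside described above.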

madnik7 avatar Oct 10 '20 08:10 madnik7

Thanks @madnik7 for the heads-up!

Another option to explore could be having a "service" thread that keeps the Zip archive open (so the archive directory need not be read again for each request - it is at the end of the file, so even with a MemoryStream re-reading it would probably wreak havoc on caches) and serializes access to entries.

Or maybe we could take both ways and add a property to let application developers choose whether to cache the entire Zip archive in memory. I, for one, would rather leave it on disk in some situations (Raspberry Pi 3, large zip file, infrequent access to web pages) despite the performance hit, but I see how other applications have different requirements.

rdeago avatar Oct 10 '20 09:10 rdeago

I have a lot of experience working with zip archives at runtime. The archive directory (the zip central directory) is not the only problem: once you open a ZipArchive from a stream, GetEntry depends on the current position of the original archive stream, so the next GetEntry completely corrupts the position of the other streams, regardless of the archive directory. One solution is a shared archive directory. Still, even in that case, the archive should not be opened from a stream but from a file, so a new stream can be opened for each entry. By the way, the standard C# ZipArchive does not support this, so there is much work to be done.

You can put GetStream in a critical section and cache the entire entry stream; since it is the only shared method accessed from multiple threads, this should solve the issue, but with big files the extra memory consumption and response delay are inevitable.

For my current project, I decided to compute a hash of the zip stream, extract it to a folder named after that hash value, and use FileSystemProvider. At least the zip file is extracted only once, which is pretty acceptable for my SPA in this project.
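A sketch of that workaround (`ZipExtractionCache` is a hypothetical helper, not the project's actual code; assumes a seekable stream and .NET 5+ for `Convert.ToHexString`):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

public static class ZipExtractionCache
{
    // Extracts the archive once into a folder named after its SHA-256 hash
    // and returns that folder's path, ready to serve with a file-system provider.
    public static string ExtractOnce(Stream zipStream, string cacheRoot)
    {
        using var sha = SHA256.Create();
        // Convert.ToHexString requires .NET 5+; use BitConverter on older targets.
        var hash = Convert.ToHexString(sha.ComputeHash(zipStream));
        var targetDir = Path.Combine(cacheRoot, hash);
        if (!Directory.Exists(targetDir))
        {
            zipStream.Position = 0; // hashing consumed the stream; rewind before reading
            using var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, leaveOpen: true);
            archive.ExtractToDirectory(targetDir);
        }
        return targetDir;
    }
}
```

Since the folder name is derived from the content hash, a changed archive extracts to a new folder while an unchanged one is reused.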

madnik7 avatar Oct 10 '20 09:10 madnik7

Putting GetStream in a critical section would in fact be easier and faster than moving all Zip processing to a separate thread. Thanks for the suggestion!

Of course there is going to be some overhead, especially with large files. The best overhead mitigation strategy depends on application-specific factors, such as which files are accessed more often, cache settings, etc.

Your solution is clever, but I wouldn't implement it as part of ZipFileProvider. If you are willing to share it, maybe as a gist or a wiki page, that would be great!

Thanks for your contribution! Users like you are a great help in keeping up the quality of EmbedIO and open-source software in general.

rdeago avatar Oct 11 '20 16:10 rdeago

Of course extracting the zip file was not a real solution; I just did it to keep my project running :)

Also, if you want to keep the whole zip file in memory, a simple solution is to keep a single copy of the zip stream in a buffer, use one ZipArchive object inside a critical section for entry listing, and create a new ZipArchive object for every GetStream request.
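A sketch of that buffer-plus-per-request-archive idea (`BufferedZipProvider` is an illustrative name, not an EmbedIO type):

```csharp
using System.IO;
using System.IO.Compression;

public class BufferedZipProvider
{
    private readonly byte[] _zipBytes;           // single in-memory copy of the archive
    private readonly ZipArchive _listingArchive; // only touched under the lock, for listing
    private readonly object _syncRoot = new object();

    public BufferedZipProvider(byte[] zipBytes)
    {
        _zipBytes = zipBytes;
        _listingArchive = new ZipArchive(
            new MemoryStream(zipBytes, writable: false), ZipArchiveMode.Read);
    }

    public string[] ListEntries()
    {
        lock (_syncRoot)
        {
            var names = new string[_listingArchive.Entries.Count];
            for (var i = 0; i < names.Length; i++)
                names[i] = _listingArchive.Entries[i].FullName;
            return names;
        }
    }

    // Each request gets its own ZipArchive over its own MemoryStream,
    // so no stream position is shared between concurrent readers.
    public Stream OpenFile(string providerPath)
    {
        var archive = new ZipArchive(
            new MemoryStream(_zipBytes, writable: false), ZipArchiveMode.Read);
        var entry = archive.GetEntry(providerPath)
            ?? throw new FileNotFoundException(providerPath);
        return entry.Open(); // note: the per-request archive stays alive behind this stream
    }
}
```

The byte buffer is shared read-only, so only the small per-request ZipArchive (which re-reads the central directory) is duplicated.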

madnik7 avatar Oct 11 '20 18:10 madnik7

Here are my two cents after having implemented "serving data from zip archives" for a project where users have a data library with many small files, using Embed.IO.

Concurrent archive access does exist, but it is complicated. System.IO.Compression and most other implementations cannot do it out of the box.

As madnik7 said, you can have parallel readers, each with its own ZipArchive instance. But reading the archive's index takes time and a bit of memory. The memory shouldn't be much of a problem: even for large archives with many files, the index is usually below 4 MB.

If you have only a few archives, caching some ZipArchive readers is a good approach. It wasn't for us, as we have ~1000 archives and an unpredictable access pattern, so we ended up implementing a single reader per archive with sequential access.

TL;DR: either have

  • sequential access to an archive
  • multiple, cached readers
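
The "multiple, cached readers" option could look roughly like this (`ZipReaderPool` is a hypothetical helper, not code from the project described above):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;

// A small pool of independent ZipArchive readers over the same file,
// so concurrent requests never share a reader (or its stream position).
public sealed class ZipReaderPool : System.IDisposable
{
    private readonly ConcurrentBag<ZipArchive> _readers = new ConcurrentBag<ZipArchive>();
    private readonly string _path;

    public ZipReaderPool(string path) => _path = path;

    public Stream OpenFile(string entryName)
    {
        // Take an idle reader, or open a fresh one (with its own FileStream).
        if (!_readers.TryTake(out var reader))
            reader = ZipFile.OpenRead(_path);
        try
        {
            var entry = reader.GetEntry(entryName)
                ?? throw new FileNotFoundException(entryName);
            var buffer = new MemoryStream();
            using (var s = entry.Open())
                s.CopyTo(buffer); // copy while we exclusively hold this reader
            buffer.Position = 0;
            return buffer;
        }
        finally
        {
            _readers.Add(reader); // return the reader to the pool for reuse
        }
    }

    public void Dispose()
    {
        while (_readers.TryTake(out var r))
            r.Dispose();
    }
}
```

With many archives and unpredictable access, the pools themselves become the memory problem, which is why the sequential single-reader option was chosen above.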

bdurrer avatar Jan 12 '21 10:01 bdurrer

SharpZip appears to handle thread-safe reading. From the decompile, it looks like SharpZip caches the entries as part of opening the archive.

Using the built-in provider, I was getting "end of central directory record not found" exceptions, which I believe are caused by System.IO.Compression not being thread-safe.

SharpZipFileProvider.zip
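
For comparison, reading an entry through SharpZipLib's ZipFile class could look like this (an illustrative sketch assuming the ICSharpCode.SharpZipLib package; not the code from the attached provider):

```csharp
using System.IO;
using ICSharpCode.SharpZipLib.Zip;

public static class SharpZipReader
{
    public static Stream OpenEntry(string archivePath, string entryName)
    {
        // SharpZipLib's ZipFile reads the central directory up front,
        // so entry lookup doesn't depend on a shared stream position.
        var zipFile = new ZipFile(archivePath);
        var entry = zipFile.GetEntry(entryName)
            ?? throw new FileNotFoundException($"\"{entryName}\" not found in \"{archivePath}\".");
        return zipFile.GetInputStream(entry);
    }
}
```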

jswigart avatar Nov 14 '22 20:11 jswigart