Performance degradation when decompressing files from a 7z archive
I have conducted some tests, and sharpcompress does suffer from performance degradation when reading some 7z files.
For example, suppose a 7z archive contains 100 files. Reading the earlier files, say the first 10, is very fast, but decompression gets slower and slower for files further into the archive.
The pattern seems consistent: the later a file appears in the archive, the slower it is to decompress.
For example, with a 300 MB 7z archive containing 100 files, extracting the 10th file is very quick, but extracting only the 70th file is very slow, even though the two files are the same size.
This problem does not occur in the zip format.
PRs are welcome. I've been away for personal reasons.
The speed degradation described above could be due to 7zip grouping files with "Solid Block Size" to achieve better compression. Files at the end of these blocks cannot be directly decompressed. Earlier files must be decompressed first.
@Nanook Thank you. I used WinRAR to test the same files (using WinRAR to decompress the .7z archives), and WinRAR works normally; it does not have this bug.
It's not necessarily a bug, but rather a trade-off between compression and flexibility. Rar also has a SOLID mode which makes the full archive non-seekable. 7zip's is at least configurable. Just for info :)
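(For reference, the solid block size is fixed when the archive is created; with the 7z command line that's the -ms switch, e.g. -ms=off to disable solid compression, or something like -ms=64m to cap each solid block at 64 MB so only that much ever has to be re-decompressed to reach a given file.)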
@Nanook No, I meant using WinRAR to decompress the .7z files. It has almost no slowdown; WinRAR is fast at unpacking both .zip and .7z files.
Thanks for the info. Perhaps the SharpCompress implementation can be improved. I'll take a look next time I'm in there.
Thank you very much. It looks like this is an old bug; there is detailed data in this post.
post url: https://github.com/adamhathcock/sharpcompress/issues/399
Looking at the cases, it seems to come down to using archive.ExtractAllEntries() vs archive.Entries.Where(entry => !entry.IsDirectory).
Performance might not be great either way, but the latter does a skip to find the entry you try to decompress: the more files in the 7z archive, the more data has to be re-decompressed to get to your file.
archive.ExtractAllEntries() with while (reader.MoveToNextEntry()) { ... }
does move along a bit faster.
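For reference, here is a minimal sketch of the two approaches side by side. It assumes recent SharpCompress versions (SevenZipArchive, ExtractAllEntries, the WriteToDirectory/WriteEntryToDirectory extension methods and ExtractionOptions); exact namespaces may differ between versions, and the paths are placeholders.

using System.Linq;
using SharpCompress.Archives;
using SharpCompress.Archives.SevenZip;
using SharpCompress.Common;
using SharpCompress.Readers;

// Slow on solid archives: each entry is extracted on its own, so the solid
// block is re-decompressed from its start for every later entry.
using (var archive = SevenZipArchive.Open(@"mod.7z"))
{
    foreach (var entry in archive.Entries.Where(e => !e.IsDirectory))
    {
        entry.WriteToDirectory(@"C:\temp",
            new ExtractionOptions { ExtractFullPath = true, Overwrite = true });
    }
}

// Faster: one forward pass over the archive with the reader API.
using (var archive = SevenZipArchive.Open(@"mod.7z"))
using (var reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        if (!reader.Entry.IsDirectory)
        {
            reader.WriteEntryToDirectory(@"C:\temp",
                new ExtractionOptions { ExtractFullPath = true, Overwrite = true });
        }
    }
}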
Not sure whether this is related, but I've been using SevenZipSharp for my 7z needs, which has an issue of being extremely slow when processing solid 7z archives at the file-entry level. However, it also has an ExtractArchive method which works as fast as you might expect.
SevenZipSharp is basically a wrapper around 7z.dll, so I don't know how helpful the comparison is. The DLL is not supported on Linux, however, which is how I found out about this project.
This archive extracts in an instant using SevenZipSharp:
var archive = new SevenZipExtractor(@"mod.7z");
archive.ExtractArchive(@"C:\temp");
The same archive takes nine seconds to extract using SharpCompress.
var archive = SevenZipArchive.Open(@"mod.7z");
archive.WriteToDirectory(@"C:\temp");
I randomly stumbled on this issue. I wrote a similar Golang library for reading .7z archives, so I'm familiar with this particular phenomenon.
The most efficient way to extract a .7z archive is to iterate over the files in the order they're stored in the archive. If you offer some sort of random-access API to the files and implement it naively, then you will get this performance degradation. The problem comes from the fact that in order to read file n in a solid block, you have to read and discard all of the decompressed data for files 0 through n-1. As n increases, you have to read and discard more and more data. This is exacerbated if files near the beginning of the block are quite large and you're only interested in some files near the end. You also can't just seek forward into the compressed stream, as the state machine of the decompression routine(s) will be confused.
So if you implement your "extract everything in the archive" API in terms of your random-access API, i.e. extracting file 0, then file 1, etc., it will have worse performance the larger the archive becomes. Whereas a dedicated "extract everything" API that iterates over the archive in one shot will be quick relative to the size of the archive.
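To put rough numbers on the earlier example (assuming evenly sized files and a single solid block): with a 300 MB archive holding 100 files of about 3 MB each, a naive random-access read of the 70th file decompresses and discards roughly 207 MB before producing the 3 MB you asked for, and extracting all 100 files one at a time this way decompresses roughly 15 GB in total rather than 300 MB.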
I had the exact same problem and fixed it in my library by caching the reader of the decompressed data after reading file n, so that when reading any file > n it would use that cached reader instead of recreating it from scratch. This means that iterating over the files in archive order always results in a cache hit, as there's always a cached reader positioned at the end of file n-1. If I were to sort the files in any way then that would potentially reintroduce the performance degradation; the worst case being to sort the files in the reverse order to that of the archive.
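To illustrate that caching strategy in C# terms (just a sketch of the idea, not SharpCompress's or my library's actual code; CachedBlockReader, openBlock and entrySizes are made-up names):

using System;
using System.IO;

// Sketch of the caching strategy: keep the decompressed stream of a solid
// block alive together with the index of the entry it is positioned at.
class CachedBlockReader
{
    private Stream _stream;   // decompressed view of the solid block, or null
    private int _nextIndex;   // index of the entry the stream is positioned at

    // entrySizes[i] is the uncompressed size of entry i inside the block;
    // openBlock() recreates the decompressed stream from the start of the block.
    public void CopyEntry(int index, long[] entrySizes, Func<Stream> openBlock, Stream destination)
    {
        if (_stream == null || index < _nextIndex)
        {
            // Cache miss: no cached stream, or it is already past the requested
            // entry, so restart decompression from the start of the block.
            if (_stream != null) _stream.Dispose();
            _stream = openBlock();
            _nextIndex = 0;
        }

        // Read and discard everything between the cached position and the entry we want.
        for (; _nextIndex < index; _nextIndex++)
            CopyExactly(_stream, Stream.Null, entrySizes[_nextIndex]);

        // Copy the requested entry; the stream is now positioned at the next entry,
        // so reading entries in archive order is always a cache hit.
        CopyExactly(_stream, destination, entrySizes[index]);
        _nextIndex = index + 1;
    }

    private static void CopyExactly(Stream source, Stream destination, long count)
    {
        var buffer = new byte[81920];
        while (count > 0)
        {
            int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read <= 0) throw new EndOfStreamException();
            destination.Write(buffer, 0, read);
            count -= read;
        }
    }
}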
Hope that helps.
CreateReaderForSolidExtraction and SevenZipReader will access things sequentially in the archive.
7z has the worst of both worlds, and the IArchive interface might not be the best fit for it.
I created PR #750 with an extension method that supports my use case of extracting large .7z files to a new directory. It's super fast compared to WriteToDirectory, but might not be as feature-rich as needed.
Your PR uses a Task.Run, which just puts the work onto a different thread pool. If you want to do that, fine, but that's beyond the scope of this library. True async needs to go all the way down to the Stream.
If you want a different Extract to happen using the Reader, I'd be up for a PR for that.