GZipped Excel xlsx files choke
Excel's xlsx format is really just a ZIP archive of XML files. If such a file is gzipped, ReaderFactory seems to un-gzip the content and then try to un-zip it as well. As a result, IEntry.Key contains the first bytes of the file's content rather than the name of the file.
To reproduce:
- Create an Excel Workbook file (*.xlsx)
- Gzip it: gzip -k Book1.xlsx
- Read it with ReaderFactory:
using (var stream = File.OpenRead(@"C:\tmp\Book1.xlsx.gz"))
{
    using (var reader = ReaderFactory.Open(stream))
    {
        while (reader.MoveToNextEntry())
        {
            // Dump() is the LINQPad output helper
            reader.Entry.Key.Dump();
        }
    }
}
The Entry.Key value that is dumped is PK ! A7��n [Content_Types].xml �(�
I'd expect the Key to be the file name Book1.xlsx (or at least not the first lines of the file).
I'm open to other suggestions on how it should work, but as it stands I'd have to special-case *.xlsx.gz files, which seems to defeat the purpose of a general ReaderFactory that can handle any of the supported formats you throw at it.
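For reference, the special-casing mentioned above could look roughly like the sketch below. It uses only System.IO.Compression from the BCL, not SharpCompress, and the GzipPeek class and its method names are hypothetical helpers invented for this illustration. It peels off the gzip layer, checks whether the payload starts with the ZIP local file header signature, and derives the inner name by dropping the .gz suffix. It buffers the whole payload in memory, so it is only suitable for small files.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of SharpCompress.
public static class GzipPeek
{
    // ZIP local file header signature: "PK" followed by 0x03 0x04.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Decompresses a gzip stream fully into memory (small files only).
    public static byte[] Gunzip(Stream gzipped)
    {
        using (var gz = new GZipStream(gzipped, CompressionMode.Decompress, true))
        using (var buffer = new MemoryStream())
        {
            gz.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // True if the data begins with the ZIP local file header signature.
    public static bool LooksLikeZip(byte[] data)
    {
        if (data == null || data.Length < ZipSignature.Length) return false;
        for (int i = 0; i < ZipSignature.Length; i++)
        {
            if (data[i] != ZipSignature[i]) return false;
        }
        return true;
    }

    // "Book1.xlsx.gz" -> "Book1.xlsx"; other names are returned unchanged.
    public static string InnerName(string gzPath)
    {
        string name = Path.GetFileName(gzPath);
        return name.EndsWith(".gz", StringComparison.OrdinalIgnoreCase)
            ? name.Substring(0, name.Length - 3)
            : name;
    }
}
```

With something like this, a caller could treat a gzipped payload that looks like a ZIP as a single entry named InnerName(path) instead of letting the factory probe it for further archive formats.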
Update: it looks like the underlying issue is that ReaderFactory's call to TarArchive.IsTarFile returns true for *.xlsx files: https://github.com/adamhathcock/sharpcompress/blob/master/src/SharpCompress/Readers/ReaderFactory.cs#L48
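To illustrate what a stricter tar test could involve, here is a minimal sketch that validates the checksum stored in a 512-byte tar header block. The tar format stores, at offset 148, the octal sum of all header bytes with the checksum field itself counted as spaces. This is not SharpCompress's IsTarFile implementation, and I have not checked which of these tests it already performs; TarHeaderCheck is a hypothetical name used only here.

```csharp
using System;

// Hypothetical helper, not part of SharpCompress.
public static class TarHeaderCheck
{
    private const int BlockSize = 512;
    private const int ChecksumOffset = 148;
    private const int ChecksumLength = 8;
    private const int Space = 0x20;

    // Returns true if the first 512 bytes carry a consistent tar header checksum.
    public static bool HasValidChecksum(byte[] block)
    {
        if (block == null || block.Length < BlockSize) return false;

        // Parse the stored checksum: optional leading spaces, then octal digits.
        int end = ChecksumOffset + ChecksumLength;
        int i = ChecksumOffset;
        while (i < end && block[i] == Space) i++;

        int stored = 0;
        int digits = 0;
        while (i < end && block[i] >= (byte)'0' && block[i] <= (byte)'7')
        {
            stored = stored * 8 + (block[i] - '0');
            i++;
            digits++;
        }
        if (digits == 0) return false; // no octal digits: not a tar header

        // Sum all header bytes, counting the checksum field as spaces.
        int sum = 0;
        for (int j = 0; j < BlockSize; j++)
        {
            bool inChecksumField = j >= ChecksumOffset && j < end;
            sum += inChecksumField ? Space : block[j];
        }
        return sum == stored;
    }
}
```

The first 512 bytes of a decompressed xlsx would be very unlikely to pass this test, since the bytes at offset 148 would have to be octal digits that happen to match the sum of the block.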
@adamhathcock I have been unable to figure out a good way to fix the TarArchive.IsTarFile detection. As part of the proposed fix I moved the IsTarFile check to the end of Open. Inside a compressed stream (gz, bz2, etc.), TarHeader will sometimes accept the content as a tar even when it is not one. What are your thoughts on adding an option to ReaderOptions, like TryOpenArchiveInStream, and then making the Open call recursive on compressed streams? If this is an acceptable solution I am quite happy to create a PR.
public static IReader Open(Stream stream, ReaderOptions options = null)
{
    stream.CheckNotNull("stream");
    options = options ?? new ReaderOptions()
    {
        LeaveStreamOpen = false
    };
    RewindableStream rewindableStream = new RewindableStream(stream);
    rewindableStream.StartRecording();
    if (ZipArchive.IsZipFile(rewindableStream, options.Password))
    {
        rewindableStream.Rewind(true);
        return ZipReader.Open(rewindableStream, options);
    }
    rewindableStream.Rewind(false);
    if (GZipArchive.IsGZipFile(rewindableStream))
    {
        rewindableStream.Rewind(false);
        GZipStream decompressedStream = new GZipStream(rewindableStream, CompressionMode.Decompress);
        // Proposed option: try to open the decompressed content as an archive first
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
        // Otherwise (or if no inner archive was recognised) read it as a plain gzip
        rewindableStream.Rewind(true);
        return GZipReader.Open(rewindableStream, options);
    }
    rewindableStream.Rewind(false);
    if (BZip2Stream.IsBZip2(rewindableStream))
    {
        rewindableStream.Rewind(false);
        BZip2Stream decompressedStream = new BZip2Stream(new NonDisposingStream(rewindableStream), CompressionMode.Decompress, false);
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
    }
    rewindableStream.Rewind(false);
    if (LZipStream.IsLZipFile(rewindableStream))
    {
        rewindableStream.Rewind(false);
        LZipStream decompressedStream = new LZipStream(new NonDisposingStream(rewindableStream), CompressionMode.Decompress);
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
    }
    rewindableStream.Rewind(false);
    if (RarArchive.IsRarFile(rewindableStream, options))
    {
        rewindableStream.Rewind(true);
        return RarReader.Open(rewindableStream, options);
    }
    rewindableStream.Rewind(false);
    if (XZStream.IsXZStream(rewindableStream))
    {
        rewindableStream.Rewind(true);
        XZStream decompressedStream = new XZStream(rewindableStream);
        if (options.TryOpenArchiveInStream)
        {
            try { return Open(decompressedStream, options); }
            catch (InvalidOperationException) { }
        }
    }
    rewindableStream.Rewind(false);
    // The tar check now comes last, after every other format has been ruled out
    if (TarArchive.IsTarFile(rewindableStream))
    {
        rewindableStream.Rewind(true);
        return TarReader.Open(rewindableStream, options);
    }
    throw new InvalidOperationException("Cannot determine compressed stream type. Supported Reader Formats: Zip, GZip, BZip2, Tar, Rar, LZip, XZ");
}
I have a ZIP file that worked with version 0.40 but no longer works with 0.41.
I'm sorry, but the file contains customer data so I can't upload it. I tried to create my own file that reproduces the problem but wasn't able to. Something must be special about this file. If you can give me a hint on how to extract some helpful information from the zip file, I can add it to this ticket.
In the meantime I'm stuck on version 0.40.
@MartinDemberger is it a gzipped Excel workbook? If not, start a new issue. Note that in this issue there is no "the zip file" involved. There's just a *.xlsx (which itself is a zipfile) that is gzipped (not zipped).
Sorry, you are right. It's a zip file with two Excel files inside.
There is a known issue with nested zips (which xlsx files are) and Reader. However, something breaking between versions is new.
I will take a look now at the root cause of this issue, but nested archives with Reader are usually a problem because I find headers I don't expect.
Ironically, I think 0.41.0 fixes the first issue.