Strange .tbz seemingly only gets its first block extracted

Open mihelm opened this issue 1 year ago • 1 comments

I've stumbled upon a very strange .tbz archive:

https://macromates.com

For some reason, BZip2.decompress(data: data) only spits out the first block (at least I think that's what's happening, since the "decompressed" data comes out at 900 KB). Naturally, TarContainer can't do anything with that.

Archive Utility can open the downloaded file. And when I create a .tbz from the expanded app with no funny settings, the resulting archive works just fine with BZip2.decompress(data: data) — and my archive is even a little smaller than the offered download.

I have no idea what they did there. Maybe an interesting edge case to play around with...

Cheers, michael.

Oct 21 '24 14:10 mihelm

Thank you for your report! Sorry for the very slow response. I actually investigated the issue when you reported it and found the root cause, but I never decided what the proper solution should be, and then I got distracted...

Anyway, the BZip2 file you provided is actually multiple BZip2 files concatenated together. So the result you're getting from SWCompression is not the first block, but the first of those files. SWCompression does not expect this kind of concatenated input, but since Archive Utility handles it, I suppose SWCompression should too. The only question is how.
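To illustrate what "multiple BZip2 files concatenated together" means in practice, here is a minimal sketch using Python's standard `bz2` module (not SWCompression, whose internals may differ). The sample payloads are made up for the illustration. A decoder that handles exactly one stream stops at the end of the first one, which matches the behaviour described above:

```python
import bz2

# Two independent bzip2 streams, written back to back into one buffer.
first = bz2.compress(b"first file\n")
second = bz2.compress(b"second file\n")
concatenated = first + second

# A single BZ2Decompressor handles exactly one stream and then stops.
single = bz2.BZ2Decompressor()
only_first = single.decompress(concatenated)
leftover = single.unused_data  # raw bytes that follow the first stream

# bz2.decompress() keeps going until every stream has been consumed.
everything = bz2.decompress(concatenated)
```

Here `only_first` holds just the first payload and `leftover` is the untouched second stream, while `everything` contains both payloads joined together.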

I believe the situation is quite similar to the GZip case, where a file can also consist of multiple GZip archives ("members", as I think the specification calls them). SWCompression has a special function for such files, GzipArchive.multiUnarchive(archive:), while the normal GzipArchive.unarchive(archive:) function processes only the first member of the archive, which is exactly what BZip2.decompress does in your case.
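The difference between "first member only" and "all members" can be sketched with Python's standard `zlib` and `gzip` modules. This is only an illustration of the general technique, not SWCompression's implementation, and `split_gzip_members` is a hypothetical helper name:

```python
import gzip
import zlib


def split_gzip_members(data: bytes) -> list[bytes]:
    """Decompress each gzip member separately and return the payloads."""
    payloads = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=31)
        payloads.append(member.decompress(data) + member.flush())
        data = member.unused_data  # whatever follows this member
    return payloads


archive = gzip.compress(b"alpha") + gzip.compress(b"beta")
members = split_gzip_members(archive)
first_only = members[0]                # what a single-member reader returns
all_joined = gzip.decompress(archive)  # gzip.decompress reads every member
```

A reader that stops after `members[0]` silently drops the rest of the file, which is why the caller has to know in advance that the input is multi-member.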

So initially I thought about adding a similar separate function for BZip2, but I realized that it wouldn't really have helped in your case. Even if it had already existed, you wouldn't have known that you needed it, because nothing tells you that you are dealing with several BZip2 archives combined together. At the same time, changing the default behavior of BZip2.decompress so drastically just to handle these files doesn't seem right either... So this is where I got stuck at the time.

As a side note, I believed (and maybe still do) that in the GZip case these "multi-member" archives are exceedingly rare, and that if you have one in your hands, you know about its nature and understand that you have to use a special SWCompression function to process it. I am not so sure this assumption holds in the BZip2 case.

As a second side note, I also do not understand why someone would create these "multi-member" GZip archives or multi-archive BZip2 files. Each member/archive adds its own metadata to the compressed output, which at least in principle defeats the entire purpose of compression (or reduces its efficiency, to be less dramatic).
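The per-archive overhead can be measured roughly with a small experiment. The sketch below again uses Python's standard `bz2` module and a made-up, highly repetitive payload, so the exact numbers will differ for real data such as the archive from this issue. It compresses the same bytes once as a single stream and once as four concatenated streams:

```python
import bz2

# Size of a stream with no data at all: pure framing (header and trailer).
EMPTY_STREAM_SIZE = len(bz2.compress(b""))

# Made-up sample data, split into four equal chunks.
payload = b"The quick brown fox jumps over the lazy dog.\n" * 4000
quarter = len(payload) // 4
chunks = [payload[i:i + quarter] for i in range(0, len(payload), quarter)]

single_stream = bz2.compress(payload)
multi_stream = b"".join(bz2.compress(chunk) for chunk in chunks)

print("uncompressed:", len(payload))
print("one stream:  ", len(single_stream))
print("four streams:", len(multi_stream))
print("empty stream:", EMPTY_STREAM_SIZE)
```

Every extra stream repeats its own header, coding tables and checksum, so the concatenated version comes out larger for this input, although how much that matters depends on how large each stream is.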

Feb 12 '25 16:02 tsolomko