commons-compress
COMPRESS-514: SevenZFile header buffers over 2G
This is a simple way to enable reading of SevenZFile headers over 2GiB. It also allows a much smaller memory footprint for even the largest headers. The CRC handling needs work/input, because it's only supported for headers held entirely in memory.
When the complete header is not held entirely in memory, that leaves several options - for example, reading the header in two passes, reading/parsing the header fully while computing the CRC (which somewhat defeats the purpose), or simply ignoring it.
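To make the channel-backed idea concrete, here is a minimal sketch (the `ChannelHeaderReader` class and its method are illustrative assumptions, not code from this PR): a Java `byte[]` is indexed by `int`, so a header larger than roughly 2 GiB cannot be held in a single in-memory array, but arbitrary slices of it can be read on demand from the `SeekableByteChannel`, which also keeps the memory footprint small.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;

// Illustrative only - not the PR's actual code.
final class ChannelHeaderReader {
    private final SeekableByteChannel channel;
    private final long headerStart; // absolute offset of the header within the archive

    ChannelHeaderReader(SeekableByteChannel channel, long headerStart) {
        this.channel = channel;
        this.headerStart = headerStart;
    }

    /** Reads a little-endian 64-bit value at the given offset inside the header. */
    long readUInt64At(long offsetInHeader) throws IOException {
        ByteBuffer slice = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        channel.position(headerStart + offsetInHeader);
        while (slice.hasRemaining()) {
            if (channel.read(slice) < 0) {
                throw new EOFException("7z header truncated");
            }
        }
        slice.flip();
        return slice.getLong();
    }
}
```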
Coverage decreased (-0.08%) to 87.162% when pulling 54be6c4f5ae59b0d6174f36de6500496cdf5ea99 on akelday:COMPRESS-514-SevenZFile into a5ccbd6c55f8df73aa5c13f48aed8d363c722c3a on apache:master.
What's preventing us from implementing the CRC part in `HeaderChannelBuffer`? Can we update the CRC while reading from the channel? You can refer to `CRC32VerifyingInputStream`, which has a similar implementation.
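For illustration, a rough sketch of updating a CRC32 while reading from the channel, in the spirit of `CRC32VerifyingInputStream` (the wrapper class below is hypothetical, not part of this PR):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.zip.CRC32;

// Hypothetical sketch: fold bytes into a running CRC as they are read from the channel.
final class Crc32UpdatingChannel {
    private final ReadableByteChannel channel;
    private final CRC32 crc = new CRC32();

    Crc32UpdatingChannel(ReadableByteChannel channel) {
        this.channel = channel;
    }

    /** Reads into dst and updates the running CRC with the bytes just read. */
    int read(ByteBuffer dst) throws IOException {
        int positionBefore = dst.position();
        int bytesRead = channel.read(dst);
        if (bytesRead > 0) {
            // Re-slice exactly the region that was just filled and feed it to the CRC.
            ByteBuffer justRead = dst.duplicate();
            justRead.limit(dst.position());
            justRead.position(positionBefore);
            crc.update(justRead);
        }
        return bytesRead;
    }

    /** Only meaningful once the full header region has been read, in order. */
    long getValue() {
        return crc.getValue();
    }
}
```

Note that this only yields the header's checksum if the whole header region is read exactly once and in order, which is essentially the first downside raised in the next comment.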
I'll assume that is the preferred option and will have a look later then. Two downsides I can think of:
- It assumes the header will eventually be fully read. Is that always the case?
- We're acting on the data as the CRC is computed. That means if the CRC is wrong, your code has probably already broken by the time you reach the CRC check. Perhaps marginally better than ignoring it completely, though.
> It assumes the header will eventually be fully read. Is that always the case?
I believe so. The 7z header is fully parsed in the constructor of `SevenZFile`, which means the header is always fully read (see the short usage example after this comment). Currently we do not provide any method that reads only part of the 7z headers.
> We're acting on the data as the CRC is computed.
What do you mean by acting? Currently we only read from the `ByteBuffer`; we don't change any of it.
From my view, a CRC check is a good way (and maybe the only way) to detect corrupted 7z headers - especially for large 7z archives like yours.
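To illustrate the point about the constructor, a minimal usage example (the archive name is hypothetical): opening the archive parses the headers up front, so a header CRC problem would surface at construction time rather than while reading entry data.

```java
import java.io.File;
import java.io.IOException;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

public class ReadSevenZ {
    public static void main(String[] args) throws IOException {
        // Headers are fully parsed while constructing the SevenZFile.
        try (SevenZFile archive = new SevenZFile(new File("archive.7z"))) {
            SevenZArchiveEntry entry;
            while ((entry = archive.getNextEntry()) != null) {
                System.out.println(entry.getName());
            }
        }
    }
}
```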
> What do you mean by acting?
Just that branching is decided by the content of the data (e.g. checking header ID bytes then deciding what to do). That implies something quite unexpected could occur long before you reach the CRC value.
I suppose I'm saying that, from a code-safety point of view, you need to confirm the CRC before you use the data.
> Just that branching is decided by the content of the data (e.g. checking header ID bytes then deciding what to do).
So you are talking about whether or not to read the encoded header depending on the `nid` in the header. Am I right?
> That implies something quite unexpected could occur long before you reach the CRC value.
A little confused here. From my understanding, you are talking about the case where, when reading the `encodedHeader`, the `buf` is changed before the original `HeaderChannelBuffer` variable `buf` is fully read - which means the CRC of the whole original header may never be checked, as the `HeaderChannelBuffer` is not fully read out. Am I right here?
Run out of time for now so I'll be brief (will be back in some hours)!
> So you are talking about whether or not to read the encoded header depending on the `nid` in the header.
Yes, that's one possible place a decision is made based on the data of the header. There may be more; I haven't checked all possibilities.
> you are talking about when reading the `encodedHeader`
Actually it applies to a "normal" header too, because that's also swapped for a `HeaderBuffer`.
> the `buf` is changed before the original `HeaderChannelBuffer` variable `buf` is fully read
I think there are more places than that (again, needs checking).
I see. You are saying that in some cases the `HeaderChannelBuffer` may be changed before it's fully read, and then the CRC checksum cannot be computed anyway.
> in some cases the `HeaderChannelBuffer` may be changed before it's fully read
Not really, no. After digging a little deeper into the 7z format, it seems that particular CRC only applies to the `nextHeader`, which is most often (maybe always?) the header at the very end of the archive, a.k.a. the "End Header". That's only a 42-byte header in the 1.2TB archive I had trouble with.
If I understand correctly, we could retain the existing capability (a CRC check for the end header) by simply enforcing a `HeaderInMemoryBuffer` for that part, and still keep the ability to handle a much bigger Compressed Metadata Block (see the sketch after this comment).
I'm going to come back to this again later so I can think about it (and get some sleep). Comments welcome in the mean time of course!
Edit: That's done. CRC check is now exactly as before, but we still keep large header support.
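A rough sketch of that approach (a standalone sketch, not the actual patch, which per the comment above enforces a `HeaderInMemoryBuffer` for this part): the small end header is pulled fully into memory and its CRC verified as before, while the large compressed metadata block stays channel-backed.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.util.zip.CRC32;

// Sketch only: read the small end header fully into memory and verify its CRC as before;
// the large compressed metadata block would instead stay channel-backed.
final class EndHeaderReader {
    static ByteBuffer readAndVerifyEndHeader(SeekableByteChannel channel, long offset,
                                             int size, long expectedCrc) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        channel.position(offset);
        while (buf.hasRemaining()) {
            if (channel.read(buf) < 0) {
                throw new EOFException("Truncated 7z end header");
            }
        }
        buf.flip();
        CRC32 crc = new CRC32();
        crc.update(buf.duplicate()); // duplicate so the caller still gets an unread buffer
        if (crc.getValue() != expectedCrc) {
            throw new IOException("7z end header CRC does not match");
        }
        return buf; // small, fully in memory, CRC-checked
    }
}
```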
Hey @akelday, could you please rebase and squash some of the commits? Some commits modify the same code and should be squashed.
Hi @PeterAlfredLee - squashed to a single commit from latest master, I hope that's OK. Not sure what's going on with coverage, I'll have to look later because it's late now...
Edit: possibly a resource (InputStream) is no longer closed correctly, so I will need to investigate!
Update: there is a definite problem with that (I'll put more detail in the Jira), so this is not OK to merge.
> Not sure what's going on with coverage, I'll have to look later because it's late now...
Take it easy. The coverage tool sometimes reports a rate that is not accurate.
Please rebase on master. Recent changes should allow all builds to be green including Java 16 and 17-EA.