commons-compress
COMPRESS-514: SevenZFile header buffers over 2G
This is a simple way to enable reading of SevenZFile headers over 2GiB. It also allows a much smaller memory footprint for even the largest headers. The CRC handling needs work/input, because it's only supported for headers held entirely in memory.
When the complete header is not held entirely in memory, that leaves several options - for example, reading the header in two passes, reading/parsing the header fully while computing the CRC (which somewhat defeats the purpose), or simply ignoring it.
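To make the channel-backed idea concrete, here is a minimal sketch (the `ChannelHeaderReader` class and its method are illustrative assumptions, not code from this PR): a Java `byte[]` is indexed by `int`, so a header larger than roughly 2 GiB cannot be held in a single in-memory array, but arbitrary slices of it can be read on demand from the `SeekableByteChannel`, which also keeps the memory footprint small.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;

// Illustrative only - not the PR's actual code.
final class ChannelHeaderReader {
    private final SeekableByteChannel channel;
    private final long headerStart; // absolute offset of the header within the archive

    ChannelHeaderReader(SeekableByteChannel channel, long headerStart) {
        this.channel = channel;
        this.headerStart = headerStart;
    }

    /** Reads a little-endian 64-bit value at the given offset inside the header. */
    long readUInt64At(long offsetInHeader) throws IOException {
        ByteBuffer slice = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        channel.position(headerStart + offsetInHeader);
        while (slice.hasRemaining()) {
            if (channel.read(slice) < 0) {
                throw new EOFException("7z header truncated");
            }
        }
        slice.flip();
        return slice.getLong();
    }
}
```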
Coverage decreased (-0.08%) to 87.162% when pulling 54be6c4f5ae59b0d6174f36de6500496cdf5ea99 on akelday:COMPRESS-514-SevenZFile into a5ccbd6c55f8df73aa5c13f48aed8d363c722c3a on apache:master.
What's preventing us from implementing the CRC part in `HeaderChannelBuffer`? Can we update the CRC while reading from the channel? You can refer to `CRC32VerifyingInputStream`, which has a similar implementation.
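For illustration, a rough sketch of updating a CRC32 while reading from the channel, in the spirit of `CRC32VerifyingInputStream` (the wrapper class below is hypothetical, not part of this PR):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.zip.CRC32;

// Hypothetical sketch: fold bytes into a running CRC as they are read from the channel.
final class Crc32UpdatingChannel {
    private final ReadableByteChannel channel;
    private final CRC32 crc = new CRC32();

    Crc32UpdatingChannel(ReadableByteChannel channel) {
        this.channel = channel;
    }

    /** Reads into dst and updates the running CRC with the bytes just read. */
    int read(ByteBuffer dst) throws IOException {
        int positionBefore = dst.position();
        int bytesRead = channel.read(dst);
        if (bytesRead > 0) {
            // Re-slice exactly the region that was just filled and feed it to the CRC.
            ByteBuffer justRead = dst.duplicate();
            justRead.limit(dst.position());
            justRead.position(positionBefore);
            crc.update(justRead);
        }
        return bytesRead;
    }

    /** Only meaningful once the full header region has been read, in order. */
    long getValue() {
        return crc.getValue();
    }
}
```

Note that this only yields the header's checksum if the whole header region is read exactly once and in order, which is essentially the first downside raised in the next comment.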
I'll assume that is the preferred option and will have a look later then. Two downsides I can think of:
- It assumes the header will eventually be fully read. Is that always the case?
- We're acting on the data as the CRC is computed. That means if the CRC is wrong, your code has probably already broken by the time you reach the CRC check. Perhaps marginally better than ignoring it completely, though.
> It assumes the header will eventually be fully read. Is that always the case?
I believe so. The 7z header is fully parsed in the constructor of `SevenZFile`, which means the header is always fully read (see the short usage example after this comment). Currently we do not provide any method that reads only part of the 7z headers.
> We're acting on the data as the CRC is computed.
What do you mean by acting? Currently we only read from the `ByteBuffer`; we don't change any of it.
From my view, a CRC check is a good way (and maybe the only way) to detect corrupted 7z headers - especially for large 7z archives like yours.
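To illustrate the point about the constructor, a minimal usage example (the archive name is hypothetical): opening the archive parses the headers up front, so a header CRC problem would surface at construction time rather than while reading entry data.

```java
import java.io.File;
import java.io.IOException;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;

public class ReadSevenZ {
    public static void main(String[] args) throws IOException {
        // Headers are fully parsed while constructing the SevenZFile.
        try (SevenZFile archive = new SevenZFile(new File("archive.7z"))) {
            SevenZArchiveEntry entry;
            while ((entry = archive.getNextEntry()) != null) {
                System.out.println(entry.getName());
            }
        }
    }
}
```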
> What do you mean by acting?
Just that branching is decided by the content of the data (e.g. checking header ID bytes then deciding what to do). That implies something quite unexpected could occur long before you reach the CRC value.
I suppose I'm saying that, from a code-safety point of view, you need to confirm the CRC before you use the data.
> Just that branching is decided by the content of the data (e.g. checking header ID bytes then deciding what to do).
So you are talking about whether or not to read the encoded header depending on the `nid` in the header. Am I right?
> That implies something quite unexpected could occur long before you reach the CRC value.
A little confused here. From my understanding, you are talking about the case where, when reading the `encodedHeader`, the `buf` is changed before the original `HeaderChannelBuffer` variable `buf` is fully read - which means the CRC of the whole original header may never be checked, as the `HeaderChannelBuffer` is not fully read out. Am I right here?
Run out of time for now so I'll be brief (will be back in some hours)!
> So you are talking about whether or not to read the encoded header depending on the `nid` in the header.
Yes, that's one possible place a decision is made based on the data of the header. There may be more; I haven't checked all possibilities.
> you are talking about when reading the `encodedHeader`
Actually it applies to a "normal" header too, because that's also swapped for a `HeaderBuffer`.
> the `buf` is changed before the original `HeaderChannelBuffer` variable `buf` is fully read
I think there are more places than that (again, needs checking).
I see. You are saying that in some cases the `HeaderChannelBuffer` may be changed before it's fully read, and then the CRC checksum cannot be computed anyway.
> in some cases the `HeaderChannelBuffer` may be changed before it's fully read
Not really, no. After digging a little deeper into the 7z format, it seems that particular CRC only applies to the `nextHeader`, which is most often (maybe always?) the header at the very end of the archive, a.k.a. the "End Header". That's only a 42-byte header in the 1.2TB archive I had trouble with.
If I understand correctly, we could retain the existing capability (a CRC check for the end header) by simply enforcing a `HeaderInMemoryBuffer` for that part, and still keep the ability to handle a much bigger Compressed Metadata Block (see the sketch after this comment).
I'm going to come back to this again later so I can think about it (and get some sleep). Comments welcome in the mean time of course!
Edit: That's done. CRC check is now exactly as before, but we still keep large header support.
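A rough sketch of that approach (a standalone sketch, not the actual patch, which per the comment above enforces a `HeaderInMemoryBuffer` for this part): the small end header is pulled fully into memory and its CRC verified as before, while the large compressed metadata block stays channel-backed.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.util.zip.CRC32;

// Sketch only: read the small end header fully into memory and verify its CRC as before;
// the large compressed metadata block would instead stay channel-backed.
final class EndHeaderReader {
    static ByteBuffer readAndVerifyEndHeader(SeekableByteChannel channel, long offset,
                                             int size, long expectedCrc) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        channel.position(offset);
        while (buf.hasRemaining()) {
            if (channel.read(buf) < 0) {
                throw new EOFException("Truncated 7z end header");
            }
        }
        buf.flip();
        CRC32 crc = new CRC32();
        crc.update(buf.duplicate()); // duplicate so the caller still gets an unread buffer
        if (crc.getValue() != expectedCrc) {
            throw new IOException("7z end header CRC does not match");
        }
        return buf; // small, fully in memory, CRC-checked
    }
}
```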
Hey @akelday, could you please rebase and squash some of the commits? Some commits modify the same code and should be squashed.
Hi @PeterAlfredLee - squashed to a single commit from latest master, I hope that's OK. Not sure what's going on with coverage, I'll have to look later because it's late now...
Edit: possibly a resource (InputStream) is no longer closed correctly, so I will need to investigate!
Update: there is a definite problem with that (I'll put more detail in the Jira), so this is not OK to merge.
> Not sure what's going on with coverage, I'll have to look later because it's late now...
Take it easy. The coverage tool sometimes reports a rate that is not accurate.
Please rebase on master. Recent changes should allow all builds to be green including Java 16 and 17-EA.