HDiffPatch

[Feature Request] Combine generated patch files

Open GrgDev opened this issue 3 years ago • 4 comments

Apologies if this is already covered by the current functionality, but I didn't see support for this in the usage documentation.

It would be nice if there was a way to derive/combine patches from previously existing patches without them being a naive concatenation.

For example, if I have a binary file with versions A, B, and C, and I have already generated patch files AB (the diff between A and B) and BC (the diff between B and C), I would like to be able to run a command like hdiffz --combine AB BC AC to generate a diff equivalent to the output of hdiffz A C AC.

GrgDev avatar Jul 18 '22 16:07 GrgDev

It should be possible to implement this feature, but the result would probably not be the best patch file: sizeof(combine(AB,BC)) >= sizeof(hdiffz(A,C))
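To illustrate why combining is possible in principle, here is a minimal sketch using a toy delta format made of "copy from old" and "add literal" operations. This is an assumption for illustration only, not HDiffPatch's actual patch format or algorithm, and all names (`apply_delta`, `slice_output`, `combine`) are hypothetical. Every copy in BC that reads a range of B is rewritten in terms of the AB operations that produced that range.

```python
from typing import List, Tuple

# Toy delta format (NOT the HDiffPatch format):
#   ("copy", offset_in_old, length)  copies bytes from the old file
#   ("add", literal_bytes)           inserts new bytes
Op = Tuple

def op_length(op: Op) -> int:
    """Number of output bytes an operation produces."""
    return op[2] if op[0] == "copy" else len(op[1])

def apply_delta(old: bytes, delta: List[Op]) -> bytes:
    """Rebuild the new file from the old file and a toy delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)

def slice_output(delta: List[Op], start: int, length: int) -> List[Op]:
    """Return ops that produce bytes [start, start+length) of delta's output."""
    result: List[Op] = []
    pos = 0
    end = start + length
    for op in delta:
        n = op_length(op)
        lo, hi = max(start, pos), min(end, pos + n)
        if lo < hi:  # this op overlaps the requested range
            skip, take = lo - pos, hi - lo
            if op[0] == "copy":
                result.append(("copy", op[1] + skip, take))
            else:
                result.append(("add", op[1][skip:skip + take]))
        pos += n
    return result

def combine(ab: List[Op], bc: List[Op]) -> List[Op]:
    """Compose A->B and B->C deltas into an A->C delta without building B."""
    ac: List[Op] = []
    for op in bc:
        if op[0] == "copy":
            ac.extend(slice_output(ab, op[1], op[2]))  # express B's bytes via A
        else:
            ac.append(op)
    return ac

a = b"hello world"
ab = [("copy", 0, 6), ("add", b"there")]   # B = b"hello there"
bc = [("add", b">> "), ("copy", 3, 8)]     # C = b">> lo there"
ac = combine(ab, bc)
```

In this toy example the combined delta carries every literal that either input needed, which is consistent with the size inequality above: a direct diff of A and C can search for better matches, while a combined delta can only reuse matches the two inputs already found.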

sisong avatar Jul 19 '22 03:07 sisong

It makes sense that sizeof(combine(AB,BC)) would be at least sizeof(hdiffz(A,C)), but I didn't expect significant file size inflation if an equivalent diff could be derived. That's interesting. I don't know the math behind this, so you would know best.

I should probably add the use case behind this request, in case there is a better approach. I was looking into a situation where there are many, many versions of a large binary file (from 4GB to 9GB across versions). Generating the full matrix of version-to-version patch files would be an expensive task, and the set of patches to generate would grow with every release. I figured that if each version instead had a single patch file generated between itself and the previous version, and if combining patches could be a cheap operation, then the patch between any two arbitrary versions could be generated on demand by combining the existing patches.
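To make the scaling concern concrete, here is a small sketch of the patch counts involved. It assumes only upgrade patches (older version to newer version) are needed; the function names are hypothetical.

```python
def full_matrix_patches(n_versions: int) -> int:
    """Total patches if every older version has a patch to every newer one."""
    return n_versions * (n_versions - 1) // 2

def chain_patches(n_versions: int) -> int:
    """Total patches if each version only has a patch from its predecessor."""
    return max(n_versions - 1, 0)

def new_patches_for_release(n_existing: int, full_matrix: bool) -> int:
    """Patches to generate when one new version is released."""
    if n_existing == 0:
        return 0
    return n_existing if full_matrix else 1

for n in (5, 20, 50):
    print(n, full_matrix_patches(n), chain_patches(n))
```

With the full matrix, each release requires one new patch per existing version, so the total grows quadratically; with a chain, each release adds exactly one patch.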

GrgDev avatar Aug 16 '22 17:08 GrgDev

I ran some simple experiments (hdiff outputs uncompressed data; 7zCompress is 7-Zip with a 128M dictionary):

AB=hdiff(A,B)
BC=hdiff(B,C)
AC=hdiff(A,C)
ABC=hdiff(AB,BC)

Result: 7zCompress(AC) << 7zCompress(ABC) < 7zCompress(AB)+7zCompress(BC). The result is not very good: the gain, 7zCompress(AB)+7zCompress(BC)-7zCompress(ABC), is relatively small.

Also, a true combine algorithm (as opposed to applying the two patches one after the other) is not easy to implement, or at least I don't know how.
The alternative is to create patches from multiple historical versions directly to the current latest version. The computational cost is relatively large, but it is still cost-effective compared to the patch size saved.
For large apps, some teams use hdiffz -s -c-zstd to save time and machine resources when creating patches; other teams configure machines with 64 CPU cores and 512GB of memory and use hdiffz -m to create the smallest patches.
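As a rough sketch of the "patch each historical version to the latest" approach, the helper below only builds hdiffz argument lists; it does not run anything. It assumes the positional order old, new, output shown earlier in this thread (hdiffz A C AC) and uses only the option sets mentioned above. The file names and the function name are hypothetical.

```python
from typing import List

def build_hdiffz_commands(old_versions: List[str], latest: str,
                          smallest: bool = False) -> List[List[str]]:
    """Build one hdiffz command per historical version, targeting the latest.

    smallest=False uses "-s -c-zstd" (saves time and machine resources);
    smallest=True uses "-m" (smallest patches, needs far more memory).
    """
    options = ["-m"] if smallest else ["-s", "-c-zstd"]
    commands = []
    for old in old_versions:
        out = f"{old}_to_{latest}.patch"  # hypothetical naming scheme
        commands.append(["hdiffz", *options, old, latest, out])
    return commands

# Example: two historical versions patched to the newest release.
for cmd in build_hdiffz_commands(["v1.bin", "v2.bin"], "v3.bin"):
    print(" ".join(cmd))
```

Each returned list could be passed to something like `subprocess.run(cmd, check=True)` on a machine where hdiffz is installed.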

sisong avatar Aug 20 '22 03:08 sisong

Thank you for this feedback and information. This is really helpful!

GrgDev avatar Aug 22 '22 17:08 GrgDev

#rejected

sisong avatar Oct 21 '22 04:10 sisong