RFE: Ability to build caidx from catar

Open charles-dyfis-net opened this issue 6 years ago • 7 comments

For installation performance (streaming one file being considerably faster than a multitude of little ones), I was intending to generate a .catar as part of my build process, archiving it to a .caidx for regeneration of historical builds.

However, it doesn't appear to be possible to build a .caidx directly from a .catar without an interim unpack or FUSE-mounting step:

$ casync make test.caidx root.catar
Input is a regular file or block device, but attempted to make a directory archive. Refusing.
$ casync make --what=archive-index test.caidx --what=archive root.catar
Input is a regular file or block device, but attempted to make a directory archive. Refusing.

Does this mean I'm obliged to generate a .caidx at build time and then build a .catar from that, if intending to have both as build products?

charles-dyfis-net, Oct 18 '17 18:10

Well, I also came across this drawback. From my point of view this is a feature casync definitely should support in the near future.

I would go even further and say that supporting extraction from existing archive formats such as tar would be a very useful feature (though I am not sure whether reproducibility would be an issue here).

If we really want to use casync out in the field, for deploying our Linux systems to IoT or other targets, we need a convenient way to use the artifacts that build systems such as Yocto, PTXdist or buildroot produce. They build either file-system images, which are not a problem for casync, or tar archives, which I currently have to tar-unpack and casync-make in a fakeroot shell (for my integration of casync into the RAUC update framework).

In the future we could teach those systems to build .catar archives, yes. But it does not feel like we want those build systems to care about index and chunk generation, as this might really depend on 'external' factors such as the infrastructure we intend to deploy our artifacts with. Thus a conversion from archive to index/chunks would be quite useful for using casync in further real-world scenarios.

Are there any plans to support archive conversion?

ejoerns, Jan 19 '18 15:01

@charles-dyfis-net well what does that provide me?

The similarity between a tar archive and either a directory tree or a block device on a target comes down to finding chunks that consist only of pieces of larger files, which is not ideal for saving bandwidth.

ejoerns, Jan 19 '18 16:01

I think one possible advantage is a more stable ordering of the files in a directory, which leads to more stable patterns across two similar file trees? To achieve the best result, this ordering should depend on the files themselves (probably their names), and not on the order in which they were created.
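To illustrate the ordering point, here is a minimal Python sketch (not casync code, and not a description of how casync orders entries). It walks a tree in an order that depends only on entry names, so two trees with the same content list identically regardless of creation order. The helper names `walk_sorted` and `make_tree` are made up for this example.

```python
import os
import tempfile

def walk_sorted(root):
    """Yield file paths relative to root, ordered by name only."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends in
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel.replace(os.sep, "/")

def make_tree(names):
    """Create a temporary tree, writing the files in the given order."""
    root = tempfile.mkdtemp()
    for name in names:
        path = os.path.join(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(name)
    return root

names = ["etc/hosts", "bin/sh", "etc/fstab", "bin/ls"]
# Same content, created in opposite orders.
tree_a = list(walk_sorted(make_tree(names)))
tree_b = list(walk_sorted(make_tree(list(reversed(names)))))
```

Both listings come out the same, which is the property a serializer needs if similar trees are to produce similar byte streams.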

gyscos, Jan 19 '18 18:01

Well, I am not sure if we are talking about the same things.

My use case is extracting a caidx file (with a remote chunk store) to a directory tree while using another directory tree as a seed for this.

The source for the file tree I would like to have chunked as a caidx is currently a tar.bz2 generated by my embedded Linux build system (OE in this case). What I currently do to achieve this is extract the tar and then run casync make on the resulting directory tree (both in the same fakeroot).

What I expect is that if the directory content of my tar equals the content of the directory tree I use for seeding, then the chunks of the caidx I generated will be identical to the calculated seed chunks, and I will not have to fetch any further chunks from the remote.

Simply chunking a tar file (i.e. generating a tar.caibx) should generate chunks that diverge heavily from the chunks of my directory tree. For a catar, on the other hand, I expect a different (much better) result, as - if I got it right - catar uses the exact same serialization that is used when chunking directory trees for a caidx.

A possible tar to caidx conversion I had in mind would have to use the tar library to extract the directory tree while re-serializing and chunking it.
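As a rough Python sketch of that idea (a toy, not casync's implementation): read the tar with a tar library, re-emit its members sorted by name in a simple made-up stream format, and chunk that stream. The stream format here is NOT the catar format, the windowed byte-sum cut rule is only a stand-in for a real rolling hash, and SHA-256 is used purely for illustration. All function names are hypothetical.

```python
import hashlib
import io
import tarfile

def build_tar(files, order):
    """Build an in-memory tar whose members appear in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()

def reserialize(tar_bytes):
    """Re-emit regular files sorted by name in a toy format (not catar)."""
    out = bytearray()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            name = member.name.encode()
            out += len(name).to_bytes(8, "little") + name
            out += len(data).to_bytes(8, "little") + data
    return bytes(out)

def chunk_ids(stream, window=16, mask=0x3F):
    """Toy content-defined chunker: cut where a windowed byte sum matches."""
    ids, start = [], 0
    for i in range(window, len(stream) + 1):
        hit = (sum(stream[i - window:i]) & mask) == mask
        if hit and i - start >= window:
            ids.append(hashlib.sha256(stream[start:i]).hexdigest())
            start = i
    if start < len(stream):
        ids.append(hashlib.sha256(stream[start:]).hexdigest())
    return ids

files = {"etc/hostname": b"device-01\n" * 50, "bin/app": bytes(range(256)) * 8}
tar_a = build_tar(files, ["etc/hostname", "bin/app"])
tar_b = build_tar(files, ["bin/app", "etc/hostname"])
ids_a = chunk_ids(reserialize(tar_a))
ids_b = chunk_ids(reserialize(tar_b))
```

The two tars differ byte for byte because their member order differs, yet the re-serialized streams, and therefore the chunk ID lists, are the same. A real conversion would additionally have to carry over metadata (modes, owners, timestamps, symlinks and so on), which this sketch ignores.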

ejoerns, Jan 19 '18 23:01

I get what you're saying now. I don't agree that the divergence you discuss is inevitable -- as mentioned, tar can be made stable -- but it's certainly some work to avoid. The use case you discuss didn't come to mind because I'm using a chunkstore cache rather than seeding to minimize bandwidth usage.

(Personally, by the way, I'm using uncompressed squashfs images, generated by a mksquashfs toolchain with some patches from the Tails project and signed with dm_verity data calculated against a fixed root, for embedded system distribution with minimal deltas. But that's only suitable if you have a read-only root.)

charles-dyfis-net, Jan 20 '18 00:01

On analysis, it looks like the only change needed to support this is copying the feature flags from the .catar into the generated .caidx. Since the catar header includes a 64-bit constant, it does seem fairly safe to rely on our ability to recognize a catar by it.
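A small Python sketch of what that could look like, for illustration only. The two type constants and the record layouts (little-endian 64-bit fields; an entry record that starts with size, type, feature flags; a 48-byte index header of size, type, feature flags, and min/avg/max chunk sizes) are assumptions written down from memory of casync's caformat.h and should be checked against the source. The function names and the sample values are hypothetical.

```python
import struct

# Assumption: constants and layouts as recalled from casync's caformat.h.
# Verify against the source before relying on them.
CA_FORMAT_ENTRY = 0x1396FABCEA5BBB51  # type tag of the first record in a catar
CA_FORMAT_INDEX = 0x96824D9C7B129FF9  # type tag of a caidx/caibx header

def catar_feature_flags(blob):
    """Return the feature flags if blob starts like a catar, else None."""
    if len(blob) < 24:
        return None
    _size, type_tag, flags = struct.unpack_from("<QQQ", blob, 0)
    if type_tag != CA_FORMAT_ENTRY:
        return None
    return flags

def index_header(feature_flags, chunk_min, chunk_avg, chunk_max):
    """Pack a 48-byte index header carrying the given feature flags."""
    return struct.pack("<QQQQQQ", 48, CA_FORMAT_INDEX,
                       feature_flags, chunk_min, chunk_avg, chunk_max)

# Hypothetical first record of a catar: size, type, feature flags,
# then mode, flags, uid, gid, mtime (all values made up).
fake_entry = struct.pack("<QQQQQQQQ", 64, CA_FORMAT_ENTRY, 0x1234,
                         0o40755, 0, 0, 0, 0)

flags = catar_feature_flags(fake_entry)
# Chunk size limits below are example values, not taken from the thread.
header = index_header(flags, 16384, 65536, 262144)
not_a_catar = catar_feature_flags(b"\x00" * 64)
```

Input that does not start with the expected type tag yields `None`, so a caller could fall back to treating the input as an opaque blob.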

charles-dyfis-net, Aug 02 '18 22:08

For anyone curious, the PR fixing the parallel issue in desync is folbricht/desync#56

charles-dyfis-net, Aug 27 '18 20:08