
Investigate file de-duplication performance for .tgz due to timestamps

Open memsharded opened this issue 4 months ago • 3 comments

The .tgz comes out different every time it is compressed, because the timestamps of the checked-out files end up inside the archive.
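A minimal sketch of the effect, using only the Python standard library and not Conan's actual compression code. The helper names (`build_tgz`, `sha256`) and the member name are made up for illustration. The gzip header timestamp is pinned with `mtime=0`, so the only thing that varies between archives is the mtime stored in the tar member:

```python
import gzip
import hashlib
import io
import tarfile


def build_tgz(content: bytes, file_mtime: int) -> bytes:
    """Pack one in-memory file into a .tgz, stamping the member with file_mtime."""
    buffer = io.BytesIO()
    # mtime=0 fixes the gzip header timestamp, so it cannot cause differences
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            info = tarfile.TarInfo(name="conanfile.py")  # hypothetical member
            info.size = len(content)
            info.mtime = file_mtime  # what a fresh checkout would change
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


content = b"same bytes in both checkouts\n"
first = sha256(build_tgz(content, file_mtime=1_700_000_000))
second = sha256(build_tgz(content, file_mtime=1_700_000_060))
same = sha256(build_tgz(content, file_mtime=1_700_000_000))

print(first == second)  # identical content, different mtime
print(first == same)    # identical content, identical mtime
```

With identical file contents, the two archives only hash the same when the stored mtime also matches, which is why checksum-based de-duplication misses them.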

In general this isn't an issue, because the revision system already de-duplicates and avoids transferring revisions that already exist. But if the revision is not fully indexed, this file-level de-dup might be triggered.

memsharded avatar Oct 20 '25 10:10 memsharded

This is one of the related tickets: https://github.com/conan-io/conan/issues/2729

Users there are requesting that the timestamps of artifacts be kept, not wiped. There is also a good analysis in https://github.com/conan-io/conan/issues/2729#issuecomment-1307510530 of what other package managers are doing, and it seems they keep the times in the compressed artifacts.

memsharded avatar Nov 05 '25 08:11 memsharded

It seems that git doesn't preserve mtimes either, so getting reproducible checksums would require removing the time information from the .tgz completely, but that would go against the other use cases requested.

memsharded avatar Nov 05 '25 09:11 memsharded

As can be seen in the POC in https://github.com/conan-io/conan/pull/19201, making the .tgz fully reproducible by removing both the time info and the user/group/mode info leaves the .tgz broken in many cases (see the CI). The user/group/mode part in particular seems to be the more problematic one here. Maybe that part could be omitted.
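As a rough illustration of the "omit that part" variant, here is a sketch that zeroes only the time info and leaves user/group/mode as read from disk. It is not the code from the PR and not Conan's API: `strip_time_only` and `pack_folder` are hypothetical names, and it assumes a flat folder packed with Python's `tarfile` and `gzip` modules:

```python
import gzip
import hashlib
import io
import os
import tarfile
import tempfile


def strip_time_only(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Zero the member mtime; uid/gid/uname/gname/mode are left untouched."""
    info.mtime = 0
    return info


def pack_folder(folder: str) -> bytes:
    """Pack the entries of folder into a .tgz with no time information."""
    buffer = io.BytesIO()
    # mtime=0 also removes the timestamp from the gzip header itself
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            for name in sorted(os.listdir(folder)):  # stable member order
                tar.add(os.path.join(folder, name), arcname=name,
                        filter=strip_time_only)
    return buffer.getvalue()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "conanfile.py")  # hypothetical file
    with open(path, "wb") as handle:
        handle.write(b"same bytes in both checkouts\n")
    expected_mode = os.stat(path).st_mode & 0o7777

    # Simulate two checkouts of the same content at different times
    os.utime(path, (1_700_000_000, 1_700_000_000))
    digest_a = hashlib.sha256(pack_folder(folder)).hexdigest()
    os.utime(path, (1_700_000_060, 1_700_000_060))
    digest_b = hashlib.sha256(pack_folder(folder)).hexdigest()

    with tarfile.open(fileobj=io.BytesIO(pack_folder(folder)), mode="r:gz") as tar:
        member = tar.getmember("conanfile.py")
        stored_mtime = member.mtime
        kept_mode = member.mode

print(digest_a == digest_b, stored_mtime, oct(kept_mode))
```

This only gives stable checksums on a single machine and user. Since uid/gid and user/group names are still recorded, archives built by different users can still differ, and extracted files would all carry the zeroed time, which is exactly what the timestamp-preservation requests argue against.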

memsharded avatar Nov 05 '25 10:11 memsharded