flit icon indicating copy to clipboard operation
flit copied to clipboard

Compute SOURCE_DATE_EPOCH automatically (e.g., from git)

Open paulproteus opened this issue 5 years ago • 9 comments

Hi @takluyver !

I was reading https://flit.readthedocs.io/en/latest/reproducible.html , and I was excited to see that reproducible builds are something you've considered with flit. I wondered if there was a way to make reproducibility automatic.

Toward that goal, I'm interested in working with you to design a user flow where flit can compute the SOURCE_DATE_EPOCH from git. I'm also interested in submitting a patch that implements this, with an approximate ETA of 2-3 weeks from when we decide how it should work, if that's helpful. I also have some other questions about this, which you'll find below.

I can think of a few implementation strategies:

  • flit could automatically check the git log for the most recent commit, and use the timestamp as the SOURCE_DATE_EPOCH, if this is a git repository (using the same logic as flit already uses).

  • flit publish (and/or flit build) could accept a flag called e.g. --source-date-from-vcs, so that users have to opt-in to this behavior.

  • flit publish (and/or flit build) could accept a flag called --use-timestamps, in which case, we use the current behavior of embedding timestamps (subject to SOURCE_DATE_EPOCH). If that flag is omitted, we use some zero timestamp as the timestamp in wheels and tar files.

I have a few related questions:

  • The doc at https://flit.readthedocs.io/en/latest/reproducible.html seems to imply that a sdist doesn't depend on the current time. My experiments show that the sdist built by flit build do contain timestamps of files. Should I change that doc? Happy to do so. (Context: I did flit build on a clone of this repo, then did tar -t -v --full-time -f dist/flit-2.2.0.tar.gz | grep 2020-01-31 | head -n1 to confirm that I got a value from today, even though the repo is unchanged since a couple of weeks ago.)

  • flit is embedding a timestamp of 2016-01-01 for its generated files in the wheel https://github.com/takluyver/flit/blame/93482e44d804b80c770203ec6c2e375170ece9ef/flit_core/flit_core/wheel.py#L173 but not for the user's code. Would it be better and simpler to use this 2016-01-01 timestamp for user code, too? This would also address my goal here, I believe. :)

  • If we go ahead with the git-based implementation strategy, would you want this to work equally well for hg as part of merging the patch? If so, I can commit to adding that to my patch before submitting it for your final review.

Happy to chat more, either here or somewhere lower-latency if you desire.

If this is a direction you're not interested in, it's okay if you close this. :)

Cheers!

paulproteus avatar Jan 31 '20 21:01 paulproteus

Hi Asheesh! Good to hear from you again. :-)

Yes, I think getting the source date from the most recent git commit sounds like a sensible thing to do. I'm still considering whether I want it to be the default behaviour. I'm leaning towards yes, because the timestamps from the files are probably not much use, but I'm not 100% convinced yet.

I only did the reproducibility work for wheels, not sdists. If Python packaging gains more general infrastructure for reproducible builds, I imagine wheels will be the focus, because that's where compiled artifacts can be. Sdists are meant to be source, so it should be easier to diff the unpacked files against e.g. a source tree from git (I've also been thinking about tools to make that easy). So I think reproducible sdists are less valuable. But if you want them and it doesn't make the code significantly harder to look at, I probably won't object.

Re 2016-01-01: I'm reluctant to overwrite real information (the file timestamps) with something entirely meaningless. It's a different calculation for the files flit generates, because they would otherwise have a different timestamp each time you flit build. Overwriting them with VCS timestamps seems more reasonable than with an arbitrary date.

I don't think we need an equivalent hg implementation in the first instance - it's a nice enhancement, but not an essential feature, so doing it with other VCSs can be left for another day.

takluyver avatar Feb 01 '20 21:02 takluyver

Great!

I'll just petition you once more for "don't store the dates in the build artifact". :) I think no one really needs the dates within the wheel; the version number should be enough for that. If the timestamps of the files aren't of much use, then we can accomplish the underlying goal with even less code, which is a nice feat.

I'm also still happy to build the git version-based version of this, so I'll intend to do that in the coming weeks unless I hear otherwise, and ping you when I have something for you to look at.

Cheers!

paulproteus avatar Feb 05 '20 21:02 paulproteus

Maybe. In principle, the filesystem timestamps still provide a kind of crude record if someone manages the master copy of the code on a single computer, with no version control. But that's obviously very limited, and a pain to make sense of. And it's common these days to put even small projects in git.

Let's build timestamps-from-git first and see how that goes for a while before changing what it does with no recognised VCS.

takluyver avatar Feb 09 '20 12:02 takluyver

+1

paulproteus avatar Feb 10 '20 07:02 paulproteus

Just a note that if you use zero timestamps, you will need to use strict_timestamps=False with zipfile to avoid errors. It might not be a bad idea to do that regardless.

jmroot avatar Nov 20 '21 19:11 jmroot

Yup, I actually had to fix that recently for timestamps from SOURCE_DATE_EPOCH (#448).

takluyver avatar Nov 20 '21 20:11 takluyver

Ah, interesting. I found that the ZipInfo.from_file call still fails when the file has a pre-1980 mtime. (Long story involving a not-quite-pax compatible unarchiver and the tarfile module.)

jmroot avatar Nov 20 '21 23:11 jmroot

Good point. I added the ZipInfo.from_file method while working on Flit, and I only found out about the 1980 limitation much more recently. So it makes sense that from_file doesn't work around that. :slightly_smiling_face:

takluyver avatar Nov 21 '21 11:11 takluyver

I ended up doing this to avoid the problem.

--- flit_core/wheel.py.orig	2021-11-15 01:34:30.000000000 +1100
+++ flit_core/wheel.py	2021-11-21 17:00:14.000000000 +1100
@@ -7,6 +7,7 @@
 import os
 import os.path as osp
 import stat
+import sys
 import tempfile
 from types import SimpleNamespace
 from typing import Optional
@@ -102,7 +103,10 @@
             rel_path = rel_path.replace(os.sep, '/')
 
         if self.source_time_stamp is None:
-            zinfo = zipfile.ZipInfo.from_file(full_path, rel_path)
+            if sys.version_info[:2] >= (3, 8):
+                zinfo = zipfile.ZipInfo.from_file(full_path, rel_path, strict_timestamps=False)
+            else:
+                zinfo = zipfile.ZipInfo.from_file(full_path, rel_path)
         else:
             # Set timestamps in zipfile for reproducible build
             zinfo = zipfile.ZipInfo(rel_path, self.source_time_stamp)

jmroot avatar Nov 21 '21 20:11 jmroot