createrepo
createrepo copied to clipboard
Add option to set repodata mtime to specific value
This allows to create repodata reproducibly (together with 'revision' option). One use case is setting this to SOURCE_DATE_EPOCH during installation iso compose, to make it reproducible.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1430662
This seems like a bad idea, you are just brute forcing the timestamp of the metadata no matter what the input is. If you want to go this way make it a "use this timestamp" instead of the clamp idea.
What I think you want is something more like setting the timestamp of the primary/etc. metadata based on the latest timestamp of the rpms (even this needs more thought though because there are problems here where you have a set of rpms generate metadata and then remove the newest rpm and regenerate the metadata ... and the timestamp goes backwards). The obvious solution is to also take any basedir directories mtime into account.
If you want to go this way make it a "use this timestamp" instead of the clamp idea.
What do you think about binding this with 'revision'? Like --revision XXX --set-timestamp-to-revision
? Both options needs to be used to get reproducible result anyway.
I was also considering only adjusting timestamp fields in repomd.xml, but skip os.utime
calls. But IMO that could be confusing.
(even this needs more thought though because there are problems here where you have a set of rpms generate metadata and then remove the newest rpm and regenerate the metadata ... and the timestamp goes backwards).
This is exactly why why I discarded this idea before (see linked bugzilla comment). As for basedir mtime - do you mean doing this by default, without separate option? Otherwise I don't see why it would be better than an option to force specific timestamp. That would only change how to provide timestamp value (touch --date ...
instead of an option).
A totally different solution could be skipping <timestamp>
nodes in repomd.xml at all. But I'm not sure if that's a valid thing to do.
So the main problem is people use the timestamp value as a meaningful value of "this always goes up, so newer is better". Eg. yum will use it to decide you've hit an old mirror and not download older data. mirrormanager also uses it to order metalink data (the SuSE metalink generator may do something similar). yum will also show the value as the "repo-updated" value in repoinfo.
Like --revision XXX --set-timestamp-to-revision?
If you prefer that I'm fine with it, either way. My main concern was the clamping would do confusing things in weird edge cases, so it'd be better to just always set it to a value instead of clamp.
I was also considering only adjusting timestamp fields in repomd.xml, but skip os.utime calls. But IMO that could be confusing.
I don't think yum cares about the timestamp on the files but I wouldn't trust that until someone looked at the code and did some tests.
As for basedir mtime - do you mean doing this by default, without separate option?
Yeh, like that seems a useful feature that multiple runs of createrepo will generate the exact same repodata ... and I can't even think of a reason to not do it.
A totally different solution could be skipping
nodes in repomd.xml at all. But I'm not sure if that's a valid thing to do.
My guess at that is that everything blows up in some horrible way, but I haven't tested that either :).
Unfortunately the basedir mtime idea wouldn't work that way, because running createrepo
alters that mtime (by creating new repodata
dir). I'll go with setting timestamp to fixed value, same as --revision.
Actually, maybe even separate --set-timestamp-to-revision
switch is not needed? All of it applies to files created during createrepo run, so their mtime will be "now". Always resetting mtime (and so <timestamp>
repomd.xml fields) to --revision shouldn't be a problem, as --revision defaults to "now". This may be even better, because all the repodata will look like modified at the same time, instead of affected by the time needed to compute them. And since repodata is created in temporary directory, it shouldn't result in any outside side effects.
The only effect would be inability to set custom --revision without affecting timestamps, which would matter if one use arbitrary number instead of timestamp there. Is that the case anywhere (doesn't look to be in Fedora at least)?
What do you think about it?
The only effect would be inability to set custom --revision without affecting timestamps, which would matter if one use arbitrary number instead of timestamp there. Is that the case anywhere (doesn't look to be in Fedora at least)? What do you think about it?
I agree with you that we should always strive to keep things simple so, with that in mind, setting mtime of the metadata files and <timestamp>
tags to --revision
seems like a good idea because then you don't need yet another option such as --set-timestamp-to-revision
to get fully reproducible builds.
OTOH, as you say, people may be using that field arbitrarily (all the man page says is "Arbitrary string for a repository revision.") and changing the default behavior could easily introduce unexpected regressions somewhere.
So, if we decide to go down the explicit route, one random idea to avoid having too many new options would be to change the option name --clamp-mtime-to
to just --clamp-timestamp
and its semantics to something like "clamp metadata revision and modification times to the value of the SOURCE_DATE_EPOCH
environment variable, to ensure reproducible builds". That looks like a good blend of configurable mtime/revision as well as keeping things (and options) simple, with the added bonus of supporting the actual specification out of the box.
It's basically what you proposed originally but without the need to specify the desired mtime twice on the command line; once for --clamp-mtime-to
and once for --revision
(it would read the env var instead, and set both the mtime and revision to that value).
One drawback would be that people might not be familiar with the specification and the whole idea, so the option might not make that much sense in such form. What do you think?
If going to support actual specification, mere presence of SOURCE_DATE_EPOCH
variable should trigger this behavior: Build processes MUST use this variable for embedded timestamps in place of the "current" date and time.
. It makes sense as in generic case, the createrepo/createrepo_c call could be buried deep in the build process. In this case, the caller would use "if SOURCE_DATE_EPOCH is set, add --clamp-mtime option" anyway. So I propose to not mix those two things and either:
- trigger this feature by mere presence of
SOURCE_DATE_EPOCH
(per specification), no extra option, or - use explicit value from command line option (either
--revision
, or else)
Here, I think the second version will be less confusing, as there are also other use cases than reproducibly building installation image.
Oh, you're right, thanks for the correction. I didn't read the spec carefully and missed the "MUST use" part.
The first approach of checking the env var presence looks better in this context, however, it's still a bit brittle since you could theoretically have such an env var defined (for whatever reason) without knowing anything about the spec or reproducible builds.
What I also didn't realize is that the linked bugzilla was not originally filed by you and it wasn't about this particular specification.
Given all that, I agree that simply supporting reproducible builds (regardless of the specification) is the better approach here.
I still think, though, that having a single option to achieve this (instead of --revision XXX --set-timestamp-to-revision
) would be nice. By using these two, it's not clear whether that's all that has to be "clamped" or if there are some other variables that could break the reproducibility. So having something like --clamp-timestamp
could make for a nicer UX in terms of ensuring that I really get reproducible builds.
So having something like
--clamp-timestamp
could make for a nicer UX in terms of ensuring that I really get reproducible builds.
I assume with explicit value to that option of course, right?
I'll adjust the patch. And then make similar patch to createrepo_c
...
So having something like
--clamp-timestamp
could make for a nicer UX in terms of ensuring that I really get reproducible builds.I assume with explicit value to that option of course, right? I'll adjust the patch. And then make similar patch to
createrepo_c
...
Yes :) (you can also name it --clamp-timestamp-to
if you wish)
One alternative that also came to mind would be that we would keep the --revision XXX --set-timestamp-to-revision
paradigm but document it in the man page (in a separate section such as "REPRODUCIBLE BUILDS" or something).
But honestly, I don't know what's better. @james-antill , any idea?
I'll adjust the patch. And then make similar patch to
createrepo_c
...
OK. I'll be watching what createrepo_c
thinks about that proposal as well before merging this so that we don't have it implemented in different ways.
Pushed update with --set-timestamp-to-revision
and its documentation.
createrepo_c PR: https://github.com/rpm-software-management/createrepo_c/pull/105