Add support for signing during sync
This commit adds a new stage, RpmArtifactSigningStage, to synchronize.py that signs the rpms as they're being synchronized from an upstream repo. The new stage only does the signing if both a signing service and fingerprint have been set for the repo.
This is fully functional, but there are some questions I have on how best to move forward before moving this out of draft.
- At the moment, signing will happen if the repo has a signing service and fingerprint, but this would be a behavior change with the potential of surprising users. Should this instead be activated as a new sync policy or two ("additive_sign", "mirror_content_only_sign")? Or should it be activated some other way?
- How do we ensure that we're not resyncing packages each time a sync is run? The checksums will obviously be different after redoing the signing, so should we add a field to the database that stores the original checksum?
I'm not sure if solving 2 is a hard requirement for this PR, but we really should decide on something for 1.
Hey, I've gone ahead and moved forward with new sync policies to solve (1), and it turns out that Pulp RPM synchronization already handles (2) by looking at the NEVRA, not the checksums.
This should be ready for review now.
Sorry for not answering the question (1) and (2) before (you started working) and thanks for the contribution! We were in a rush for the last weeks.
That's no problem at all! Thanks so much for the detailed review!
I've commented on a few points already, feel free to ask for any clarification. Also, I'd like to discuss your questions a bit more.
About (1): I've heard (and agreed) a general preference of the team to define the "sign policy" in the repository level instead of a per-call basis. Would that work for you or is there a case where you would need per-call control over it?
This would be absolutely fine. We have no requirement for per-call control over this. I was trying to make minimal structural changes, and sync_policy seemed to be the simplest place to change it. Having said that, I do agree that it makes more sense to be a repository level setting.
If that works, there are some options. First, how do you think the "auto-sign" behavior (like we have in upload) would be a surprise? It's not something that can be configured by accident. Anyway, if we agree its a surprise, an alternative could be adding a
package_signing_enableor similar to the repository, wdyt?
My main concern is that this would be a behavior change. Previously, if someone synced a repo, even if it had the signing service setup, nothing got signed. Now it would be signed. Having said that, there is a good argument that if someone has setup a signing service on a repo, it should be signed, so maybe I'm overthinking this. If you think it's not a big deal, I think I'm OK with dropping my second commit.
About (2): I'm not convinced it won't download and sign it again... You could write a test for this, asserting that no new packages are added to the repository on a second sync.
It sure didn't seem to download and sign again when running the test suite. Packages that had been synced unsigned in a previous test were not replaced with the signed versions in the signing test, and I spent way too long fighting with it before I realized it's because the old unsigned packages were still in the database. That's why I've added delete_orphans_pre to the tests I created.
As you suggest, it would probably be worth adding an explicit test for this.
It's worth mentioning that we have a recently opened issue for this (on the upload case) here. Ideally we should share these kind of package signing handlers between upload and sync, but I don't really think we have a solution in place for this. I would be surprised if we do!
edit: IHO (2) is not a hard requirement for the PR.
Yeah, let me think on this a bit more.
I'll also address the individual issues you brought up.
About (2): I'm not convinced it won't download and sign it again... You could write a test for this, asserting that no new packages are added to the repository on a second sync.
It sure didn't seem to download and sign again when running the test suite. Packages that had been synced unsigned in a previous test were not replaced with the signed versions in the signing test, and I spent way too long fighting with it before I realized it's because the old unsigned packages were still in the database. That's why I've added
delete_orphans_preto the tests I created.
So, after using this a bit in our Pulp instance, I've figured out what was happening. It turns out that it won't download and sign again if and only if the repodata hasn't changed at all. If even one package has been updated in the repo, all the of the packages will be re-signed.
I think the correct solution is to store the fingerprint of any RPM signatures in the database, so we can then ensure that we don't re-sign if the content hash hasn't changed and the correct RPM signature is in the database. What do you think?
So, after using this a bit in our Pulp instance, I've figured out what was happening. It turns out that it won't download and sign again if and only if the repodata hasn't changed at all. If even one package has been updated in the repo, all the of the packages will be re-signed.
That makes sense, we do have an optimization to skip basically the whole sync process if the metadata checksum is unchanged from the previous sync.
I think the correct solution is to store the fingerprint of any RPM signatures in the database, so we can then ensure that we don't re-sign if the content hash hasn't changed and the correct RPM signature is in the database. What do you think?
I assume what you mean is something along the lines of "keep a mapping of the pre-signed package checksums to the new, signed packages"?
I'm pitching in on @jdieter's behalf. We work together.
I think the correct solution is to store the fingerprint of any RPM signatures in the database, so we can then ensure that we don't re-sign if the content hash hasn't changed and the correct RPM signature is in the database. What do you think?
I assume what you mean is something along the lines of "keep a mapping of the pre-signed package checksums to the new, signed packages"?
Right. Store the checksum of the synced package (pre-signing) so that if we see that package again during sync it gets skipped. That's one option.
The other option is to look only at the NEVRA, and if it is already present, don't sign, don't import. It's kind of a feature that old packages don't get sly-updated without extra work; broken packages would have to be deleted from the repository before a new sync to update them. This is a workaround I can get behind, because we wouldn't want things to update unexpectedly anyway.
The other option is to look only at the NEVRA, and if it is already present, don't sign, don't import. It's kind of a feature that old packages don't get sly-updated without extra work; broken packages would have to be deleted from the repository before a new sync to update them. This is a workaround I can get behind, because we wouldn't want things to update unexpectedly anyway.
~As part of this PR, I've implemented this option. If you want me to do the full "keep a mapping of the pre-signed package checksums to the new, signed packages", I can do it, but it will be more invasive.~ UPDATED: I'm using skip package incorrectly. This is not ready for review yet.
The other option is to look only at the NEVRA, and if it is already present, don't sign, don't import. It's kind of a feature that old packages don't get sly-updated without extra work; broken packages would have to be deleted from the repository before a new sync to update them. This is a workaround I can get behind, because we wouldn't want things to update unexpectedly anyway.
Ok, this has been implemented (correctly, this time, I think). Tests are all passing, and I've added a couple more, though I also had to disable two that were expecting the package to change when the NEVRA is the same.
I'm fine if we want to go down the route of adding a field for the original pkgId (and we'll probably also need to add a field for the signing key, so when looking through artifacts with the same originalPkgId, we can determine which signing key was used on each of them), but I wanted to make sure that the logic in here is sane and that what I've done isn't totally off-the-wall.
I'm fine if we want to go down the route of adding a field for the original pkgId (and we'll probably also need to add a field for the signing key, so when looking through artifacts with the same originalPkgId, we can determine which signing key was used on each of them), but I wanted to make sure that the logic in here is sane and that what I've done isn't totally off-the-wall.
I think this would be great to have. We're also interested in not re-signing packages again. However, we've encountered the situation where two packages had the same NEVRA but different contents unfortunately. So checking by package hash digest would work for us whereas NEVRA wouldn't.