Add a dry-run option to dregsy
This code adds a new optional `--dry-run` parameter to dregsy's command line. It allows easier debugging of tag matching based on the specified configuration, without syncing everything to the target docker registry. Compared to a full sync, a dry run is:
- faster
- simpler
- cheaper
Great PR! Thanks! A dry-run is definitely a good idea.
Before going into review, could you please remove the commit for building on ARM64? This is a separate concern and should be addressed separately. I think it's best to first open an issue for that. I've been thinking about adding cross-compilation to the project recently, but would probably solve that by moving the build into `hack/devenvutil`, and just making parameterized calls to that from the `Makefile`.
Once the commit is removed, please squash the remaining two commits, and I'll then review.
Removed the M1-related changes and stuck to the dry-run feature.
Since I was in the code anyway, I've included additional information:
- show tags from the target registry
- show the subtraction (expanded tags) - (existing tags in target) ~ tags to be synced
- show the subtraction (existing tags in target) - (expanded tags) ~ tags not synced with the current config

This might also help to find problems and/or get information about the current state (and maybe clean up the target). The subtraction is sketched below.
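For illustration, here is a minimal sketch of the tag set subtraction described above, in plain Go; `diff` is a hypothetical helper, not part of dregsy:

```go
package main

import "fmt"

// diff returns the tags present in a but not in b, i.e. a - b as sets.
func diff(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, t := range b {
		seen[t] = true
	}
	var out []string
	for _, t := range a {
		if !seen[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	expanded := []string{"1.20", "1.21", "latest"} // expanded source tags
	existing := []string{"1.20", "1.19"}           // existing tags in target
	fmt.Println(diff(expanded, existing)) // tags to be synced: [1.21 latest]
	fmt.Println(diff(existing, expanded)) // not synced with current config: [1.19]
}
```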
I've just merged another PR, so you will have to rebase. Sorry about that! But there are not that many conflicts, so it shouldn't be too complicated.
Mea culpa
Saw it, so I just did it @xelalexv; the code is ready to be reviewed 😜
I did a first review. Here are my comments:
- I think we should push dry run handling into the relays. The code for expanding source tag lists is already present there, and auth & TLS verify work correctly. With the current implementation, getting tag lists may fail for certain registries, since `certDir` is not set.
- Going into the relays would additionally exercise all the mechanics involved, and just short-circuit the actual sync. Have a look at the branch I pushed, it's a sketch of what I have in mind (a rough sketch of the idea also follows after this list).
- Target tag lists and result output could then be done via a helper object. This could actually be set on the relay instead of just a dry run flag.
- When retrieving target tag lists, we need to consider the case where the target repo does not yet exist (we should not create it during a dry run). This is easy, since all expanded source tags would then apply; we just need to handle it properly.
- We could improve the output. The dry-run log output seems a bit hard to read, in particular when many tags are involved. Maybe we could create a JSON/YAML/diff file or whatever instead of writing to the log. That could then also be processed with generic tools.
- Make all interval tasks one-off. We may not want the dry-run to run indefinitely.
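To make the short-circuit idea concrete, here is a rough, self-contained sketch with hypothetical names; dregsy's actual relay interface differs. The point is that a dry run still exercises tag expansion and the surrounding mechanics, and only skips the final copy:

```go
package main

import "log"

type relay struct{}

// expandTags stands in for the tag list expansion already present in the relays.
func (r *relay) expandTags(src string, tags []string) ([]string, error) {
	return tags, nil // placeholder
}

// copyTags stands in for the actual image transfer (skopeo copy / docker push).
func (r *relay) copyTags(src, trgt string, tags []string) error {
	return nil // placeholder
}

func (r *relay) Sync(src, trgt string, tags []string, dryRun bool) error {
	expanded, err := r.expandTags(src, tags) // auth & TLS setup would apply here too
	if err != nil {
		return err
	}
	if dryRun {
		log.Printf("dry run: would sync %v from %s to %s", expanded, src, trgt)
		return nil // short-circuit: skip the actual sync
	}
	return r.copyTags(src, trgt, expanded)
}

func main() {
	r := &relay{}
	_ = r.Sync("source/repo", "target/repo", []string{"latest"}, true)
}
```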
I've just noticed that starting with Skopeo 1.8.0, the `sync` command which we use in dregsy also introduces a dry-run mode. I think we should take advantage of that once that Skopeo version is available via packages in Ubuntu and Alpine. This will have an impact on the design of this feature.
I've just switched to building our own Skopeo binary, and merged those changes. So we can now use latest Skopeo release and start experimenting with Skopeo's dry-run feature.
I'm going to rebase and give it a try, thanks @xelalexv 👌
I'll be away for the next two weeks, so no need to rush :wink:
In my initial attempt to push the `--dry-run` down to relays and...
On the skopeo relay I've tried to use skopeo's native `--dry-run` option, and the result is:
- it requires using the `sync` subcommand instead of `copy` (it reminded me of https://github.com/xelalexv/dregsy/issues/58)
- it requires reformatting the parameters if we use `sync`
- it is not as fast as the proposed solution, because it actually launches skopeo once per tag we would sync
On the docker relay:
- there is no native `--dry-run` option available, as the relay uses `docker push` directly; I guess we could "just print" the command if we'd like to implement some kind of dry-run execution.
Conclusion: maybe we should delegate the tag retrieval to the relays, but not the actual dry run execution.
I'm going to try it out and will share more details; for now I'm pushing the current state, which uses the native dry-run option only on the skopeo relay.
Ah, sorry about that! I was jumping to conclusions here. I guess I was too thrilled to see a dry run option in the Skopeo release notes :rofl: But we use `skopeo copy`, not `sync`, so that doesn't help for now. So I agree: we just push dry run into the relays, then short-circuit `copy` for Skopeo and pull & push for Docker, and make the output available in some form.
Interestingly, going over the code of the relays I found that the `docker` relay depends on the `skopeo` one, because docker does not support listing the tags of a repository. This makes me wonder whether the relays are the right place for the `dry-run` to happen. The biggest advantage is that it keeps things more flexible and adjustable on relay level, but the source of information is always skopeo's implementation 😌.
I wonder if we should implement (maybe outside of this PR) a custom relay function to list images without the skopeo dependency, using plain HTTP requests to the registry:
```sh
# list repositories
curl 'http://localhost:5000/v2/_catalog'
{"repositories":["mirror/nginx","nginx"]}

# list tags for nginx
curl 'http://localhost:5000/v2/nginx/tags/list'
{"name":"nginx","tags":["latest"]}

# list tags for mirror/nginx
curl 'http://localhost:5000/v2/mirror/nginx/tags/list'
{"name":"mirror/nginx","tags":["1.21.1-perl","1.20-perl"]}
```
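In Go, such a function could look roughly like the sketch below. It assumes an unauthenticated registry speaking the Registry HTTP API v2; a real implementation would also need token auth, TLS configuration, and pagination. All names are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// tagList mirrors the response of GET /v2/<name>/tags/list.
type tagList struct {
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// listTags fetches all tags of the given repository from the registry.
func listTags(registry, repo string) ([]string, error) {
	url := fmt.Sprintf("%s/v2/%s/tags/list", registry, repo)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		// repo does not exist yet in the target: treat as empty tag list
		return nil, nil
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}
	var tl tagList
	if err := json.NewDecoder(resp.Body).Decode(&tl); err != nil {
		return nil, err
	}
	return tl.Tags, nil
}

func main() {
	tags, err := listTags("http://localhost:5000", "mirror/nginx")
	if err != nil {
		panic(err)
	}
	fmt.Println(tags) // e.g. [1.21.1-perl 1.20-perl]
}
```

Note the 404 handling: that is exactly the "target repo does not yet exist" case mentioned in the review, where all expanded source tags would apply.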
~~⚠️ I just realised dry-run output generates a single file with only the latest task, newbie mistake 🤦~~ Adjusted, one file per task * mapping
@xelalexv I've just rebased the code, could we take a look at the current state? Remarks? Changes?
Just removed all the changes not related to the task at hand and squashed everything into a single commit 👌
Just rebased it again 👌
I'm preparing a release just now, so you'll unfortunately have to rebase once more, but it should be trivial. With the release out of the way, we can continue the review of this PR.
Rebased once more 👌
First my apologies for the really long delay! Right after my vacation, I was totally swamped with work.
Now for the PR. It's great progress! Here are a few comments on things we could improve further:
- Since I recently changed the `relay.Sync(...)` interface to take a `SyncOptions` struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but to add it to the `SyncOptions` when calling `relay.Sync()`.
- I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to `util.DumpMapAsJson` is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).
- We should also maintain a hierarchy in the collected data, e.g. instead of saying:
  `{ "task name": "task 1", "task index": "0", ... }`
  we could say:
  `{ "task": { "name": "task 1", "index": "0" }, ... }`
  I think this could make post-processing of dry run results a lot easier (a sketch of such a structure follows after this list).
- It should be possible to choose between YAML & JSON for output.
- Have you tried this with source/target registries that require authentication?
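For the hierarchy point above, hypothetical Go types mirroring the nested output could look like this; all type and field names are illustrative, not dregsy's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskInfo groups the task metadata under a single "task" key.
type TaskInfo struct {
	Name  string `json:"name"`
	Index int    `json:"index"`
}

// TaskResult is one task's dry-run result in hierarchical form.
type TaskResult struct {
	Task       TaskInfo `json:"task"`
	TagsToSync []string `json:"tagsToSync,omitempty"` // illustrative key
}

func main() {
	res := TaskResult{
		Task:       TaskInfo{Name: "task 1", Index: 0},
		TagsToSync: []string{"1.21", "latest"},
	}
	out, _ := json.MarshalIndent(res, "", "  ")
	fmt.Println(string(out)) // {"task":{"name":"task 1","index":0},...}
}
```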
Thanks for the improvement suggestions @xelalexv.
> Since I recently changed the `relay.Sync(...)` interface to take a `SyncOptions` struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the `SyncOptions` when calling `relay.Sync()`.

Does this mean we expect `dry-run` to be configured on task level? I mean, `SyncOptions` are on task level, right? 🤔
> I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to `util.DumpMapAsJson` is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).

> We should also maintain a hierarchy in the collected data, e.g. instead of saying:

Got it, let me try to do so.
> It should be possible to choose between YAML & JSON for output.

Could it be a further iteration?
> Have you tried this with source/target registries that require authentication?

Yes, I did, and it worked so far; actually, I've been using this development version for `dry-run` validations for the last few weeks.
> > Since I recently changed the `relay.Sync(...)` interface to take a `SyncOptions` struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the `SyncOptions` when calling `relay.Sync()`.
>
> Does this mean we expect `dry-run` to be configured on task level? I mean, `SyncOptions` are on task level, right?

Hadn't thought about that. Right now it's just about persisting the dry-run flag cleanly in only one place. Passing it each time via the `SyncOptions` would then open up a way to do per-task dry-runs, amongst other things. Whether a per-task dry-run is meaningful is a different story.
> > It should be possible to choose between YAML & JSON for output.
>
> Could it be a further iteration?

Of course. We can start with JSON, and then extend later. We just need to keep this in mind during the initial implementation.
> > Have you tried this with source/target registries that require authentication?
>
> Yes, I did, and it worked so far; actually, I've been using this development version for `dry-run` validations for the last few weeks.

Great. I'll try myself later on.
> I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to `util.DumpMapAsJson` is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above). We should also maintain a hierarchy in the collected data, e.g. instead of saying:

I'm trying to think about how to do this, and I now remember why I generated one file per task. Right now we call `s.relay.Sync(...)` once per task, which means that inside the function we only know about the task currently being executed. If we wanted to generate a map with all the tasks' results, we'd need to return the `--dry-run` results back to `./internal/pkg/sync/sync.go` and then write all the data at once.
Ideas? Comments?
> ... I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above) ...

> I'm trying to think about how to do this, and I now remember why I generated one file per task. Right now we call `s.relay.Sync(...)` once per task, which means that inside the function we only know about the task currently being executed. If we wanted to generate a map with all the tasks' results, we'd need to return the `--dry-run` results back to `./internal/pkg/sync/sync.go` and then write all the data at once. Ideas? Comments?
In the simplest form, the collector could for example just be a map that we create at the start of a dry-run, and which we set in `SyncOptions` each time we sync a task. Each task then adds its dry-run output under its key. When all tasks are done, we can post-process the collected outputs. We could take this further by making this map a type and attaching the post-processing functionality to it, roughly as sketched below.
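A minimal sketch of that collector idea, with hypothetical names (dregsy's actual `SyncOptions` and relay code are not shown here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DryRunCollector gathers per-task dry-run results under each task's key.
// Passed via SyncOptions, a non-nil collector would double as the dry-run flag.
type DryRunCollector struct {
	results map[string]interface{}
}

func NewDryRunCollector() *DryRunCollector {
	return &DryRunCollector{results: map[string]interface{}{}}
}

// AddTaskResult records the dry-run output of a single task under its key.
func (c *DryRunCollector) AddTaskResult(task string, result interface{}) {
	c.results[task] = result
}

// Dump post-processes all collected outputs at once, here as a single JSON
// document; a YAML variant could be added in a later iteration.
func (c *DryRunCollector) Dump() ([]byte, error) {
	return json.MarshalIndent(c.results, "", "  ")
}

func main() {
	col := NewDryRunCollector()
	// each relay.Sync(...) call would add its task's result:
	col.AddTaskResult("task 1", map[string]interface{}{"tagsToSync": []string{"latest"}})
	col.AddTaskResult("task 2", map[string]interface{}{"tagsToSync": []string{"1.21"}})
	out, _ := col.Dump()
	fmt.Println(string(out))
}
```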