Add a dry-run option to dregsy
This code adds a new optional `--dry-run` parameter to dregsy's command line. It allows easier debugging of tag matching based on the specified configuration, without syncing everything to the target docker registry. Compared to a full sync, a dry run is:
- faster
- simpler
- cheaper
Great PR! Thanks! A dry-run is definitely a good idea.
Before going into review, could you please remove the commit for building on ARM64? This is a separate concern and should be addressed separately. I think it's best to first open an issue for that. I've been thinking about adding cross-compilation to the project recently, but would probably solve that by moving the build into `hack/devenvutil`, and just making parameterized calls to that from the `Makefile`.
Once the commit is removed, please squash the remaining two commits, and I'll then review.
Removed the M1-related changes and stuck to the dry-run feature.
Since I was in the code anyway, I've included additional information:
- show tags from the target registry
- show the subtraction (expanded tags) - (existing tags in target) ~ tags to be synced
- show the subtraction (existing tags in target) - (expanded tags) ~ tags not synced with the current config

This might also help to find problems and/or get information about the current state (and maybe clean up the target). The subtraction is sketched below.
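For illustration, here is a minimal sketch of the tag set subtraction described above, in plain Go; `diff` is a hypothetical helper, not part of dregsy:

```go
package main

import "fmt"

// diff returns the tags present in a but not in b, i.e. a - b as sets.
func diff(a, b []string) []string {
	seen := make(map[string]bool, len(b))
	for _, t := range b {
		seen[t] = true
	}
	var out []string
	for _, t := range a {
		if !seen[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	expanded := []string{"1.20", "1.21", "latest"} // expanded source tags
	existing := []string{"1.20", "1.19"}           // existing tags in target
	fmt.Println(diff(expanded, existing)) // tags to be synced: [1.21 latest]
	fmt.Println(diff(existing, expanded)) // not synced with current config: [1.19]
}
```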
I've just merged another PR, so you will have to rebase. Sorry about that! But there are not that many conflicts, so it shouldn't be too complicated.
Mea culpa
Saw it, so I just did it @xelalexv; the code is ready to be reviewed 😜
I did a first review. Here are my comments:
- I think we should push dry run handling into the relays. The code for expanding source tag lists is already present there, and auth & TLS verify work correctly. With the current implementation, getting tag lists may fail for certain registries, since `certDir` is not set.
- Going into the relays would additionally exercise all the mechanics involved, and just short-circuit the actual sync. Have a look at the branch I pushed, it's a sketch of what I have in mind (a rough sketch of the idea also follows after this list).
- Target tag lists and result output could then be done via a helper object. This could actually be set on the relay instead of just a dry run flag.
- When retrieving target tag lists, we need to consider the case where the target repo does not yet exist (we should not create it during a dry run). This is easy, since all expanded source tags would then apply; we just need to handle it properly.
- We could improve the output. The dry-run log output seems a bit hard to read, in particular when many tags are involved. Maybe we could create a JSON/YAML/diff file or whatever instead of writing to the log. That could then also be processed with generic tools.
- Make all interval tasks one-off. We may not want the dry-run to run indefinitely.
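To make the short-circuit idea concrete, here is a rough, self-contained sketch with hypothetical names; dregsy's actual relay interface differs. The point is that a dry run still exercises tag expansion and the surrounding mechanics, and only skips the final copy:

```go
package main

import "log"

type relay struct{}

// expandTags stands in for the tag list expansion already present in the relays.
func (r *relay) expandTags(src string, tags []string) ([]string, error) {
	return tags, nil // placeholder
}

// copyTags stands in for the actual image transfer (skopeo copy / docker push).
func (r *relay) copyTags(src, trgt string, tags []string) error {
	return nil // placeholder
}

func (r *relay) Sync(src, trgt string, tags []string, dryRun bool) error {
	expanded, err := r.expandTags(src, tags) // auth & TLS setup would apply here too
	if err != nil {
		return err
	}
	if dryRun {
		log.Printf("dry run: would sync %v from %s to %s", expanded, src, trgt)
		return nil // short-circuit: skip the actual sync
	}
	return r.copyTags(src, trgt, expanded)
}

func main() {
	r := &relay{}
	_ = r.Sync("source/repo", "target/repo", []string{"latest"}, true)
}
```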
I've just noticed that starting with Skopeo 1.8.0, the `sync` command which we use in dregsy also introduces a dry-run mode. I think we should take advantage of that once that Skopeo version is available via packages in Ubuntu and Alpine. This will have an impact on the design of this feature.
I've just switched to building our own Skopeo binary, and merged those changes. So we can now use latest Skopeo release and start experimenting with Skopeo's dry-run feature.
I'm going to rebase and give it a try, thanks @xelalexv 👌
I'll be away for the next two weeks, so no need to rush :wink:
In my initial attempt to push the `--dry-run` down to relays and...
On the skopeo relay I've tried to use skopeo's native `--dry-run` option, and the result is:
- it requires using the `sync` subcommand instead of `copy` (it reminded me of https://github.com/xelalexv/dregsy/issues/58)
- it requires reformatting the parameters if we use `sync`
- it is not as fast as the proposed solution, because it actually launches skopeo once per tag we would sync
On the docker relay:
- there is no native `--dry-run` option available, as the relay uses `docker push` directly; I guess we could "just print" the command if we'd like to implement some kind of dry-run execution.
Conclusion: maybe we should delegate the tag retrieval to the relays, but not the actual dry run execution.
I'm going to try it out and will share more details; for now I'm pushing the current state, which uses the native dry-run option only on the skopeo relay.
Ah, sorry about that! I was jumping to conclusions here. I guess I was too thrilled to see a dry run option in the Skopeo release notes :rofl: But we use `skopeo copy`, not `sync`, so that doesn't help for now. So I agree: we just push dry run into the relays, then short-circuit `copy` for Skopeo and pull & push for Docker, and make the output available in some form.
Interestingly, going over the code of the relays I found that the `docker` relay depends on the `skopeo` one, because docker does not support listing the tags of a repository. This makes me wonder whether the relays are the right place for the `dry-run` to happen. The biggest advantage is that it keeps things more flexible and adjustable on relay level, but the source of information is always skopeo's implementation 😌.
I wonder if we should implement (maybe outside of this PR) a custom relay function to list images without the skopeo dependency, using plain HTTP requests to the registry:
```sh
# list repositories
curl 'http://localhost:5000/v2/_catalog'
{"repositories":["mirror/nginx","nginx"]}

# list tags for nginx
curl 'http://localhost:5000/v2/nginx/tags/list'
{"name":"nginx","tags":["latest"]}

# list tags for mirror/nginx
curl 'http://localhost:5000/v2/mirror/nginx/tags/list'
{"name":"mirror/nginx","tags":["1.21.1-perl","1.20-perl"]}
```
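In Go, such a function could look roughly like the sketch below. It assumes an unauthenticated registry speaking the Registry HTTP API v2; a real implementation would also need token auth, TLS configuration, and pagination. All names are illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// tagList mirrors the response of GET /v2/<name>/tags/list.
type tagList struct {
	Name string   `json:"name"`
	Tags []string `json:"tags"`
}

// listTags fetches all tags of the given repository from the registry.
func listTags(registry, repo string) ([]string, error) {
	url := fmt.Sprintf("%s/v2/%s/tags/list", registry, repo)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		// repo does not exist yet in the target: treat as empty tag list
		return nil, nil
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}
	var tl tagList
	if err := json.NewDecoder(resp.Body).Decode(&tl); err != nil {
		return nil, err
	}
	return tl.Tags, nil
}

func main() {
	tags, err := listTags("http://localhost:5000", "mirror/nginx")
	if err != nil {
		panic(err)
	}
	fmt.Println(tags) // e.g. [1.21.1-perl 1.20-perl]
}
```

Note the 404 handling: that is exactly the "target repo does not yet exist" case mentioned in the review, where all expanded source tags would apply.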
~~⚠️ I just realised dry-run output generates a single file with only the latest task, newbie mistake 🤦~~ Adjusted, one file per task * mapping
@xelalexv I've just rebased the code, could we take a look at the current state? Remarks? Changes?
Just removed all the changes not related to the task at hand and squashed everything into a single commit 👌
Just rebased it again 👌
I'm preparing a release just now, so you'll unfortunately have to rebase once more, but it should be trivial. With the release out of the way, we can continue the review of this PR.
Rebased once more 👌
First my apologies for the really long delay! Right after my vacation, I was totally swamped with work.
Now for the PR. It's great progress! Here are a few comments on things we could improve further:
- Since I recently changed the `relay.Sync(...)` interface to take a `SyncOptions` struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but to add it to the `SyncOptions` when calling `relay.Sync()`.
- I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to `util.DumpMapAsJson` is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).
- We should also maintain a hierarchy in the collected data, e.g. instead of saying:
  `{ "task name": "task 1", "task index": "0", ... }`
  we could say:
  `{ "task": { "name": "task 1", "index": "0" }, ... }`
  I think this could make post-processing of dry run results a lot easier (a sketch of such a structure follows after this list).
- It should be possible to choose between YAML & JSON for output.
- Have you tried this with source/target registries that require authentication?
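For the hierarchy point above, hypothetical Go types mirroring the nested output could look like this; all type and field names are illustrative, not dregsy's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskInfo groups the task metadata under a single "task" key.
type TaskInfo struct {
	Name  string `json:"name"`
	Index int    `json:"index"`
}

// TaskResult is one task's dry-run result in hierarchical form.
type TaskResult struct {
	Task       TaskInfo `json:"task"`
	TagsToSync []string `json:"tagsToSync,omitempty"` // illustrative key
}

func main() {
	res := TaskResult{
		Task:       TaskInfo{Name: "task 1", Index: 0},
		TagsToSync: []string{"1.21", "latest"},
	}
	out, _ := json.MarshalIndent(res, "", "  ")
	fmt.Println(string(out)) // {"task":{"name":"task 1","index":0},...}
}
```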
Thanks for the improvement suggestions @xelalexv.
> Since I recently changed the `relay.Sync(...)` interface to take a `SyncOptions` struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the `SyncOptions` when calling `relay.Sync()`.

Does this mean we expect `dry-run` to be configured on task level? I mean, `SyncOptions` are on task level, right? 🤔
> I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to `util.DumpMapAsJson` is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above).

> We should also maintain a hierarchy in the collected data, e.g. instead of saying:

Got it, let me try to do so.
> It should be possible to choose between YAML & JSON for output.

Could it be a further iteration?
> Have you tried this with source/target registries that require authentication?

Yes, I did, and it worked so far; actually, I've been using this development version for `dry-run` validations for the last few weeks.
> > Since I recently changed the `relay.Sync(...)` interface to take a `SyncOptions` struct, it's maybe cleaner now to not persist the dry run flag in relay instances, but add it to the `SyncOptions` when calling `relay.Sync()`.
>
> Does this mean we expect `dry-run` to be configured on task level? I mean, `SyncOptions` are on task level, right?

Hadn't thought about that. Right now it's just about persisting the dry-run flag cleanly in only one place. Passing it each time via the `SyncOptions` would then open up a way to do per-task dry-runs, amongst other things. Whether a per-task dry-run is meaningful is a different story.
> > It should be possible to choose between YAML & JSON for output.
>
> Could it be a further iteration?

Of course. We can start with JSON, and then extend later. We just need to keep this in mind during the initial implementation.
> > Have you tried this with source/target registries that require authentication?
>
> Yes, I did, and it worked so far; actually, I've been using this development version for `dry-run` validations for the last few weeks.

Great. I'll try myself later on.
> I'm not quite sure about dumping the dry run result into separate output files. I'd prefer a single output. Also, the instantiation of the map you're passing to `util.DumpMapAsJson` is the same in both relays. Since it's quite large, and the keys contained in the map may change as we develop dry run further, the code duplication is a concern. I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above). We should also maintain a hierarchy in the collected data, e.g. instead of saying:

I'm trying to think about how to do this, and I now remember why I generated one file per task. Right now we call `s.relay.Sync(...)` once per task, which means that inside the function we only know about the task currently being executed. If we wanted to generate a map with all the tasks' results, we'd need to return the `--dry-run` results back to `./internal/pkg/sync/sync.go` and then write all the data at once.
Ideas? Comments?
> ... I think it's justifiable to create a dedicated type for dry run result collection, which could be passed in the sync options, and could also function as the dry run flag (see above) ...

> I'm trying to think about how to do this, and I now remember why I generated one file per task. Right now we call `s.relay.Sync(...)` once per task, which means that inside the function we only know about the task currently being executed. If we wanted to generate a map with all the tasks' results, we'd need to return the `--dry-run` results back to `./internal/pkg/sync/sync.go` and then write all the data at once. Ideas? Comments?
In the simplest form, the collector could for example just be a map that we create at the start of a dry-run, and which we set in `SyncOptions` each time we sync a task. Each task then adds its dry-run output under its key. When all tasks are done, we can post-process the collected outputs. We could take this further by making this map a type and attaching the post-processing functionality to it, roughly as sketched below.
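A minimal sketch of that collector idea, with hypothetical names (dregsy's actual `SyncOptions` and relay code are not shown here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DryRunCollector gathers per-task dry-run results under each task's key.
// Passed via SyncOptions, a non-nil collector would double as the dry-run flag.
type DryRunCollector struct {
	results map[string]interface{}
}

func NewDryRunCollector() *DryRunCollector {
	return &DryRunCollector{results: map[string]interface{}{}}
}

// AddTaskResult records the dry-run output of a single task under its key.
func (c *DryRunCollector) AddTaskResult(task string, result interface{}) {
	c.results[task] = result
}

// Dump post-processes all collected outputs at once, here as a single JSON
// document; a YAML variant could be added in a later iteration.
func (c *DryRunCollector) Dump() ([]byte, error) {
	return json.MarshalIndent(c.results, "", "  ")
}

func main() {
	col := NewDryRunCollector()
	// each relay.Sync(...) call would add its task's result:
	col.AddTaskResult("task 1", map[string]interface{}{"tagsToSync": []string{"latest"}})
	col.AddTaskResult("task 2", map[string]interface{}{"tagsToSync": []string{"1.21"}})
	out, _ := col.Dump()
	fmt.Println(string(out))
}
```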