
docker: new `docker_mirror_images` target.

Open kaos opened this issue 1 year ago • 13 comments

New docker_mirror_images target to make it easy to mirror images between Docker registries.
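
For illustration, a minimal BUILD-file sketch of how such a target might be used; the sources and overrides fields match how the macro further down in this thread invokes it, while the spec file name and the override key are made-up assumptions:

docker_mirror_images(
    name="mirror",
    # Spec file(s) listing the images to mirror (hypothetical file name).
    sources=["mirror-images.txt"],
    # Per-image field overrides for the generated targets (hypothetical key/value).
    overrides={
        "docker.io/library/redis": {"tags": ["publish"]},
    },
)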

kaos avatar Jan 09 '24 15:01 kaos

We've been using this target with great success for over 2 years at Svenska Spel now.. it's time we upstream it ;)

kaos avatar Jan 09 '24 15:01 kaos

For more advanced mirroring needs, you could derive a Pants macro along the lines of what we have done, one that either uses this new mirror target or merely constructs a regular docker_image target based on what's required. (There are some in-house specifics here, so adjust your eyes accordingly; this is merely provided as inspiration/example. In particular, the "publish" tags only affect some metadata that later stages in our pipeline consume):

def mirror_external(
    image: str | None = None,
    *,
    images_file: str | None = None,
    images_files: list[str] | None = None,
    version: str | None = None,
    latest: bool = False,
    append_build_number: bool = True,
    append_build_number_separator: str = "-",
    publish: bool = False,
    publish_images: list[str] | None = None,
    **kwargs,
):
    if images_file:
        # Normalize the single-file convenience argument to the list form.
        assert (
            images_files is None
        ), "Decide if you want to use `images_file` or `images_files`."
        images_files = [images_file]

    if images_files is not None:
        # Spec-file mode: per-image data must go into `overrides`, not macro arguments.
        # `publish or None` and `append_build_number and None` evaluate to None for the
        # default values, so the assert only trips when they differ from their defaults.
        for arg in [image, version, publish or None, append_build_number and None]:
            assert arg is None, (
                "Cannot use image-specific arguments with `images_file` or `images_files`. Put "
                "image-specific data into the `overrides` dict. See the README for examples."
            )

        overrides = kwargs.pop("overrides", {})

        assert not kwargs, (
            "Unknown additional field values provided when using images spec file(s): "
            + ", ".join(kwargs.keys())
        )

        if publish_images:
            for img in publish_images:
                overrides.setdefault(img, {}).setdefault("tags", []).append("publish")

        docker_mirror_images(sources=images_files, overrides=overrides)
        return

    if version is None:
        # Images pinned by digest or tracking `latest` get their version from the
        # DEFAULT_VERSION build arg; otherwise the base image's own tag is reused.
        if image and (image.endswith(":latest") or "@" in image):
            version = "{build_args.DEFAULT_VERSION}"
        else:
            version = "{tags.baseimage}"

    if append_build_number:
        version += append_build_number_separator + "{build_args.BUILD_NUMBER}"

    field_values = {
        "instructions": [f"FROM {image}", *kwargs.pop("instructions", [])]
        if image
        else None,
        "image_tags": [
            version,
            *kwargs.pop("image_tags", []),
        ],
        "tags": (["publish"] if publish else []) + kwargs.pop("tags", []),
    }

    if latest:
        field_values["image_tags"].append("latest")

    docker_image(**field_values, **kwargs)
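
For illustration, a couple of hypothetical BUILD-file invocations of the macro above (the image names and spec file are made up; note that the spec-file form takes no extra per-image fields, those have to go into `overrides`):

# Single image: tagged with the base image's tag plus a build number,
# and given the in-house "publish" tag.
mirror_external(
    name="redis",
    image="docker.io/library/redis:7.2.3",
    publish=True,
)

# Spec-file driven: creates a docker_mirror_images target from the listed
# images, with the redis image additionally tagged for publishing.
mirror_external(
    images_files=["mirror-images.txt"],
    publish_images=["docker.io/library/redis"],
)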

kaos avatar Jan 09 '24 16:01 kaos

👀 I've got something baking around the OCI backend here, and while I've definitely applied some opinions to it, I figured I'd share it as a comparison.

oci_mirror_images(
    name="python-slim-bullseye",
    source_repository="docker.io/library/python",
    destination_repository="europe-docker.pkg.dev/<snip>",
    variants={  # new tag -> source sha
        "3-10-slim-bullseye": "dd28b1d53ab42285edc6534e75a878243868294e012ab1e5b9f020eae19cd16e",
        ...
    }
)

So this focuses more on mirroring multiple variants of a single image, though one could obviously make a higher-level target that mirrors multiple paths. I know nothing about docker the tool, but I'm wondering if the choice to represent this as a build step also means you might get a sha256 change even when one is unnecessary (plus wasted work)? Mirroring should be nothing more than copying a few blobs from one server to another, so in theory there should be no delta on the actual files.

tgolsson avatar Jan 09 '24 21:01 tgolsson

Thanks Tom, that actually looks cleaner. The thing I've posted here is driven by the use cases we needed to support internally, and I wanted a syntax that allowed minimal BUILD file boilerplate for the basic cases, with very easy syntax in the spec files. Being a build step does incur a download/upload overhead, but there's no good alternative using the docker tool, as there's no registry-to-registry copy operation built in, to my knowledge. As there are no new layers introduced, the image sha is unchanged; however, there's no cache to avoid re-mirroring on multiple runs. We rely on Pants' "changed since" logic to only mirror a subset of the images, based on which spec/BUILD files were changed. Also, in some cases we introduce some config files or other changes into the mirrored images (so technically it's not a mirrored image any more, but one adapted for our uses).

kaos avatar Jan 10 '24 08:01 kaos

Interesting, thanks for the explanation! I'm wondering if those other use cases are better represented as other steps? I think, bar the oci_mirror_image, the closest equivalent in my plugin would be an oci_image_pull + oci_image_build, where the latter can be published. But the former doesn't apply in a Docker world, because you have a daemon and an external cache to contend with in a non-sandboxed world, and pulling can happen either way.

There is indeed no registry-to-registry copy operation, and no tool (even those that support mirroring) does it that way directly, to my knowledge. It has to go via local. But there's no reason it has to be a build step. The shortest semantic route to that would be docker pull -> docker tag -> docker push. That is much more explicit about not wanting any work to be done. It seems to me that if one wants config, one uses a build step, and if one wants a pure mirror, one uses a mirror step.

As a final note: to me, it seems like the more one can keep in a single file, the better. So since Pants uses BUILD files anyways, I'd be hard-pressed to find a good reason for a custom-format text file with barely any content.

As there are no new layers introduced, the image sha is unchanged; however, there's no cache to avoid re-mirroring on multiple runs.

Pardon my ignorance about the docker tool, but I'm wondering if this is categorically true. F.ex. if the upstream image is an OCI image, does invoking a build step (however empty) keep that, or will it pack it into a docker/v2 image after? What if it's v1? Does Docker add any metadata purely from invoking the build step? etc. Not saying you're wrong, just curious.

tgolsson avatar Jan 15 '24 14:01 tgolsson

@fromfile should work for args and env vars too.

benjyw avatar Jan 17 '24 16:01 benjyw

@fromfile should work for args and env vars too.

You missed my point: that's still a "configuration" option, and you commented on a target field type.

kaos avatar Jan 18 '24 09:01 kaos

Oh I see. Yes, that is a global config option. But you can give them names in the config dict and refer to them by name.

My point was that you've invented another way to do reusable named config (with the path being the name) and I'm not sure we need to?

benjyw avatar Jan 18 '24 15:01 benjyw

Interesting, thanks for the explanation! I'm wondering if those other use cases are better represented as other steps? I think, bar the oci_mirror_image, the closest equivalent in my plugin would be an oci_image_pull + oci_image_build, where the latter can be published. But the former doesn't apply in a Docker world, because you have a daemon and an external cache to contend with in a non-sandboxed world, and pulling can happen either way.

There is indeed no registry-to-registry copy operation, and no tool (even those that support mirroring) does it that way directly, to my knowledge. It has to go via local. But there's no reason it has to be a build step. The shortest semantic route to that would be docker pull -> docker tag -> docker push. That is much more explicit about not wanting any work to be done. It seems to me that if one wants config, one uses a build step, and if one wants a pure mirror, one uses a mirror step.

Good point. I agree that the pull+tag+push route is an optimization worth exploring (but I would prefer to do that in a follow-up). Given that the publish goal works with built artifacts, even the non-build version needs to masquerade as one (hence why I called it an optimization).

As a final note: to me, it seems like the more one can keep in a single file, the better. So since Pants uses BUILD files anyways, I'd be hard-pressed to find a good reason for a custom-format text file with barely any content.

I have two main motivations for this:

a) Invalidation. If I have many targets in one BUILD file, changing any one of them will invalidate all, resulting in wasted work for all unchanged targets.

b) It's not uncommon to get the list of images to mirror as output from some tool or script, which makes it far preferable to simply pipe that to a file without further processing (or at least no more processing than turning it into the file format used for the spec files here, which is easier than merging the data into its proper location in a BUILD file amidst everything else):

# Example:
$ extract-images some-helm-chart > mirror-images.txt

(If it were a one-off, perhaps not so much, but consider the case where we want to update the list for a new version of the chart, or whatever the source may be.)
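
(For illustration, such a spec file might contain little more than one image reference per line; the exact format isn't spelled out here, and these entries are made up:)

docker.io/library/redis:7.2.3
docker.io/library/postgres:16.1
quay.io/prometheus/node-exporter:v1.7.0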

As there are no new layers introduced, the image sha is unchanged; however, there's no cache to avoid re-mirroring on multiple runs.

Pardon my ignorance about the docker tool, but I'm wondering if this is categorically true. F.ex. if the upstream image is an OCI image, does invoking a build step (however empty) keep that, or will it pack it into a docker/v2 image after? What if it's v1? Does Docker add any metadata purely from invoking the build step? etc. Not saying you're wrong, just curious.

I only have experience with Docker images, so I don't know about OCI ones, but it has worked well for us. (That is, using public Helm charts, extracting only the used images and adjusting the registry part of them in the charts, without having to change any image hashes at all.)

kaos avatar Jan 18 '24 15:01 kaos

Oh I see. Yes, that is a global config option. But you can give them names in the config dict and refer to them by name.

My point was that you've invented another way to do reusable named config (with the path being the name) and I'm not sure we need to?

Eh..? I'm not following. This is pretty much the same mechanism we have for python_requirements generating targets based on the lines of a requirements.txt file.
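
(For comparison, roughly what that existing pattern looks like in a BUILD file; the target name is arbitrary, and one python_requirement target is generated per line of the requirements file. The spec files here are meant to work the same way, one generated target per listed image.)

python_requirements(
    name="reqs",
    source="requirements.txt",
)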

I don't see what global config option this would map to? That's an extra step that feels quite superfluous..?

kaos avatar Jan 18 '24 15:01 kaos

Sorry, I guess I saw these as config and you saw them as sources. I'm fine with it as sources.

benjyw avatar Jan 18 '24 17:01 benjyw

Sorry, I guess I saw these as config and you saw them as sources. I'm fine with it as sources.

Ah yes, they're very much sources to me.

kaos avatar Jan 19 '24 08:01 kaos

FWIW I don't have any concerns with this implementation - I think it's a great feature to have. It looks like GCP (which I'm working with now) has support for pull-through and virtual repositories, but this would have been useful when working with ECR.

riisi avatar Jan 24 '24 22:01 riisi