
Manifest Unknown After Cleanup on Skipped Tag, Amd64 Arch only

Open corinz opened this issue 2 years ago • 15 comments

My container retention job meets expectations, except that every 7th day, when it cleans up my "app" container, I am unable to pull the amd64 image, though the arm64 image pulls fine. It seems like this cleanup job is deleting a tag which my protected tag depends on? Bizarre behavior, as the tags being deleted are totally unrelated, and deleting one tag shouldn't affect another. Any thoughts or insight here?

Thanks!

I get this error in my kube cluster every 7th day:

 Failed to pull image "ghcr.io/../app:development": rpc error: code = Unknown desc = manifest unknown

I can't replicate the error from an M1 machine (arm arch); the pull is successful. From an amd64 machine, I am able to replicate the "manifest unknown" error.

docker pull ghcr.io/../app:development
development: Pulling from ../app
manifest unknown

My retention policy is set to every 7 days, and the "development" tag should be skipped. The tag that was cleaned up in the logs was a truncated hash.

name: Delete old unused GHCR container images 
on:
  schedule:
    - cron: '0 0 * * *'  # every day at midnight
  workflow_dispatch:

jobs:
  clean-ghcr:
    name: Delete old unused GHCR container images
    runs-on: ubuntu-latest
    steps:
      - name: Delete containers older than a week, ignore tags
        uses: snok/container-retention-policy@v1
        with:
          image-names: app
          cut-off: A week ago UTC
          account-type: org
          org-name: my-org
          keep-at-least: 3
          untagged-only: false
          skip-tags: latest, v*, dev*, gamma, beta, 1*, 2*, 3*, 4*, 5*, 6*
          token: ${{ secrets.TOKEN }}

corinz avatar Aug 30 '22 15:08 corinz

That's less than ideal 🙂 The logs don't indicate that the development image itself is deleted, right? Can't say that I've encountered anything like this myself, unfortunately.

sondrelg avatar Aug 30 '22 20:08 sondrelg

@sondrelg thanks for your response.

No, the logs do not show that anything but the image with a sha tag has been deleted. Any ideas for debugging this?

corinz avatar Aug 30 '22 23:08 corinz

The action is really just a few calls to the GitHub API, so I think the best thing would be to authenticate locally, if you can, and then replicate the calls manually.

See:

Finally, here are the GitHub API docs: https://docs.github.com/en/rest/packages#get-a-package-version-for-an-organization

If you find any issues, a PR would be more than welcome 🙏

sondrelg avatar Aug 31 '22 07:08 sondrelg

@sondrelg Thanks for the info. What prevents the retention policy from deleting images that are depended on by multi-platform tagged versions?

If I understand correctly, the GitHub API returns a list of versions of a particular package, and these versions include untagged images that may be referenced in the manifest list of a multi-platform image. The retention policy may skip over a named tag, but still select for deletion an image that's referenced in that tag's manifest list. For example:

dev:sha:abc123 {    <-- manifest list, dev tag and sha:abc123 image skipped for deletion
  archA: sha:foo,   <-- eligible for deletion?
  archB: sha:bar
}

If the example above represented a multi-plat manifest list, it would be preserved because it's tagged with "dev", but what about sha:foo and sha:bar images?
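The scenario in this example can be modelled with a minimal sketch. It is not the action's actual code: the `Version` record and the `deletion_candidates` helper are hypothetical, it assumes skip patterns are matched with shell-style wildcards, and it ignores the cut-off date and `keep-at-least` entirely.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Version:
    """Hypothetical stand-in for one package version from a listing API."""
    digest: str
    tags: list = field(default_factory=list)


def deletion_candidates(versions, skip_patterns):
    """Return digests of versions that no skip pattern protects.

    A version is kept only if one of its own tags matches a pattern,
    so untagged versions are always candidates.
    """
    return [
        v.digest
        for v in versions
        if not any(fnmatch(tag, pat) for tag in v.tags for pat in skip_patterns)
    ]


versions = [
    Version("sha:abc123", ["dev"]),  # manifest list, tagged
    Version("sha:foo"),              # archA image, untagged
    Version("sha:bar"),              # archB image, untagged
]
candidates = deletion_candidates(versions, ["dev*"])
print(candidates)
```

Under these assumptions the tagged manifest list survives, while both per-architecture images remain eligible for deletion because nothing in the flat version list links them to the `dev` tag.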

corinz avatar Aug 31 '22 15:08 corinz

I've never really used multi-platform images, so it's very possible we need to add special handling for this case. If I understand you correctly, it sounds like taking manifest lists into consideration should be the default behavior. Currently no such behavior exists.

Do you have a real data example of what this looks like?

sondrelg avatar Aug 31 '22 20:08 sondrelg

@sondrelg Start with this Dockerfile

FROM alpine
RUN mkdir foobar

Execute a multi-platform build using Docker's Buildx builder:

docker buildx build --push --platform linux/arm64,linux/amd64 -t <YOUR_REPO_URL>/multi-arch-build .

Inspect the manifest

docker manifest inspect <YOUR_REPO_URL>/multi-arch-build

This will produce a result that looks like this

{
   "schemaVersion": 2,
   "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
   "manifests": [
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 735,
         "digest": "sha256:6619a5ea49cd7174ded29cf5f1c98c559be59edd862349fc3c6238eb6274d3f0",
         "platform": {
            "architecture": "arm64",
            "os": "linux"
         }
      },
      {
         "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
         "size": 735,
         "digest": "sha256:24c08606be10f8db18e7f463e80fd2dc55a411f10d7a0d0beceab4591e3a6441",
         "platform": {
            "architecture": "amd64",
            "os": "linux"
         }
      }
   ]
}

Notice the manifests array that includes 2 objects, one per arch. Each arch has its own container image referenced by the digest.

When we run this clean up job, we clean up those "child" images/digests because they are untagged. AFAIK, there is a simple solution to this. See this post (consider upvoting plz) https://github.com/docker/buildx/discussions/1301
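To make the parent/child relationship concrete, here is a small sketch that reads a manifest list like the one above and maps each platform to its child digest. The `child_digests` helper is a hypothetical name, and the embedded JSON is an abbreviated copy of the output shown in this comment.

```python
import json

# Abbreviated copy of the manifest list shown above (sizes and media types omitted).
MANIFEST_LIST = """
{
  "schemaVersion": 2,
  "manifests": [
    {"digest": "sha256:6619a5ea49cd7174ded29cf5f1c98c559be59edd862349fc3c6238eb6274d3f0",
     "platform": {"architecture": "arm64", "os": "linux"}},
    {"digest": "sha256:24c08606be10f8db18e7f463e80fd2dc55a411f10d7a0d0beceab4591e3a6441",
     "platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
"""


def child_digests(manifest_json):
    """Map "os/architecture" to the digest of each per-platform image.

    A single-platform manifest has no "manifests" array and yields {}.
    """
    manifest = json.loads(manifest_json)
    return {
        f'{m["platform"]["os"]}/{m["platform"]["architecture"]}': m["digest"]
        for m in manifest.get("manifests", [])
    }


digests = child_digests(MANIFEST_LIST)
for platform, digest in digests.items():
    print(platform, digest)
```

These are the digests a cleanup job would need to leave alone for the tagged image to stay pullable on every architecture.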

corinz avatar Aug 31 '22 21:08 corinz

Upvoted :+1: I won't be able to look at this in depth for a few days, but I'll do a deep dive as soon as I can, if still needed. Certainly seems like I have all the information I need. In the meantime, as mentioned, contributions are always welcome :slightly_smiling_face:

sondrelg avatar Aug 31 '22 21:08 sondrelg

thanks @sondrelg I'm going to track this down with GHCR, and will contribute if possible.

corinz avatar Aug 31 '22 22:08 corinz

@sondrelg Seems like github doesn't discriminate between a parent container or child container when using the Packages LIST API. What LIST fails to reveal is the graph/dependencies that exist behind the scenes in the container registry. Basically, to do a proper delete, the github api should be avoided, and the registry API should be used. See these api docs for what github is using behind the scenes to manage ghcr: https://github.com/distribution/distribution/blob/main/docs/spec/api.md#deleting-an-image
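As a rough illustration of what talking to the registry API directly could look like, the sketch below only builds a manifest request following the `/v2/<name>/manifests/<reference>` route from the linked spec and does not send it. The `manifest_request` helper, the `registry.example` host, and the bearer-token handling are assumptions for illustration; how GHCR expects the token to be obtained is not covered in this thread.

```python
from urllib.request import Request

# Media types that ask the registry for a multi-platform manifest list / index.
MANIFEST_LIST_TYPES = ", ".join([
    "application/vnd.docker.distribution.manifest.list.v2+json",
    "application/vnd.oci.image.index.v1+json",
])


def manifest_request(registry, name, reference, token=None):
    """Build (but do not send) a GET request for an image manifest."""
    req = Request(f"https://{registry}/v2/{name}/manifests/{reference}")
    req.add_header("Accept", MANIFEST_LIST_TYPES)
    if token is not None:
        # Assumption: the registry accepts a bearer token in this header.
        req.add_header("Authorization", f"Bearer {token}")
    return req


req = manifest_request("registry.example", "my-org/app", "development")
print(req.get_method(), req.full_url)
```

Fetching the manifest for each protected tag this way would expose the child digests that the Packages listing does not connect to it.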

corinz avatar Sep 01 '22 00:09 corinz

Sorry, I think I missed your last message. I saw the response in the buildx issue, and agree a switch to this API seems like the right choice :+1:

I'll be taking my holidays in a few days, so will have very limited capacity in the next 3 weeks. Are you free to work on this? If not, I guess we could create a new issue for this and get back to it when either one of us (or someone else) does have time :slightly_smiling_face:

sondrelg avatar Sep 14 '22 12:09 sondrelg

@sondrelg I won't have the personal time to do this for a while. But would be good to keep this issue in the backlog!

corinz avatar Oct 07 '22 02:10 corinz

Any news on this one? I just hit the same issue. We've disabled the second arch for the moment, but would like to use both in the future.

Eddman avatar Dec 06 '22 13:12 Eddman

Haven't looked at this since October, mostly since it doesn't affect me personally yet. It will as soon as GitHub Actions lets me build arm images on arm-runners 🙃

Would you be interested in implementing a fix @Eddman?

sondrelg avatar Dec 06 '22 13:12 sondrelg

A little question regarding the container registry API, it seems there is no API for listing all untagged manifests, right? So it still requires GitHub Packages API to list all the packages.

xfoxfu avatar Jul 18 '23 12:07 xfoxfu

This can be fixed by explicitly excluding untagged images that are referenced in the manifests of tagged images, similar to https://github.com/Chizkiyahu/delete-untagged-ghcr-action/blob/278ac5c5ae16914324ba447591af23312af6c075/clean_ghcr.py#L137-L138.
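One way to picture this exclusion, without claiming to mirror the linked script: collect every digest referenced by the manifest lists of the tagged images being kept, then drop those digests from the deletion candidates. The `protect_referenced` helper and the sample digests are hypothetical.

```python
def protect_referenced(candidates, manifest_lists):
    """Drop candidates whose digest is referenced by any kept manifest list.

    `manifest_lists` are parsed manifest documents (dicts); single-platform
    manifests without a "manifests" array contribute nothing.
    """
    referenced = {
        entry["digest"]
        for manifest in manifest_lists
        for entry in manifest.get("manifests", [])
    }
    return [digest for digest in candidates if digest not in referenced]


kept_manifest = {"manifests": [{"digest": "sha:foo"}, {"digest": "sha:bar"}]}
remaining = protect_referenced(["sha:foo", "sha:bar", "sha:old"], [kept_manifest])
print(remaining)
```

Only the untagged version that no kept manifest list points at is left for deletion.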

mering avatar Oct 31 '23 22:10 mering

I see @corinz, the description in https://github.com/snok/container-retention-policy/issues/43#issuecomment-1233436362 is really helpful. After looking at this for a little bit, I think this should work as a solution:

- name: Fetch SHAs for all associated multi-platform package versions
  id: multi-arch-digests
  run: |
    foo=$(docker manifest inspect ghcr.io/foo | jq -r '.manifests[] | .digest' | paste -s -d ',' -)
    bar=$(docker manifest inspect ghcr.io/bar | jq -r '.manifests[] | .digest' | paste -s -d ',' -)
    echo "multi-arch-digests=$foo,$bar" >> $GITHUB_OUTPUT

- uses: snok/container-retention-policy
  with:
    ...
    skip-shas: ${{ steps.multi-arch-digests.outputs.multi-arch-digests }}

This would mean implementing a new input for SHAs to avoid deleting, but that seems OK.

I want to release a v2 of the action soon where running a (much) smaller docker container is one of the main things I want to accomplish. Bundling the docker CLI in a container would be a bit of a nuisance, so I think this solution would solve things nicely, while keeping complexity low. Does anyone see any problems with it?

sondrelg avatar May 12 '24 19:05 sondrelg

The latest release adds a skip-shas input argument, which can be used to protect against deleting multi-platform images. Please see the new section in the readme for details, and let me know if anything is unclear.

The migration guide for v3 is included in the release post 👍

If you run into any issues, please share them in the issue opened for tracking the v3 release ☺️

sondrelg avatar Jun 24 '24 21:06 sondrelg