Switch to GHCR due to docker.io pull rate limits

Open juliusvonkohout opened this issue 10 months ago • 17 comments

Validation Checklist

  • [x] I confirm that this is a Kubeflow-related issue.
  • [x] I am reporting this in the appropriate repository.
  • [x] I have followed the Kubeflow installation guidelines.
  • [x] The issue report is detailed and includes version numbers where applicable.
  • [x] This issue pertains to Kubeflow development.
  • [x] I am available to work on this issue.
  • [x] You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Version

master

Detailed Description

Docker Hub seems to be ending unauthenticated pulls from March 1, 2025.

We probably need to migrate all docker images to GHCR as soon as possible, possibly before 1.10 final is cut.

https://docs.docker.com/docker-hub/usage/pulls/

Steps to Reproduce

Pull too many images

Screenshots or Videos (Optional)

No response

juliusvonkohout avatar Feb 23 '25 10:02 juliusvonkohout

  • [ ] KFP
  • [ ] Katib
  • [ ] Manifests/platform
  • [ ] Trainer
  • [ ] KServe
  • [ ] Model Registry

Some of them are already on GHCR according to the maintainers.

https://github.com/kubeflow/manifests/blob/master/hack/trivy_scan.py can give us all images; it runs on each commit to master.
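
For context, here is a minimal sketch of how such an image inventory can be assembled, assuming kustomize and PyYAML are available locally; this is an illustration, not the actual logic of hack/trivy_scan.py, and the directory name is a placeholder:

```python
# Minimal sketch: list every container image referenced by a kustomization.
# Assumes `kustomize` is on PATH and PyYAML is installed.
import subprocess
import yaml

def list_images(kustomization_dir: str) -> set[str]:
    manifests = subprocess.run(
        ["kustomize", "build", kustomization_dir],
        check=True, capture_output=True, text=True,
    ).stdout
    images: set[str] = set()
    for doc in yaml.safe_load_all(manifests):
        stack = [doc]
        while stack:
            node = stack.pop()
            if isinstance(node, dict):
                for key, value in node.items():
                    # Collect every "image:" field, wherever it is nested.
                    if key == "image" and isinstance(value, str):
                        images.add(value)
                    else:
                        stack.append(value)
            elif isinstance(node, list):
                stack.extend(node)
    return images

if __name__ == "__main__":
    for image in sorted(list_images("example/kustomize/overlay")):
        print(image)
```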

@rimolive @tarekabouzeid

Here is the list:

busybox:1.28
docker.io/istio/pilot:1.24.2
docker.io/istio/proxyv2:1.24.2
docker.io/kubeflow/model-registry-ui:v0.2.14
docker.io/kubeflowkatib/earlystopping-medianstop:v0.18.0-rc.0
docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:v0.18.0-rc.0
docker.io/kubeflowkatib/file-metrics-collector:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-controller:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-db-manager:v0.18.0-rc.0
docker.io/kubeflowkatib/katib-ui:v0.18.0-rc.0
docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-darts:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-enas:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-goptuna:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-hyperband:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-hyperopt:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-optuna:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-pbt:v0.18.0-rc.0
docker.io/kubeflowkatib/suggestion-skopt:v0.18.0-rc.0
docker.io/kubeflowkatib/tfevent-metrics-collector:v0.18.0-rc.0
docker.io/kubeflownotebookswg/centraldashboard:v1.10.0-rc.1
docker.io/kubeflownotebookswg/jupyter-web-app:v1.10.0-rc.1
docker.io/kubeflownotebookswg/kfam:v1.10.0-rc.1
docker.io/kubeflownotebookswg/notebook-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/poddefaults-webhook:v1.10.0-rc.1
docker.io/kubeflownotebookswg/profile-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/pvcviewer-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/tensorboard-controller:v1.10.0-rc.1
docker.io/kubeflownotebookswg/tensorboards-web-app:v1.10.0-rc.1
docker.io/kubeflownotebookswg/volumes-web-app:v1.10.0-rc.1
docker.io/seldonio/mlserver:1.5.0
gcr.io/cloudsql-docker/gce-proxy:1.25.0
gcr.io/knative-releases/knative.dev/net-istio/cmd/controller@sha256:e70bc675f97778da144157f125b3001124ba7a5903b85dab9e77776352fea1c7
gcr.io/knative-releases/knative.dev/net-istio/cmd/webhook@sha256:7d76a6d42d139ed53aae3ca2dfd600b1c776eb85a17af64dd1b604176a4b132a
gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:24c19cbee078925b91cd2e85082b581d53b218b410c083b1005dc06dc549b1d3
gcr.io/knative-releases/knative.dev/serving/cmd/autoscaler@sha256:5e9236452d89363957d4e7e249d57740a8fcd946aed23f8518d94962bf440250
gcr.io/knative-releases/knative.dev/serving/cmd/controller@sha256:5fb22b052e6bc98a1a6bbb68c0282ddb50744702acee6d83110302bc990666e9
gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:c61042001b1f21c5d06bdee9b42b5e4524e4370e09d4f46347226f06db29ba0f
gcr.io/knative-releases/knative.dev/serving/cmd/webhook@sha256:0fb5a4245aa4737d443658754464cd0a076de959fe14623fb9e9d31318ccce24
gcr.io/ml-pipeline/application-crd-controller:20231101
gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
gcr.io/ml-pipeline/mysql:8.0.26
gcr.io/ml-pipeline/workflow-controller:v3.4.17-license-compliance
gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/controller:v0.53.2@sha256:2cab05747826e7c32e2c588f0fefd354e03f643bd33dbe20533eada00562e6b1
gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/events:v0.53.2@sha256:0cf6f0be5319efdd8909ed8f987837d89146fd0632a744bf6d54bf83e5b13ca0
gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/resolvers:v0.53.2@sha256:6578d145acd9cd288e501023429439334de15de8bd77af132c57a1d5f982e940
gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/webhook:v0.53.2@sha256:1e8f8be3b51be378747b4589dde970582f50e1e69f59527f0a9aa7a75c5833e3
gcr.io/tfx-oss-public/ml_metadata_store_server:1.14.0
ghcr.io/dexidp/dex:v2.41.1
ghcr.io/kubeflow/kfp-api-server:2.4.0
ghcr.io/kubeflow/kfp-cache-deployer:2.4.0
ghcr.io/kubeflow/kfp-cache-server:2.4.0
ghcr.io/kubeflow/kfp-frontend:2.4.0
ghcr.io/kubeflow/kfp-inverse-proxy-agent:2.4.0
ghcr.io/kubeflow/kfp-metadata-envoy:2.4.0
ghcr.io/kubeflow/kfp-metadata-writer:2.4.0
ghcr.io/kubeflow/kfp-persistence-agent:2.4.0
ghcr.io/kubeflow/kfp-scheduled-workflow-controller:2.4.0
ghcr.io/kubeflow/kfp-viewer-crd-controller:2.4.0
ghcr.io/kubeflow/kfp-visualization-server:2.4.0
ghcr.io/metacontroller/metacontroller:v2.6.1
kserve/huggingfaceserver:v0.14.1
kserve/kserve-controller:v0.14.1
kserve/kserve-localmodel-controller:v0.14.1
kserve/kserve-localmodelnode-agent:v0.14.1
kserve/lgbserver:v0.14.1
kserve/models-web-app:v0.14.0-rc.0
kserve/paddleserver:v0.14.1
kserve/pmmlserver:v0.14.1
kserve/sklearnserver:v0.14.1
kserve/storage-initializer:v0.14.1
kserve/xgbserver:v0.14.1
kubeflow/model-registry-storage-initializer:latest
kubeflow/model-registry:v0.2.14
kubeflow/training-operator:v1-5170a36
mysql:8.0.29
mysql:8.0.3
mysql:8.0.39
nvcr.io/nvidia/tritonserver:23.05-py3
postgres:14.5-alpine
postgres:14.7-alpine3.17
python:3.9
pytorch/torchserve-kfs:0.9.0
quay.io/aipipeline/pipelineloop-controller:1.9.2
quay.io/aipipeline/pipelineloop-webhook:1.9.2
quay.io/aipipeline/tekton-exithandler-controller:2.0.5
quay.io/aipipeline/tekton-exithandler-webhook:2.0.5
quay.io/aipipeline/tekton-kfptask-controller:2.0.5
quay.io/aipipeline/tekton-kfptask-webhook:2.0.5
quay.io/brancz/kube-rbac-proxy:v0.13.1
quay.io/brancz/kube-rbac-proxy:v0.18.0
quay.io/brancz/kube-rbac-proxy:v0.8.0
tensorflow/serving:2.6.2

juliusvonkohout avatar Feb 23 '25 10:02 juliusvonkohout

I manually checked the manifests for OCI images hosted on Docker Hub that will probably break for many users in March (a small verification sketch follows the list):

  • [x] Istio: image: busybox:1.28 we should use registry.k8s.io/busybox as KFP does @juliusvonkohout
  • [x] Istio: docker.io/istio/proxyv2:1.24.2 and docker.io/istio/pilot:1.24.2 We can probably use https://console.cloud.google.com/artifacts/docker/istio-release and https://console.cloud.google.com/artifacts/docker/istio-release/us/gcr.io/pilot?inv=1&invt=Abqa_w via istioctl profile dump default --set global.hub=gcr.io/istio-release > profile.yaml in https://github.com/kubeflow/manifests/blob/master/common/istio-1-24/README.md and the CNI version @juliusvonkohout @tarekabouzeid @akagami-harsh
  • [x] python:3.9 for PPC should anyway be updated to 3.12 in KFP multitenancy https://github.com/kubeflow/pipelines/pull/11669 @juliusvonkohout @hbelmiro @HumairAK
  • [x] all kubeflownotebookswg/ images @thesuperzapper
  • [ ] tensorboard tensorflow/tensorflow:2.5.1 @thesuperzapper
  • [ ] VOLUME_VIEWER_IMAGE filebrowser/filebrowser:v2.25.0 @thesuperzapper
  • [ ] docker.io/kubeflowkatib/ images and mysql:8.0.29 @andreyvelich @Electronic-Waste You can use the image that KFP uses, gcr.io/ml-pipeline/mysql:8.0.26, and ask them to update it @hbelmiro @HumairAK
  • [ ] kubeflow/training-operator:v1-5170a36 @andreyvelich @Electronic-Waste
  • [ ] Spark @juliusvonkohout @vikas-saxena02
  • [ ] kserve kserve/ images and docker.io/seldonio/mlserver, tensorflow/serving, pytorch/torchserve-kfs @yuzisun @biswassri
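
As the boxes above get ticked, a quick way to verify that nothing still points at Docker Hub is to re-list the images and flag implicit docker.io references; a minimal sketch, with sample images taken from the list above:

```python
# Minimal sketch: flag image references that resolve to Docker Hub.
# Bare names like "busybox:1.28" or "kserve/..." implicitly mean docker.io.
def on_docker_hub(image: str) -> bool:
    first, _, rest = image.partition("/")
    if not rest:
        return True  # bare library image, e.g. busybox:1.28
    # A registry host contains a dot or a port; otherwise docker.io is implied.
    if "." in first or ":" in first:
        return first == "docker.io"
    return True  # e.g. kserve/sklearnserver:v0.14.1

for img in ["busybox:1.28", "docker.io/istio/pilot:1.24.2",
            "ghcr.io/kubeflow/kfp-frontend:2.4.0", "kserve/sklearnserver:v0.14.1"]:
    print(img, "->", "Docker Hub" if on_docker_hub(img) else "other registry")
```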

@kubeflow/kubeflow-steering-committee

juliusvonkohout avatar Feb 24 '25 10:02 juliusvonkohout

According to @thesuperzapper, they shifted the deadline to April 1.

juliusvonkohout avatar Feb 25 '25 07:02 juliusvonkohout

@juliusvonkohout here's the official doc https://docs.docker.com/docker-hub/usage/ in case this helps.

"Starting April 1, 2025, all users with a Pro, Team, or Business subscription will have unlimited Docker Hub pulls with fair use. Unauthenticated users and users with a free Personal account have the following pull limits:

Unauthenticated users: 10 pulls/hour Authenticated users with a free account: 100 pulls/hour"

varodrig avatar Feb 27 '25 03:02 varodrig

I recommend the ECR mirror as a replacement for the Docker library, because this way we regularly get security updates for base images such as public.ecr.aws/docker/library/python:3.12. All self-built images we can push to GHCR.

juliusvonkohout avatar Mar 03 '25 15:03 juliusvonkohout

See also https://gallery.ecr.aws/docker/
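
To make the suggestion concrete, the rewrite only affects Docker official (library) images; a minimal sketch of the mapping, with a hypothetical image list:

```python
# Minimal sketch: map Docker official images to the public ECR mirror.
# Illustration only; only bare library images are rewritten.
def to_ecr_mirror(image: str) -> str:
    ref = image.removeprefix("docker.io/").removeprefix("library/")
    if "/" in ref:
        return image  # namespaced or third-party registry image; leave as-is
    return f"public.ecr.aws/docker/library/{ref}"

for img in ["python:3.9", "mysql:8.0.29", "ghcr.io/dexidp/dex:v2.41.1"]:
    print(img, "->", to_ecr_mirror(img))
```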

juliusvonkohout avatar Mar 03 '25 17:03 juliusvonkohout

There is quite some progress in https://github.com/kubeflow/manifests/issues/3010#issuecomment-2677977953

juliusvonkohout avatar Mar 20 '25 12:03 juliusvonkohout

Work in progress for the spark-operator... I will be raising the PR there in a day or two.

vikas-saxena02 avatar Mar 21 '25 23:03 vikas-saxena02

Raised kubeflow/spark-operator#2483. The corresponding issue in the spark-operator repo is kubeflow/spark-operator#2480.

I have added a hold label to it, as I am waiting for the spark-operator maintainers to confirm the steps to test the change.

vikas-saxena02 avatar Mar 22 '25 07:03 vikas-saxena02

@juliusvonkohout spark-operator has been taken care of.

vikas-saxena02 avatar Mar 30 '25 10:03 vikas-saxena02

@juliusvonkohout spark-operator has been taken care of.

Do they have it in the latest release that we can synchronize? You can also do that with the scripts under /scripts as soon as there is a release.

juliusvonkohout avatar Mar 30 '25 18:03 juliusvonkohout

@juliusvonkohout spark-operator has been taken care of.

Do they have it in the latest release that we can synchronize? You can also do that with the scripts under /scripts as soon as there is a release.

@juliusvonkohout I will have to check with them.

vikas-saxena02 avatar Mar 31 '25 10:03 vikas-saxena02

To ALL COMPONENT MAINTAINERS: @andyatmiami created a script to mirror tags from docker.io to ghcr.io, which we used to migrate the old vX.X.X tags of all the kubeflow/kubeflow images.

I suggest that we use this script to mirror the release tags of other components too:

  • It ensures we have full history of the release tags on GHCR
  • It lets users who can't upgrade immediately (or need to use an old version) access the images on GHCR

PS: If people need it, I have a business Docker Hub account that I can use to avoid getting rate-limited while pulling the old tags, but I only have write access to the GHCR images that the Notebooks WG owns, so it might be hard for me to push.
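
For reference, the mirroring itself amounts to a per-tag copy between registries; a minimal sketch with skopeo (an illustration, not the actual script mentioned above; the repository and tags below are placeholders):

```python
# Minimal sketch: mirror release tags from docker.io to ghcr.io.
# Assumes `skopeo` is on PATH and you are already logged in to ghcr.io.
import subprocess

def mirror_tag(repo: str, tag: str) -> None:
    src = f"docker://docker.io/{repo}:{tag}"
    dst = f"docker://ghcr.io/{repo}:{tag}"
    # --all copies every architecture in a multi-arch manifest list.
    subprocess.run(["skopeo", "copy", "--all", src, dst], check=True)

for tag in ["v1.8.0", "v1.9.0"]:  # placeholder release tags
    mirror_tag("kubeflownotebookswg/jupyter-web-app", tag)
```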

thesuperzapper avatar Mar 31 '25 17:03 thesuperzapper

I recommend the ECR mirror as a replacement for the Docker library, because this way we regularly get security updates for base images such as public.ecr.aws/docker/library/python:3.12. All self-built images we can push to GHCR.

It is not recommended to use public.ecr.aws mirrors: due to AWS's special authentication method, third-party companies cannot deploy their own intranet cache based on AWS mirrors, which may lead to difficulties in maintaining subsequent Kubeflow versions. https://github.com/distribution/distribution/issues/4252 https://github.com/distribution/distribution/issues/4383

wnark avatar Jun 03 '25 09:06 wnark

I recommend the ECR mirror as a replacement for the Docker library, because this way we regularly get security updates for base images such as public.ecr.aws/docker/library/python:3.12. All self-built images we can push to GHCR.

It is not recommended to use public.ecr.aws mirrors: due to AWS's special authentication method, third-party companies cannot deploy their own intranet cache based on AWS mirrors, which may lead to difficulties in maintaining subsequent Kubeflow versions. distribution/distribution#4252 distribution/distribution#4383

Do you have an alternative?

juliusvonkohout avatar Jun 03 '25 10:06 juliusvonkohout

Current OCI images on Docker Hub:

busybox:1.28 (Maybe Istio or KFP)

mysql:8.0.29  (KFP + Katib)
mysql:8.3.0 (KFP + Katib)

postgres:14.5-alpine (KFP)
postgres:14.7-alpine3.17 (KFP)

pytorch/torchserve-kfs:0.9.0 (Kserve)
tensorflow/serving:2.6.2 (Kserve)
docker.io/seldonio/mlserver:1.5.0
kserve/huggingfaceserver:v0.15.0
kserve/huggingfaceserver:v0.15.0-gpu
kserve/kserve-controller:v0.15.0
kserve/kserve-localmodel-controller:v0.15.0
kserve/lgbserver:v0.15.0
kserve/paddleserver:v0.15.0
kserve/pmmlserver:v0.15.0
kserve/sklearnserver:v0.15.0
kserve/storage-initializer:v0.15.0
kserve/xgbserver:v0.15.0

@vikas-saxena02 @biswassri @terrytangyuan what is the status with KServe? Some of the kserve images also have massive CVEs (see https://github.com/kubeflow/manifests/actions/runs/15414880021/job/43375235285), which are probably relevant for KServe's graduation. The main offenders are nvcr.io/nvidia/tritonserver:23.05-py3 and kserve/huggingfaceserver:v0.15.0-gpu.

@vikas-saxena02 @biswassri can you check where busybox, postgres, and mysql come from upstream?

{
    "data": [
        {
            "image": "kserve/storage-initializer:v0.15.0",
            "severity_counts": {
                "LOW": 74,
                "MEDIUM": 33,
                "HIGH": 7,
                "CRITICAL": 2
            }
        },
        {
            "image": "kserve/paddleserver:v0.15.0",
            "severity_counts": {
                "LOW": 76,
                "MEDIUM": 33,
                "HIGH": 8,
                "CRITICAL": 2
            }
        },
        {
            "image": "kserve/huggingfaceserver:v0.15.0-gpu",
            "severity_counts": {
                "LOW": 168,
                "MEDIUM": 1202,
                "HIGH": 11,
                "CRITICAL": 4
            }
        },
        {
            "image": "pytorch/torchserve-kfs:0.9.0",
            "severity_counts": {
                "LOW": 233,
                "MEDIUM": 1683,
                "HIGH": 97,
                "CRITICAL": 8
            }
        },
        {
            "image": "ghcr.io/kserve/models-web-app:v0.14.0",
            "severity_counts": {
                "LOW": 74,
                "MEDIUM": 36,
                "HIGH": 5,
                "CRITICAL": 1
            }
        },
        {
            "image": "kserve/sklearnserver:v0.15.0",
            "severity_counts": {
                "LOW": 74,
                "MEDIUM": 33,
                "HIGH": 7,
                "CRITICAL": 2
            }
        },
        {
            "image": "docker.io/seldonio/mlserver:1.5.0",
            "severity_counts": {
                "LOW": 88,
                "MEDIUM": 123,
                "HIGH": 50,
                "CRITICAL": 2
            }
        },
        {
            "image": "nvcr.io/nvidia/tritonserver:23.05-py3",
            "severity_counts": {
                "LOW": 526,
                "MEDIUM": 3556,
                "HIGH": 123,
                "CRITICAL": 0
            }
        },
        {
            "image": "quay.io/brancz/kube-rbac-proxy:v0.18.0",
            "severity_counts": {
                "LOW": 1,
                "MEDIUM": 3,
                "HIGH": 1,
                "CRITICAL": 1
            }
        },
        {
            "image": "kserve/pmmlserver:v0.15.0",
            "severity_counts": {
                "LOW": 86,
                "MEDIUM": 90,
                "HIGH": 27,
                "CRITICAL": 2
            }
        },
        {
            "image": "kserve/xgbserver:v0.15.0",
            "severity_counts": {
                "LOW": 76,
                "MEDIUM": 33,
                "HIGH": 7,
                "CRITICAL": 2
            }
        },
        {
            "image": "kserve/lgbserver:v0.15.0",
            "severity_counts": {
                "LOW": 76,
                "MEDIUM": 33,
                "HIGH": 8,
                "CRITICAL": 2
            }
        },
        {
            "image": "tensorflow/serving:2.6.2",
            "severity_counts": {
                "LOW": 61,
                "MEDIUM": 40,
                "HIGH": 4,
                "CRITICAL": 0
            }
        },
        {
            "image": "kserve/huggingfaceserver:v0.15.0",
            "severity_counts": {
                "LOW": 141,
                "MEDIUM": 1198,
                "HIGH": 11,
                "CRITICAL": 4
            }
        },
        {
            "image": "kserve/kserve-controller:v0.15.0",
            "severity_counts": {
                "LOW": 0,
                "MEDIUM": 2,
                "HIGH": 1,
                "CRITICAL": 0
            }
        },
        {
            "image": "kserve/kserve-localmodel-controller:v0.15.0",
            "severity_counts": {
                "LOW": 0,
                "MEDIUM": 2,
                "HIGH": 0,
                "CRITICAL": 0
            }
        }
    ]
}
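
For anyone triaging the results above, a minimal sketch that ranks the images by critical and high counts (assuming the JSON is saved as scan.json; the filename is a placeholder):

```python
# Minimal sketch: rank scanned images by CRITICAL, then HIGH counts.
import json

with open("scan.json") as f:
    results = json.load(f)["data"]

ranked = sorted(
    results,
    key=lambda e: (e["severity_counts"]["CRITICAL"], e["severity_counts"]["HIGH"]),
    reverse=True,
)
for entry in ranked:
    c = entry["severity_counts"]
    print(f"{entry['image']}: {c['CRITICAL']} critical, {c['HIGH']} high")
```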

juliusvonkohout avatar Jun 03 '25 11:06 juliusvonkohout

what is the status with KServe? Some of the kserve images also have massive CVEs

@juliusvonkohout I have a PR with all the changes in, and it has one lgtm; just waiting for the team's final review. I'll take care of the CVEs in a separate PR for the affected images.

biswassri avatar Jun 12 '25 14:06 biswassri