
Unable to spawn PyTorchJob due to alpine image dependency of pytorch-operator

asahalyft opened this issue 3 years ago · 4 comments

Hi Team, I am trying to use the PyTorch operator to spawn distributed PyTorch jobs. The image referenced in https://github.com/kubeflow/pytorch-operator/blob/6293efc19503078953acf04df03a1204fd265e35/manifests/kustomization.yaml#L13 is 809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator. However, that repository is not accessible from inside our network, so I switched to gcr.io/kubeflow-images-public/pytorch-operator:latest instead.
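
For reference, the switch itself is just a kustomize images override on top of the existing manifests; a minimal sketch of the relevant part of manifests/kustomization.yaml (the layout here is illustrative, the repo's actual file contains more than this):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
images:
- name: 809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator
  newName: gcr.io/kubeflow-images-public/pytorch-operator
  newTag: latest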

I cloned this pytorch-operator repo and deployed the operator with kustomize build manifests/ | kubectl apply -f -, which generates the following YAML (I also customized the namespace).

apiVersion: v1
kind: Namespace
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorch-operator
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  versions:
  - name: v1
    served: true
    storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  - pytorchjobs/finalizers
  verbs:
  - '*'
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: pytorch-operator
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    kustomize.component: pytorch-operator
    name: pytorch-operator
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    kustomize.component: pytorch-operator
  name: pytorch-operator
  namespace: pytorch-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      kustomize.component: pytorch-operator
      name: pytorch-operator
  template:
    metadata:
      labels:
        kustomize.component: pytorch-operator
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: gcr.io/kubeflow-images-public/pytorch-operator:latest
        name: pytorch-operator
      serviceAccountName: pytorch-operator

I applied the above YAML and verified that the operator is running successfully:

$ kubectl get pods -n pytorch-operator
NAME                                READY   STATUS    RESTARTS   AGE
pytorch-operator-6746dbbc89-sv2qw   1/1     Running   0          100m
$ 
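
For completeness, one way to double-check that the running operator really picked up the gcr.io image (paths assume the deployment manifest above):

$ kubectl -n pytorch-operator get deployment pytorch-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}'

which should print gcr.io/kubeflow-images-public/pytorch-operator:latest given the deployment above.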

I then apply the following YAML to create a distributed PyTorchJob.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-nccl"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            lyft.com/ml-platform: ""  
        spec:
          containers:
            - name: pytorch
              image: "OUR_AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/lyftlearnhorovod:8678853078c35bf1d003761a070389ca535a5d03"
              command: 
                - python
              args: 
                - "/mnt/user-home/distributed-training-exploration/pytorchjob_distributed_mnist.py"
                - "--backend"
                - "nccl"
                - "--epochs"
                - "2"
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - mountPath: /mnt/user-home
                name: nfs
          volumes:
          - name: nfs
            persistentVolumeClaim:
              claimName: asaha
          tolerations: 
            - key: lyft.net/gpu
              operator: Equal
              value: dedicated
              effect: NoSchedule
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            lyft.com/ml-platform: ""  
        spec:
          containers:
            - name: pytorch
              image: "OUR_AWS_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/lyftlearnhorovod:8678853078c35bf1d003761a070389ca535a5d03"
              command: 
                - python
              args: 
                - "/mnt/user-home/distributed-training-exploration/pytorchjob_distributed_mnist.py"
                - "--backend"
                - "nccl"
                - "--epochs"
                - "2"
              env:
              - name: NCCL_DEBUG
                value: "INFO"
              - name: NCCL_SOCKET_IFNAME
                value: "eth0"
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
              - mountPath: /mnt/user-home
                name: nfs
          volumes:
          - name: nfs
            persistentVolumeClaim:
              claimName: asaha
          tolerations: 
            - key: lyft.net/gpu
              operator: Equal
              value: dedicated
              effect: NoSchedule
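
For the record, the exact commands I use to apply it and inspect the resulting objects (the file name is just what I saved the manifest above as; asaha is the namespace the job actually lands in, as the events below show):

$ kubectl -n asaha apply -f pytorch-dist-mnist-nccl.yaml
$ kubectl -n asaha get pytorchjobs
$ kubectl -n asaha describe pytorchjob pytorch-dist-mnist-nccl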

I see the worker pods failing with ImagePullBackOff while pulling "alpine:3.10", which resolves to OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine inside our cluster and fails because no alpine repository exists in that registry:

15m         Normal    BackOff                   pod/pytorch-dist-mnist-nccl-worker-0                   Back-off pulling image "alpine:3.10"
18m         Warning   Failed                    pod/pytorch-dist-mnist-nccl-worker-0                   Error: ImagePullBackOff
10s         Normal    Scheduled                 pod/pytorch-dist-mnist-nccl-worker-0                   Successfully assigned asaha/pytorch-dist-mnist-nccl-worker-0 to ip-10-44-108-79.ec2.internal
9s          Normal    Pulling                   pod/pytorch-dist-mnist-nccl-worker-0                   Pulling image "alpine:3.10"
8s          Warning   Failed                    pod/pytorch-dist-mnist-nccl-worker-0                   Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in <OUR_AWS_ACCOUNT>.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id '<OUR_AWS_ACCOUNT>'
8s          Warning   Failed                    pod/pytorch-dist-mnist-nccl-worker-0                   Error: ErrImagePull
7s          Normal    BackOff                   pod/pytorch-dist-mnist-nccl-worker-0                   Back-off pulling image "alpine:3.10"
20m         Normal    SuccessfulCreatePod       pytorchjob/pytorch-dist-mnist-nccl                     Created pod: pytorch-dist-mnist-nccl-master-0

Since the Docker images in my job spec are fully specified, why would it fail looking for alpine:3.10?
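
Nothing in my PyTorchJob spec references alpine, so to dig into where the alpine:3.10 reference comes from, one thing I can do is dump the worker pod's init containers. My working assumption is that the operator injects an init container into the worker pods (e.g. to wait for the master) that defaults to alpine:3.10, which our cluster then rewrites to the ECR mirror:

$ kubectl -n asaha get pod pytorch-dist-mnist-nccl-worker-0 \
    -o jsonpath='{range .spec.initContainers[*]}{.name}{" -> "}{.image}{"\n"}{end}'

If an init container shows up there with image alpine:3.10, the pull is coming from the operator itself rather than from anything in my job spec.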

asahalyft · Feb 11 '21 00:02