
[Bug] Image Pull Error but rayjob shows "initializing"

Open ByronHsu opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When a user specifies a bad Docker image, the failure is not reflected in the RayJob status even though the pods show ImagePullBackOff.

pod status

rayjob-sample-raycluster-s2p7k-head-7tvcr                 0/1     ErrImagePull            0          41m
rayjob-sample-raycluster-s2p7k-worker-small-group-k4fbs   0/1     Init:ImagePullBackOff   0          41m

ray job status

rayjob-sample                            Initializing        2024-03-13T20:27:15Z              42m

ray cluster status

rayjob-sample-raycluster-s2p7k               1                                     400m   0        0               42m
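
As a client-side workaround (a sketch only, not something KubeRay does today), the pull failure can be surfaced by listing the cluster's pods directly with the Kubernetes Python client. This assumes the `ray.io/cluster` label that KubeRay applies to the pods it creates; adjust the namespace and cluster name for your setup.

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def image_pull_failures(namespace: str, cluster_name: str):
    """Return (pod, container, reason) for containers stuck on an image pull."""
    failures = []
    pods = v1.list_namespaced_pod(
        namespace, label_selector=f"ray.io/cluster={cluster_name}"
    )
    for pod in pods.items:
        # Check both regular and init containers; either can be stuck pulling.
        statuses = (pod.status.container_statuses or []) + (
            pod.status.init_container_statuses or []
        )
        for cs in statuses:
            waiting = cs.state.waiting if cs.state else None
            if waiting and waiting.reason in ("ErrImagePull", "ImagePullBackOff"):
                failures.append((pod.metadata.name, cs.name, waiting.reason))
    return failures

print(image_pull_failures("default", "rayjob-sample-raycluster-s2p7k"))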

Reproduction script

Deploy the following RayJob:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10

  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false

  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.9.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams:
        dashboard-host: '0.0.0.0'
      #pod template
      template:
        spec:
          containers:
            - name: ray-head
              image: not-exist-image:0.0.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the number of pod replicas in this worker group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name; here it is called small-group, but it can also be a functional name
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: not-exist-image:0.0.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  submitterPodTemplate:
    spec:
      restartPolicy: Never
      containers:
        - name: my-custom-rayjob-submitter-pod
          image: not-exist-image:0.0.0
          # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
          # Specifying Command is not recommended.
          # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"]
          resources:
            limits:
              cpu: "1"
            requests:
              cpu: "200m"


######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |

    import ray
    import time

    time.sleep(3600)  # keep the script alive for an hour

    @ray.remote
    def f(x):
        print(f"ray task get {x}")
        return x * x

    ray.init()

    futures = [f.remote(i) for i in range(2)]
    l = ray.get(futures)
    print(l)

    # ray.shutdown()

Anything else

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

ByronHsu avatar Mar 13 '24 21:03 ByronHsu

This is beyond the scope of KubeRay. Users should ensure they are using the correct images. Users can also set activeDeadlineSeconds to prevent RayJobs from running indefinitely. Note that the status of a native K8s Job likewise carries no information about image pull errors.
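
For illustration, one way to apply that suggestion with the Kubernetes Python client (a sketch; `activeDeadlineSeconds` is a RayJob spec field, but whether a live object accepts the patch depends on your KubeRay version, and the field can just as well be set in the manifest):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Patch the RayJob so it fails instead of sitting in Initializing forever.
api.patch_namespaced_custom_object(
    group="ray.io",
    version="v1",
    namespace="default",
    plural="rayjobs",
    name="rayjob-sample",
    body={"spec": {"activeDeadlineSeconds": 600}},  # give up after 10 minutes
)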

kevin85421 avatar Mar 14 '24 00:03 kevin85421

Similarly, if `ray` is not found in the image, the status is stuck at "Initializing".

[screenshot: RayJob status stuck at "Initializing"]

ByronHsu avatar Mar 14 '24 21:03 ByronHsu

Discussed offline. This is beyond the scope of KubeRay. Feel free to reopen if you have further thoughts.

kevin85421 avatar Mar 25 '24 00:03 kevin85421