MPI example not working with volcano scheduler

Open yuyue9284 opened this issue 4 years ago • 4 comments

This example (https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml) stays pending forever when using the Volcano scheduler.

Seems related to this issue: https://github.com/volcano-sh/volcano/issues/461

yuyue9284 avatar Jul 30 '20 09:07 yuyue9284

@yuyue9284 Which Volcano version are you using? And could you post your mpi-operator deployment?

Volcano v1.0.0 changed the PodGroup CRD API group to volcano.sh, so you may need to use the v1 version of MPIJob, which supports that PodGroup version. You also need to set the --gang-scheduling flag, defined here.

carmark avatar Jul 31 '20 01:07 carmark

Hi @carmark, this is my deployment for mpi-operator:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpi-operator
  namespace: mpi-operator
  labels:
    app: mpi-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mpi-operator
  template:
    metadata:
      labels:
        app: mpi-operator
    spec:
      serviceAccountName: mpi-operator
      containers:
      - name: mpi-operator
        image: mpioperator/mpi-operator:latest
        args: [
          "-alsologtostderr",
          "--kubectl-delivery-image",
          "mpioperator/kubectl-delivery:latest",
          "-gang-scheduling",
          "volcano"
        ]
        imagePullPolicy: Always

I'm using mpi-operator v1alpha2 with volcano v0.3, thanks.

yuyue9284 avatar Jul 31 '20 05:07 yuyue9284

This is because the launcher is missing resource settings. Please check the code in Volcano: a few of the resources will be treated as insufficient.

divinerapier avatar Nov 18 '20 09:11 divinerapier
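
For readers following up on the last comment, here is a minimal sketch of what setting resources on the launcher could look like in a v1alpha2 MPIJob. The image name, replica counts, and the CPU/memory values are illustrative assumptions, not taken from the linked example, and the thread does not confirm that this alone resolves the pending state.

```yaml
# Sketch only: image, replica counts, and resource values are assumptions.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow-benchmarks
            image: example.com/your-image:tag   # placeholder image
            # Explicit requests/limits on the launcher, as suggested above.
            resources:
              requests:
                cpu: "1"
                memory: 1Gi
              limits:
                cpu: "1"
                memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow-benchmarks
            image: example.com/your-image:tag   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```

If the job still stays pending after this change, the API group mismatch between the Volcano version and the MPIJob version mentioned earlier in the thread may also be a factor.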