
MPIJob using hostNetwork error

Open · suluner opened this issue on Apr 14, 2020 · 2 comments

Hi all,

I want to use host networking when submitting an MPIJob to improve training performance. The YAML file is below:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  annotations:
    monitoring.netease.com/enable-grafana-dashboard: "true"
  generateName: test-mpijob
  generation: 2
  labels:
    fairing-deployer: mpijob
    fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
    kubeflow.netease.com/userid: huting3
  namespace: ai-test
spec:
  activeDeadlineSeconds: 3600
  backoffLimit: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          creationTimestamp: null
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /app/boot.py
            env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 998579896320m
              requests:
                cpu: "1"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          imagePullSecrets:
          - name: hubinnercneastp1neteasecomdeeplearningstaffk8sai01serviceneteasecom
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 6002216796160m
                nvidia.com/gpu: "1"
              requests:
                cpu: "4"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
  slotsPerWorker: 1

but got error as below:

--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

    Local host: pri2-ainode28
    PID:        30
--------------------------------------------------------------------------

What is the problem? Any hints would be appreciated.

suluner avatar Apr 14 '20 10:04 suluner

@suluner Could you please attach more logs? And what is your network environment: RoCE, IB, or something else? It would also help to provide the output of ip a.

carmark avatar Apr 15 '20 07:04 carmark

@carmark Thanks for your reply. My network environment is plain Ethernet. When I add the following parameters to the mpirun command, it works well.

"-mca",
"btl_tcp_if_include",
"bond0.1200"

bond0.1200 is the network interface on the host machine.
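
For reference, here is a minimal, abbreviated sketch of how those extra arguments fit into the Launcher command of the MPIJob spec above. The interface name bond0.1200 is specific to these hosts; btl_tcp_if_include also accepts an address range (for example 10.0.0.0/16) instead of an interface name.

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - name: mpi
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -mca
            - btl
            - ^openib
            # pin Open MPI's TCP BTL to the host interface the workers share
            - -mca
            - btl_tcp_if_include
            - bond0.1200
            - python
            - /app/boot.py
            # (remaining mpirun flags and container fields as in the original spec)

If the warning still appears after pinning the BTL, Open MPI's out-of-band channel can be restricted the same way with the oob_tcp_if_include MCA parameter.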

suluner avatar Apr 16 '20 01:04 suluner