mpi-operator
MPIJob using hostNetwork error
Hi all,
I want to use hostNetwork when submitting an MPIJob, to improve training performance. The YAML file is as follows:
```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  annotations:
    monitoring.netease.com/enable-grafana-dashboard: "true"
  generateName: test-mpijob
  generation: 2
  labels:
    fairing-deployer: mpijob
    fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
    kubeflow.netease.com/userid: huting3
  namespace: ai-test
spec:
  activeDeadlineSeconds: 3600
  backoffLimit: 1
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          creationTimestamp: null
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /app/boot.py
            env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 998579896320m
              requests:
                cpu: "1"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          imagePullSecrets:
          - name: hubinnercneastp1neteasecomdeeplearningstaffk8sai01serviceneteasecom
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            monitoring.netease.com/enable-grafana-dashboard: "true"
            sidecar.istio.io/inject: "false"
          labels:
            fairing-deployer: mpijob
            fairing-id: d7aaecf2-7e2e-11ea-8269-0a580ab29d87
            kubeflow.netease.com/userid: huting3
          name: fairing-deployer
        spec:
          hostNetwork: true
          dnsPolicy: ClusterFirstWithHostNet
          containers:
          - env:
            - name: FAIRING_RUNTIME
              value: "1"
            image: hub-inner.cn-east-p1.netease.com/deeplearning/fairing-job:8AB586D0
            name: mpi
            resources:
              limits:
                memory: 6002216796160m
                nvidia.com/gpu: "1"
              requests:
                cpu: "4"
            securityContext:
              runAsUser: 0
            volumeMounts:
            - mountPath: /data
              name: fairing-volume-data-huting3
            workingDir: /app/
          restartPolicy: Never
          volumes:
          - name: fairing-volume-data-huting3
            persistentVolumeClaim:
              claimName: data-huting3
  slotsPerWorker: 1
```
But I got the following error:
```
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.
This attempted connection will be ignored; your MPI job may or may not
continue properly.
  Local host: pri2-ainode28
  PID:        30
--------------------------------------------------------------------------
```
What is the problem? Any hints would be appreciated.
@suluner Could you please attach more logs? Also, what is your network environment: RoCE, InfiniBand, or something else? It would also help to provide the output of `ip a`.
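(Side note for readers: because the pods run with `hostNetwork: true`, they share the node's network namespace, so the interface list can be collected directly from a running pod. A minimal sketch, assuming the launcher pod is named `test-mpijob-launcher`; the pod name is illustrative:)

```sh
# With hostNetwork: true the pod shares the node's network namespace,
# so this lists the node's own interfaces (bonds, VLANs, bridges),
# not a pod-local veth pair. The pod name is a placeholder.
kubectl -n ai-test exec test-mpijob-launcher -- ip -brief addr
```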
@carmark Thanks for your reply. My network environment is plain Ethernet. When I add the following arguments to the mpirun command, it works well:

```yaml
- -mca
- btl_tcp_if_include
- bond0.1200
```

bond0.1200 is the network interface on the host machine.
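For context on why this helps: Open MPI's TCP BTL by default tries to use every interface that is up, and with host networking that includes container bridges and VLAN sub-interfaces whose addresses the remote ranks never advertised; a connection arriving over one of those is a common trigger for the "cannot find a corresponding process entry" warning above. Pinning the BTL to the real data interface removes the ambiguity. A sketch of the launcher command list with the fix folded in (same job spec as above; it assumes bond0.1200 exists on every node):

```yaml
command:
- mpirun
- --allow-run-as-root
- -np
- "2"
- -bind-to
- none
- -map-by
- slot
- -x
- NCCL_DEBUG=INFO
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
# Restrict the TCP BTL to the host's data interface so Open MPI does not
# accept or attempt connections over bridges/VLANs that peers cannot match.
- -mca
- btl_tcp_if_include
- bond0.1200
- python
- /app/boot.py
```

Note that `btl_tcp_if_include` also accepts CIDR subnets (e.g. `10.0.0.0/16`), which can be more portable than an interface name when nodes are not configured identically; `btl_tcp_if_exclude` is the complementary deny-list form. If NCCL shows a similar interface-selection problem, its analogous setting is the `NCCL_SOCKET_IFNAME` environment variable.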