mpi-operator
Performance difference executing mpirun from launcher vs. inside pod
Hello,
I'm seeing a 20% drop in performance when executing mpirun through the mpi-operator's launcher vs. running mpirun directly inside the pod. I'm very perplexed as to how this can happen and whether there are any mitigation strategies.
The model is running on 1 Worker with 8 GPUs (A100) on GCP.
MPIJob:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v2beta1","kind":"MPIJob","metadata":{"annotations":{},"name":"tensorflow-horovod-1","namespace":"default"},"spec":{"mpiReplicaSpecs":{"Launcher":{"replicas":1,"template":{"spec":{"containers":[{"command":["mpirun","--allow-run-as-root","-np","8","-bind-to","none","-map-by","slot","-x","NCCL_DEBUG=INFO","-x","LD_LIBRARY_PATH","-x","PATH","-mca","pml","ob1","-mca","btl","^openib","/app/bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv","--","hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml"],"env":[{"name":"USE_GCE_CHECK","value":"true"}],"image":"gcr.io/kkolli/2/horovod","name":"mpi-launcher"}],"nodeSelector":{"cloud.google.com/gke-preemptible":"false"},"serviceAccountName":"standard-sa","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Equal","value":"present"}]}}},"Worker":{"replicas":1,"template":{"spec":{"containers":[{"env":[{"name":"USE_GCE_CHECK","value":"true"}],"image":"gcr.io/kkolli/2/horovod","name":"mpi-worker","resources":{"limits":{"nvidia.com/gpu":8}}}],"nodeSelector":{"cloud.google.com/gke-preemptible":"false"},"serviceAccountName":"standard-sa","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Equal","value":"present"}]}}}},"runPolicy":{"cleanPodPolicy":"Running"},"slotsPerWorker":8}}
  creationTimestamp: "2023-01-18T02:56:05Z"
  generation: 1
  name: tensorflow-horovod-1
  namespace: default
  resourceVersion: "526405971"
  uid: 4696215c-7f79-4d1f-af07-336a0bfc51d6
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "8"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - /app/bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv
            - --
            - hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml
            env:
            - name: USE_GCE_CHECK
              value: "true"
            image: gcr.io/kkolli/2/horovod
            name: mpi-launcher
          nodeSelector:
            cloud.google.com/gke-preemptible: "false"
          serviceAccountName: standard-sa
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - env:
            - name: USE_GCE_CHECK
              value: "true"
            image: gcr.io/kkolli/2/horovod
            name: mpi-worker
            resources:
              limits:
                nvidia.com/gpu: 8
          nodeSelector:
            cloud.google.com/gke-preemptible: "false"
          serviceAccountName: standard-sa
          tolerations:
          - effect: NoSchedule
            key: nvidia.com/gpu
            operator: Equal
            value: present
  runPolicy:
    cleanPodPolicy: Running
  slotsPerWorker: 8
status:
  conditions:
  - lastTransitionTime: "2023-01-18T02:56:05Z"
    lastUpdateTime: "2023-01-18T02:56:05Z"
    message: MPIJob default/tensorflow-horovod-1 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2023-01-18T02:56:08Z"
    lastUpdateTime: "2023-01-18T02:56:08Z"
    message: MPIJob default/tensorflow-horovod-1 is running.
    reason: MPIJobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher:
      active: 1
    Worker:
      active: 1
  startTime: "2023-01-18T02:56:05Z"
mpirun command after exec'ing into a pod:
mpirun -np 8 --allow-run-as-root -H localhost:8 bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv -- hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml
Here is what I tested already:
- Verified that all pods are within the same zone and region; this had no effect on steps/second
- Removed all of the -mca configurations; this had no effect on steps/second
Is there something simple I'm missing that explains why the performance through the launcher is so degraded? 😄
What did you do to be able to run exec? Put a sleep in the launcher?
My guess would be that when you are going into the pod and running exec, you are doing so after all pods are ready. So you are not considering the time it takes all workers to become ready.
Also, have you tried the v2 controller?
Oh, another thing that could be happening (I'm not sure, because I'm not fully familiar with how the v1 controller configures the hostnames): is it possible that your exec experiment ran the job locally in the pod, as opposed to having it driven from the launcher but running each step in the worker?
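One way to check is to look at the host list the operator injects into the launcher pod; if I remember correctly, the launcher's mpirun uses that list to start the ranks on the worker pod(s), whereas your exec run with -H localhost:8 starts all ranks inside the pod you exec'd into. Something like this (the hostfile path is a guess and may differ between operator versions):

kubectl exec <launcher-pod> -- cat /etc/mpi/hostfile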
But TBH, using MPI with one worker doesn't make much sense. The whole point is to distribute work among multiple nodes. If your plan is to have a single node, you are better off creating a plain pod or using something like OpenMP :)
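For the 1x8 models, a plain Pod would look roughly like this (just a sketch that reuses the image, command, and scheduling bits from your MPIJob; the Pod name is made up):

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-horovod-single
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-preemptible: "false"
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Equal
    value: present
  serviceAccountName: standard-sa
  containers:
  - name: trainer
    image: gcr.io/kkolli/2/horovod
    command:
    - mpirun
    - --allow-run-as-root
    - -np
    - "8"
    - -H
    - localhost:8
    - /app/bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv
    - --
    - hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml
    env:
    - name: USE_GCE_CHECK
      value: "true"
    resources:
      limits:
        nvidia.com/gpu: 8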
What did you do to be able to run exec? Put a sleep in the launcher?
Similar idea: I modified the launcher spec so it's not schedulable and it just keeps retrying. Then, while that happens, I exec into the pod and run mpirun.
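Concretely, that amounts to giving the launcher a constraint no node satisfies, so its pod stays Pending while the worker pod runs; something along these lines (illustrative, not my exact spec):

nodeSelector:
  kubernetes.io/hostname: no-such-node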
My guess would be that when you are going into the pod and running exec, you are doing so after all pods are ready. So you are not considering the time it takes all workers to become ready.
When calculating the time difference, I'm not taking wall time into account. I'm actually looking at the difference in the model's steps per second between the two runs. When running from localhost, the model's steps per second are 20% higher.
Also, have you tried the v2 controller?
I have applied the operator from the v2beta1 directory, so I believe I'm using the latest one?
is it possible that your exec experiment ran the job locally in the pod, as opposed to having it driven from the launcher but running each step in the worker?
Yes, this is exactly what happens. When I exec into the pod and run it locally, it's all happening within the pod. When done through the launcher, are you saying each gradient update goes back to the launcher to be averaged over the wire? Is that fundamentally the bottleneck here? I assumed the launcher was just a watcher for stdout and wasn't acting like a parameter server.
But TBH, using MPI with one worker doesn't make much sense. The whole point is to distribute work among multiple nodes. If your plan is to have a single node, you are better off creating a plain pod or using something like OpenMP :)
That's true, but the models we are running are pretty diverse in their specs. Some just need a 1x8 (1 worker with 8 GPUs) while others need something like 16x8, so having a unified deployment scheme would've been nice.
However, I have done an 8x8 experiment as well, using GCP VMs in a pool. Comparing a static 8x8 group of VMs vs. the MPI Operator with 8x8 workers + 1 launcher, the latter is 6% slower. I'm assuming the difference is again explained by the workers trying to synchronize gradients with the launcher pod?
I have applied the operator from the v2beta1 directory.
apiVersion: kubeflow.org/v1
This confused me, but maybe it's just the client that you are using.
When done through the launcher, are you saying each gradient update goes back to the launcher to be averaged over the wire? Is that fundamentally the bottleneck here? I assumed the launcher was just a watcher for stdout and wasn't acting like a parameter server.
I'm not an expert on how MPI works, and it might depend on the MPI calls that you use, but there would be some over-the-wire comms.
However, I have done an 8x8 experiment as well, using GCP VMs in a pool. Comparing a static 8x8 group of VMs vs. the MPI Operator with 8x8 workers + 1 launcher, the latter is 6% slower. I'm assuming the difference is again explained by the workers trying to synchronize gradients with the launcher pod?
That is a more relevant point of reference, and it sounds much better than 20% :) We should focus on that one.
In general, k8s networking would add some overhead on top of regular networking. There are some tweaks you can do on nodes to improve the situation, but that's highly dependent on your cloud provider.
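One example on GKE is making sure the GPU node pool uses gVNIC, which gives you a higher-throughput virtual NIC between nodes. Something along these lines (names are placeholders and flags are from memory, so double-check against the GKE docs):

gcloud container node-pools create a100-pool \
  --cluster my-cluster \
  --machine-type a2-highgpu-8g \
  --accelerator type=nvidia-tesla-a100,count=8 \
  --enable-gvnic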
I'm not an expert on how MPI works, and it might depend on the MPI calls that you use, but there would be some over-the-wire comms.
Does this mean that, with the current mpi-operator, the launcher doesn't do anything except the typical MPI operations that keep track of workers? Meaning, the difference is entirely down to the MPI implementation and nothing is added on top here?
In general, k8s networking would add some overhead on top of regular networking. There are some tweaks you can do on nodes to improve the situation, but that's highly dependent on your cloud provider.
Could you provide some things to look out for? I'll check and try to make changes that are applicable to GCP. I'm still a little surprised that a 6% difference can occur when comparing a static list of hosts vs. a Kubernetes configuration with pods.
Meaning, the difference is entirely down to the MPI implementation and nothing is added on top here?
Correct, mpi-operator just uses the MPI implementation.
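You can verify this by inspecting the launcher pod the operator generated: it runs exactly the mpirun command from your spec, just wrapped with a host list and some launch configuration (paths and variable names vary by operator version):

kubectl get pod <launcher-pod> -o yaml
kubectl exec <launcher-pod> -- env | grep -i ompi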
Could you provide some things to look out for?
Mind contacting me over the k8s slack?
@kkolli @alculquicondor, could you confirm whether we have had the opportunity to evaluate or test this specific scenario? It's crucial to note that the operator should not be directly influencing performance in this context.
It shouldn't. It all depends on how you configure the networking on your nodes.