mpi-operator
Performance difference executing mpirun from launcher vs. inside pod
Hello,
I'm seeing a 20% drop in performance when executing mpirun through the mpi-operator's launcher vs. running mpirun directly inside the pod. I'm very perplexed as to how this can happen and whether there are any mitigation strategies.
The model is running on 1 Worker with 8 GPUs (A100) on GCP.
MPIJob:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v2beta1","kind":"MPIJob","metadata":{"annotations":{},"name":"tensorflow-horovod-1","namespace":"default"},"spec":{"mpiReplicaSpecs":{"Launcher":{"replicas":1,"template":{"spec":{"containers":[{"command":["mpirun","--allow-run-as-root","-np","8","-bind-to","none","-map-by","slot","-x","NCCL_DEBUG=INFO","-x","LD_LIBRARY_PATH","-x","PATH","-mca","pml","ob1","-mca","btl","^openib","/app/bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv","--","hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml"],"env":[{"name":"USE_GCE_CHECK","value":"true"}],"image":"gcr.io/kkolli/2/horovod","name":"mpi-launcher"}],"nodeSelector":{"cloud.google.com/gke-preemptible":"false"},"serviceAccountName":"standard-sa","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Equal","value":"present"}]}}},"Worker":{"replicas":1,"template":{"spec":{"containers":[{"env":[{"name":"USE_GCE_CHECK","value":"true"}],"image":"gcr.io/kkolli/2/horovod","name":"mpi-worker","resources":{"limits":{"nvidia.com/gpu":8}}}],"nodeSelector":{"cloud.google.com/gke-preemptible":"false"},"serviceAccountName":"standard-sa","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Equal","value":"present"}]}}}},"runPolicy":{"cleanPodPolicy":"Running"},"slotsPerWorker":8}}
  creationTimestamp: "2023-01-18T02:56:05Z"
  generation: 1
  name: tensorflow-horovod-1
  namespace: default
  resourceVersion: "526405971"
  uid: 4696215c-7f79-4d1f-af07-336a0bfc51d6
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "8"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - /app/bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv
            - --
            - hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml
            env:
            - name: USE_GCE_CHECK
              value: "true"
            image: gcr.io/kkolli/2/horovod
            name: mpi-launcher
          nodeSelector:
            cloud.google.com/gke-preemptible: "false"
          serviceAccountName: standard-sa
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - env:
            - name: USE_GCE_CHECK
              value: "true"
            image: gcr.io/kkolli/2/horovod
            name: mpi-worker
            resources:
              limits:
                nvidia.com/gpu: 8
          nodeSelector:
            cloud.google.com/gke-preemptible: "false"
          serviceAccountName: standard-sa
          tolerations:
          - effect: NoSchedule
            key: nvidia.com/gpu
            operator: Equal
            value: present
  runPolicy:
    cleanPodPolicy: Running
  slotsPerWorker: 8
status:
  conditions:
  - lastTransitionTime: "2023-01-18T02:56:05Z"
    lastUpdateTime: "2023-01-18T02:56:05Z"
    message: MPIJob default/tensorflow-horovod-1 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2023-01-18T02:56:08Z"
    lastUpdateTime: "2023-01-18T02:56:08Z"
    message: MPIJob default/tensorflow-horovod-1 is running.
    reason: MPIJobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher:
      active: 1
    Worker:
      active: 1
  startTime: "2023-01-18T02:56:05Z"
mpirun command after exec'ing into a pod:
mpirun -np 8 --allow-run-as-root -H localhost:8 bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv -- hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml
Here is what I tested already:
- Verified that all pods are within the same zone and region; this had no effect on steps/second
- Removed all of the -mca configurations; this had no effect on steps/second
Is there something simple I'm missing that explains why the performance through the launcher is so degraded? 😄
What did you do to be able to run exec? Put a sleep in the launcher?
My guess would be that when you are going into the pod and running exec, you are doing so after all pods are ready. So you are not considering the time it takes all workers to become ready.
Also, have you tried the v2 controller?
Oh, another thing that could be happening (I'm not sure, because I'm not fully familiar with how the v1 controller configures the hostnames): is it possible that your exec experiment ran the job locally in the pod, as opposed to having it driven from the launcher but running each step in the worker?
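One way to check is to look at the host list the operator injects into the launcher pod; if I remember correctly, the launcher's mpirun uses that list to start the ranks on the worker pod(s), whereas your exec run with -H localhost:8 starts all ranks inside the pod you exec'd into. Something like this (the hostfile path is a guess and may differ between operator versions):

kubectl exec <launcher-pod> -- cat /etc/mpi/hostfile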
But TBH, using MPI with one worker doesn't make much sense. The whole point is to distribute work among multiple nodes. If your plan is to have a single node, you are better off creating a plain pod or using something like OpenMP :)
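For the 1x8 models, a plain Pod would look roughly like this (just a sketch that reuses the image, command, and scheduling bits from your MPIJob; the Pod name is made up):

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow-horovod-single
spec:
  restartPolicy: Never
  nodeSelector:
    cloud.google.com/gke-preemptible: "false"
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Equal
    value: present
  serviceAccountName: standard-sa
  containers:
  - name: trainer
    image: gcr.io/kkolli/2/horovod
    command:
    - mpirun
    - --allow-run-as-root
    - -np
    - "8"
    - -H
    - localhost:8
    - /app/bazel-out/k8-opt/bin/learning/tensorcraft/model2d/model2d_keras_hv
    - --
    - hydra_config=learning/tensorcraft/model2d/keras_configs/local_2d_config.yaml
    env:
    - name: USE_GCE_CHECK
      value: "true"
    resources:
      limits:
        nvidia.com/gpu: 8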
What did you do to be able to run exec? Put a sleep in the launcher?
Similar idea: I modified the launcher spec so it's not schedulable and it just keeps retrying. Then, while that happens, I exec into the pod and run mpirun.
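Concretely, that amounts to giving the launcher a constraint no node satisfies, so its pod stays Pending while the worker pod runs; something along these lines (illustrative, not my exact spec):

nodeSelector:
  kubernetes.io/hostname: no-such-node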
My guess would be that when you are going into the pod and running exec, you are doing so after all pods are ready. So you are not considering the time it takes all workers to become ready.
When calculating the time difference, I'm not taking wall time into account. I'm actually looking at the difference in the model's steps per second between the two runs. When running from localhost, the model's steps per second are 20% higher.
Also, have you tried the v2 controller?
I have applied the operator from the v2beta1 directory, so I believe I'm using the latest one?
is it possible that your exec experiment ran the job locally in the pod, as opposed to having it driven from the launcher but running each step in the worker?
Yes, this is exactly what happens. When I exec into the pod and run it locally, it's all happening within the pod. When done through the launcher, are you saying each gradient update goes back to the launcher to be averaged over the wire? Is that fundamentally the bottleneck here? I assumed the launcher was just a watcher for stdout and wasn't acting like a parameter server.
But TBH, using MPI with one worker doesn't make much sense. The whole point is to distribute work among multiple nodes. If your plan is to have a single node, you are better off creating a plain pod or using something like OpenMP :)
That's true, but the models we are running are pretty diverse in their specs. Some just need a 1x8 (1 worker with 8 GPUs) while others need something like 16x8, so having a unified deployment scheme would've been nice.
However, I have done an 8x8 experiment as well, using GCP VMs in a pool. Comparing a static 8x8 group of VMs vs. the MPI Operator with 8x8 workers + 1 launcher, the latter is 6% slower. I'm assuming the difference is again explained by the workers trying to synchronize gradients with the launcher pod?
I have applied the operator from the v2beta1 directory.
apiVersion: kubeflow.org/v1
This confused me, but maybe it's just the client that you are using.
When done through the launcher, are you saying each gradient update goes back to the launcher to be averaged over the wire? Is that fundamentally the bottleneck here? I assumed the launcher was just a watcher for stdout and wasn't acting like a parameter server.
I'm not an expert on how MPI works, and it might depend on the MPI calls that you use, but there would be some over-the-wire comms.
However, I have done an 8x8 experiment as well, using GCP VMs in a pool. Comparing a static 8x8 group of VMs vs. the MPI Operator with 8x8 workers + 1 launcher, the latter is 6% slower. I'm assuming the difference is again explained by the workers trying to synchronize gradients with the launcher pod?
That is a more relevant point of reference, and it sounds much better than 20% :) We should focus on that one.
In general, k8s networking would add some overhead on top of regular networking. There are some tweaks you can do on nodes to improve the situation, but that's highly dependent on your cloud provider.
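One example on GKE is making sure the GPU node pool uses gVNIC, which gives you a higher-throughput virtual NIC between nodes. Something along these lines (names are placeholders and flags are from memory, so double-check against the GKE docs):

gcloud container node-pools create a100-pool \
  --cluster my-cluster \
  --machine-type a2-highgpu-8g \
  --accelerator type=nvidia-tesla-a100,count=8 \
  --enable-gvnic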
I'm not an expert on how MPI works, and it might depend on the MPI calls that you use, but there would be some over-the-wire comms.
Does this mean that, with the current mpi-operator, the launcher doesn't do anything except the typical MPI operations that keep track of workers? Meaning, the difference is entirely down to the MPI implementation and nothing is added on top here?
In general, k8s networking would add some overhead on top of regular networking. There are some tweaks you can do on nodes to improve the situation, but that's highly dependent on your cloud provider.
Could you provide some things to look out for? I'll check and try to make changes that are applicable to GCP. I'm still a little surprised that a 6% difference can occur when comparing a static list of hosts vs. a Kubernetes configuration with pods.
Meaning, the difference is entirely down to the MPI implementation and nothing is added on top here?
Correct, mpi-operator just uses the MPI implementation.
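You can verify this by inspecting the launcher pod the operator generated: it runs exactly the mpirun command from your spec, just wrapped with a host list and some launch configuration (paths and variable names vary by operator version):

kubectl get pod <launcher-pod> -o yaml
kubectl exec <launcher-pod> -- env | grep -i ompi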
Could you provide some things to look out for?
Mind contacting me over the k8s slack?
@kkolli @alculquicondor, could you confirm whether we have had the opportunity to evaluate or test this specific scenario? It's crucial to note that the operator should not be directly influencing performance in this context.
It shouldn't. It all depends on how you configure the networking on your nodes.