mpi-operator
TF2 job latches onto CPUs if both CPU and GPU are provided in the container resource requests/limits section
I have applied the MPIJob YAML below. When I run the workers with only the GPU specified in the resources section, the TF2 job runs fast, at about 3s per epoch. (The TF2 training script is included after the MPIJob YAML.) However, if I specify cpu: 1, nvidia.com/gpu: 1, and memory: 8Gi in the worker, the TF2 job starts using the CPU instead of the GPU and takes about 15s per epoch.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
name: horovod-asaha-t256jhci2pi
spec:
slotsPerWorker: 1
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
metadata:
annotations:
iam.amazonaws.com/role: arn:aws:iam::<XXX>:role/lyftlearn-production-iad
lyft.net/secret-inject: required
labels:
lyft.com/ml-platform: ''
environment: production
secretsIam: lyftlearn-production-iad
version: dummy
spec:
containers:
- name: horovod-asaha-t256jhci2pi-launcher
image: <XXX>/pythonlyftdistributed:lyftlearn.2827e895dcfca029efd72fe35e18fa7d6b18a311
command:
- mpirun
args:
- '-np'
- '2'
- '--allow-run-as-root'
- '-bind-to'
- none
- '-map-by'
- slot
- '-x'
- NCCL_DEBUG=INFO
- '-x'
- LD_LIBRARY_PATH
- '-x'
- PATH
- '-x'
- NCCL_SOCKET_IFNAME=eth0
- '-mca'
- pml
- ob1
- '-mca'
- btl
- ^openib
- python
- /mnt/user-home/distributed-training-exploration/tf2_keras_horovod_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
Worker:
replicas: 2
template:
metadata:
annotations:
iam.amazonaws.com/role: arn:aws:iam::173840052742:role/lyftlearn-production-iad
lyft.com/user-job-name: NOTEBOOK-1611821340742
lyft.net/secret-inject: required
labels:
lyft.com/ml-platform: ''
environment: production
secretsIam: lyftlearn-production-iad
version: dummy
spec:
containers:
- name: horovod-asaha-t256jhci2pi-worker
image: <XXX>/pythonlyftdistributed:lyftlearn.2827e895dcfca029efd72fe35e18fa7d6b18a311
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
The TF2 training script used (from the Horovod repo):
# Copyright 2019 Uber Technologies, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import tensorflow as tf
import horovod.tensorflow.keras as hvd
# Horovod: initialize Horovod.
hvd.init()
# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")
(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data(
path="mnist-%d.npz" % hvd.rank()
)
dataset = tf.data.Dataset.from_tensor_slices(
(
tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
tf.cast(mnist_labels, tf.int64),
)
)
dataset = dataset.repeat().shuffle(10000).batch(128)
mnist_model = tf.keras.Sequential(
[
tf.keras.layers.Conv2D(32, [3, 3], activation="relu"),
tf.keras.layers.Conv2D(64, [3, 3], activation="relu"),
tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Dropout(0.25),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
# Horovod: adjust learning rate based on number of GPUs.
scaled_lr = 0.001 * hvd.size()
opt = tf.optimizers.Adam(scaled_lr)
# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(
opt, backward_passes_per_step=1, average_aggregated_gradients=True
)
# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(
loss=tf.losses.SparseCategoricalCrossentropy(),
optimizer=opt,
metrics=["accuracy"],
experimental_run_tf_function=False,
)
callbacks = [
# Horovod: broadcast initial variable states from rank 0 to all other processes.
# This is necessary to ensure consistent initialization of all workers when
# training is started with random weights or restored from a checkpoint.
hvd.callbacks.BroadcastGlobalVariablesCallback(0),
# Horovod: average metrics among workers at the end of every epoch.
#
# Note: This callback must be in the list before the ReduceLROnPlateau,
# TensorBoard or other metrics-based callbacks.
hvd.callbacks.MetricAverageCallback(),
# Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
# accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
# the first three epochs. See https://arxiv.org/abs/1706.02677 for details.
hvd.callbacks.LearningRateWarmupCallback(
initial_lr=scaled_lr, warmup_epochs=3, verbose=1
),
]
# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
callbacks.append(tf.keras.callbacks.ModelCheckpoint("./checkpoint-{epoch}.h5"))
# Horovod: write logs on worker 0.
verbose = 1 if hvd.rank() == 0 else 0
# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
dataset,
steps_per_epoch=500 // hvd.size(),
callbacks=callbacks,
epochs=18,
verbose=verbose,
)
Is this an issue for Horovod or MPI Operator?
@tgaddair Hi Travis, would you have some thoughts/insights on this?
Hey @asahalyft, are you sure you're getting a GPU? Can you check the value of gpus = tf.config.experimental.list_physical_devices("GPU")?
Yes @tgaddair. When I specify only the GPU, e.g.
resources:
limits:
nvidia.com/gpu: 1
I am definitely getting the GPUs, and each epoch takes only 3 secs. I have checked nvidia-smi, which also shows the GPUs are in use.
But if I specify both CPU and GPU, then the CPU takes precedence, e.g.
resources:
limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
I see the same job running very slowly, and each epoch takes about 15 secs.
When you run nvidia-smi in the second case, does it also show GPU availability and utilization?
You mentioned that it is running slow, but did you check the value of tf.config.experimental.list_physical_devices("GPU") as I suggested? Is it empty when you specify a cpu, or does it still show the GPU device as available?
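For reference, a minimal check along those lines (a sketch of mine, assuming TF 2.x) could look like the following; set_log_device_placement prints the device each op lands on, which makes a silent fallback to CPU obvious:
import tensorflow as tf

# List the GPUs TensorFlow can see inside the container.
print("Visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))

# Log the device each op is placed on, to catch a silent fallback to CPU.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.matmul(a, a)  # the placement of this matmul is logged above
print("Result tensor lives on:", b.device)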
@tgaddair sure, let me do that experiment and come back.
Hi @tgaddair, I conducted two experiments. Here are the detailed experimental results:
TL;DR: GPUs are detected in both cases, and I tested the same TF2 script in both cases. However:
- When we specify only GPU in the resources section of the worker pods, nvidia-smi shows reasonable utilization of 50% for the GPU in each worker, and each epoch takes 3 secs to complete.
- When we specify both CPU and GPU in the resources section of the worker pods, nvidia-smi shows low utilization of <=27% for the GPU in each worker, and each epoch takes 7-8 secs to complete.
This doubling (or more) of the training time in the second case is consistent; I have seen it before and again a couple of times today.
Details: I have added the following to the TensorFlow script:
print("DEBUG GPUS INFO START=================================================")
print(tf.config.experimental.list_physical_devices('GPU'))
print("DEBUG GPUS INFO END=================================================")
Experiment 1: In this experiment the workers have only GPUs specified in their resources section:
# ------- Create GPU Distributed MPI Job ------ #
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
name: tf2-keras-mnist-mpi-gpu
labels:
lyft.com/ml-platform: ""
spec:
slotsPerWorker: 1
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-launcher
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
command:
- mpirun
args:
- "-np"
- "2"
- "--allow-run-as-root"
- "-bind-to"
- none
- "-map-by"
- slot
- "-x"
- NCCL_DEBUG=INFO
- "-x"
- LD_LIBRARY_PATH
- "-x"
- PATH
- "-x"
- NCCL_SOCKET_IFNAME=eth0
- "-mca"
- pml
- ob1
- "-mca"
- btl
- ^openib
- python
- /mnt/user-home/distributed-training-exploration/tf2_keras_horovod_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
Worker:
replicas: 2
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-worker
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
GPUs are detected, as the log shows:
2021-03-03 22:07:29.871363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
DEBUG GPUS INFO START=================================================
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
DEBUG GPUS INFO END=================================================
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
nvidia-smi shows higher utilization of 50%+
Worker 0:
(base) asaha-mbp151:exploration asaha$ kubectl exec -it tf2-keras-mnist-mpi-gpu-worker-0 -n asaha -- /bin/sh
# nvidia-smi
Wed Mar 3 22:19:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1A.0 Off | 0 |
| N/A 43C P0 104W / 300W | 1950MiB / 16160MiB | 51% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Worker 1:
(base) asaha-mbp151:exploration asaha$ kubectl exec -it tf2-keras-mnist-mpi-gpu-worker-1 -n asaha -- /bin/sh
# nvidia-smi
Wed Mar 3 22:20:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 43C P0 97W / 300W | 1950MiB / 16160MiB | 52% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Logs: 3 secs per epoch
tf2-keras-mnist-mpi-gpu-worker-1:30:160 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tf2-keras-mnist-mpi-gpu-worker-0:33:163 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tf2-keras-mnist-mpi-gpu-worker-1:30:160 [0] NCCL INFO comm 0x7f068c30fa80 rank 1 nranks 2 cudaDev 0 busId 1a0 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:33:163 [0] NCCL INFO comm 0x7f79a03131e0 rank 0 nranks 2 cudaDev 0 busId 180 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:33:163 [0] NCCL INFO Launch mode Parallel
250/250 [==============================] - 3s 13ms/step - loss: 0.2983 - accuracy: 0.9078
Epoch 2/24
250/250 [==============================] - 3s 12ms/step - loss: 0.0930 - accuracy: 0.9717
Epoch 3/24
250/250 [==============================] - ETA: 0s - loss: 0.0738 - accuracy: 0.9785
Epoch 3: finished gradual learning rate warmup to 0.002.
Epoch 3: finished gradual learning rate warmup to 0.002.
250/250 [==============================] - 3s 14ms/step - loss: 0.0738 - accuracy: 0.9785
Epoch 4/24
250/250 [==============================] - 3s 13ms/step - loss: 0.0575 - accuracy: 0.9828
Epoch 5/24
250/250 [==============================] - 3s 13ms/step - loss: 0.0501 - accuracy: 0.9844
Epoch 6/24
250/250 [==============================] - 4s 17ms/step - loss: 0.0432 - accuracy: 0.9870
Epoch 7/24
250/250 [==============================] - 3s 12ms/step - loss: 0.0385 - accuracy: 0.9874
Experiment 2: In this experiment the workers have both CPUs and GPUs specified in their resources section:
# ------- Create GPU Distributed MPI Job ------ #
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
name: tf2-keras-mnist-mpi-gpu
labels:
lyft.com/ml-platform: ""
spec:
slotsPerWorker: 1
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-launcher
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
command:
- mpirun
args:
- "-np"
- "2"
- "--allow-run-as-root"
- "-bind-to"
- none
- "-map-by"
- slot
- "-x"
- NCCL_DEBUG=INFO
- "-x"
- LD_LIBRARY_PATH
- "-x"
- PATH
- "-x"
- NCCL_SOCKET_IFNAME=eth0
- "-mca"
- pml
- ob1
- "-mca"
- btl
- ^openib
- python
- /mnt/user-home/distributed-training-exploration/tf2_keras_horovod_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
Worker:
replicas: 2
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-worker
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
resources:
limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
GPUs are detected, as the log shows:
DEBUG GPUS INFO START=================================================
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
DEBUG GPUS INFO END=================================================
nvidia-smi shows low utilization of <=27%
Worker 0:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 50C P0 88W / 300W | 1950MiB / 16160MiB | 27% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Worker 1:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:18.0 Off | 0 |
| N/A 46C P0 77W / 300W | 1950MiB / 16160MiB | 13% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Logs: 8 secs per epoch
tf2-keras-mnist-mpi-gpu-worker-1:31:161 [0] NCCL INFO comm 0x7efe3c30e950 rank 1 nranks 2 cudaDev 0 busId 180 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:32:162 [0] NCCL INFO comm 0x7f9cfc310200 rank 0 nranks 2 cudaDev 0 busId 1c0 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:32:162 [0] NCCL INFO Launch mode Parallel
250/250 [==============================] - 8s 31ms/step - loss: 0.3136 - accuracy: 0.9041
Epoch 2/24
250/250 [==============================] - 7s 28ms/step - loss: 0.0979 - accuracy: 0.9708
Epoch 3/24
249/250 [============================>.] - ETA: 0s - loss: 0.0702 - accuracy: 0.9786
Epoch 3: finished gradual learning rate warmup to 0.002.
Epoch 3: finished gradual learning rate warmup to 0.002.
250/250 [==============================] - 8s 32ms/step - loss: 0.0703 - accuracy: 0.9785
Epoch 4/24
250/250 [==============================] - 8s 30ms/step - loss: 0.0587 - accuracy: 0.9825
Epoch 5/24
250/250 [==============================] - 7s 29ms/step - loss: 0.0473 - accuracy: 0.9856
Epoch 6/24
250/250 [==============================] - 7s 28ms/step - loss: 0.0411 - accuracy: 0.9876
Epoch 7/24
250/250 [==============================] - 7s 28ms/step - loss: 0.0357 - accuracy: 0.9877
Epoch 8/24
Thanks for the detailed experiments, @asahalyft. My best guess, based on these results, is that Horovod is correctly running with GPU support enabled, but when you request cpu: 1 in your resource limits, you become bottlenecked by the CPU. Can you try increasing this to, say, 8, and see if it has an effect on performance?
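As a side note, one way to confirm the limit the worker is actually operating under is to read the container's CFS quota. The following is a rough sketch of mine (it assumes cgroup v1, which exposes the quota under /sys/fs/cgroup/cpu):
import os

def effective_cpus():
    # With a Kubernetes limit of cpu: 1, quota/period comes out to ~1.0,
    # even though os.cpu_count() still reports every core on the host.
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0:
            return quota / period
    except OSError:
        pass
    return os.cpu_count()  # no quota found; fall back to the host core count

print("host cores reported:", os.cpu_count())
print("effective CPU limit:", effective_cpus())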
Great suggestion, @tgaddair. I bumped the limit up to cpu: 8, and GPU usage is back to 50% per worker and the epoch time is back to 3s, as in case 1. Could you please share your intuition for how you came up with the number 8 for the CPUs, and which part of the training is being bottlenecked by the CPU?
Hey @asahalyft, TensorFlow internally uses multi-threading to process ops in parallel, and Horovod also maintains a separate thread for processing tensors. As such, if you only have a single core to work with, you're not going to be able to overlap as much computation with the communication. 8 was chosen somewhat arbitrarily, but should be a decent starting point.
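To illustrate the host-side work involved, here is a rough sketch (mine, not taken from the script above; it assumes TF 2.4+ for tf.data.AUTOTUNE). TensorFlow keeps two thread pools for running ops, and the tf.data input pipeline also runs on CPU threads, so with cpu: 1 all of this contends for a single core:
import tensorflow as tf

# Thread pools TensorFlow uses to run independent ops in parallel and to
# parallelize work inside individual ops; both default to the visible core count.
tf.config.threading.set_inter_op_parallelism_threads(8)
tf.config.threading.set_intra_op_parallelism_threads(8)

# The input pipeline (normalization, shuffling, batching, prefetching) is CPU work too.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1024, 28, 28]))
dataset = (
    dataset.map(lambda x: x / 255.0, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)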