mpi-operator
TF2 job latches onto CPUs if both CPU and GPU are provided in the container resource requests/limits section
I have applied the MPIJob YAML below. When I run the workers with only the GPU specified in the resources section, the TF2 job runs fast, at about 3s per epoch. (The TF2 training script is included after the MPIJob YAML.) However, if I specify cpu: 1, nvidia.com/gpu: 1, and memory: 8Gi in the worker, the TF2 job starts using the CPU instead of the GPU and takes about 15s per epoch.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
name: horovod-asaha-t256jhci2pi
spec:
slotsPerWorker: 1
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
metadata:
annotations:
iam.amazonaws.com/role: arn:aws:iam::<XXX>:role/lyftlearn-production-iad
lyft.net/secret-inject: required
labels:
lyft.com/ml-platform: ''
environment: production
secretsIam: lyftlearn-production-iad
version: dummy
spec:
containers:
- name: horovod-asaha-t256jhci2pi-launcher
image: <XXX>/pythonlyftdistributed:lyftlearn.2827e895dcfca029efd72fe35e18fa7d6b18a311
command:
- mpirun
args:
- '-np'
- '2'
- '--allow-run-as-root'
- '-bind-to'
- none
- '-map-by'
- slot
- '-x'
- NCCL_DEBUG=INFO
- '-x'
- LD_LIBRARY_PATH
- '-x'
- PATH
- '-x'
- NCCL_SOCKET_IFNAME=eth0
- '-mca'
- pml
- ob1
- '-mca'
- btl
- ^openib
- python
- /mnt/user-home/distributed-training-exploration/tf2_keras_horovod_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
Worker:
replicas: 2
template:
metadata:
annotations:
iam.amazonaws.com/role: arn:aws:iam::173840052742:role/lyftlearn-production-iad
lyft.com/user-job-name: NOTEBOOK-1611821340742
lyft.net/secret-inject: required
labels:
lyft.com/ml-platform: ''
environment: production
secretsIam: lyftlearn-production-iad
version: dummy
spec:
containers:
- name: horovod-asaha-t256jhci2pi-worker
image: <XXX>/pythonlyftdistributed:lyftlearn.2827e895dcfca029efd72fe35e18fa7d6b18a311
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
The TF2 training script used (from the Horovod repo):
# Copyright 2019 Uber Technologies, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import tensorflow as tf
import horovod.tensorflow.keras as hvd
# Horovod: initialize Horovod.
hvd.init()
# Horovod: pin GPU to be used to process local rank (one GPU per process)
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")
(mnist_images, mnist_labels), _ = tf.keras.datasets.mnist.load_data(
path="mnist-%d.npz" % hvd.rank()
)
dataset = tf.data.Dataset.from_tensor_slices(
(
tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
tf.cast(mnist_labels, tf.int64),
)
)
dataset = dataset.repeat().shuffle(10000).batch(128)
mnist_model = tf.keras.Sequential(
[
tf.keras.layers.Conv2D(32, [3, 3], activation="relu"),
tf.keras.layers.Conv2D(64, [3, 3], activation="relu"),
tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Dropout(0.25),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation="relu"),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(10, activation="softmax"),
]
)
# Horovod: adjust learning rate based on number of GPUs.
scaled_lr = 0.001 * hvd.size()
opt = tf.optimizers.Adam(scaled_lr)
# Horovod: add Horovod DistributedOptimizer.
opt = hvd.DistributedOptimizer(
opt, backward_passes_per_step=1, average_aggregated_gradients=True
)
# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
mnist_model.compile(
loss=tf.losses.SparseCategoricalCrossentropy(),
optimizer=opt,
metrics=["accuracy"],
experimental_run_tf_function=False,
)
callbacks = [
# Horovod: broadcast initial variable states from rank 0 to all other processes.
# This is necessary to ensure consistent initialization of all workers when
# training is started with random weights or restored from a checkpoint.
hvd.callbacks.BroadcastGlobalVariablesCallback(0),
# Horovod: average metrics among workers at the end of every epoch.
#
# Note: This callback must be in the list before the ReduceLROnPlateau,
# TensorBoard or other metrics-based callbacks.
hvd.callbacks.MetricAverageCallback(),
# Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
# accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
# the first three epochs. See https://arxiv.org/abs/1706.02677 for details.
hvd.callbacks.LearningRateWarmupCallback(
initial_lr=scaled_lr, warmup_epochs=3, verbose=1
),
]
# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
callbacks.append(tf.keras.callbacks.ModelCheckpoint("./checkpoint-{epoch}.h5"))
# Horovod: write logs on worker 0.
verbose = 1 if hvd.rank() == 0 else 0
# Train the model.
# Horovod: adjust number of steps based on number of GPUs.
mnist_model.fit(
dataset,
steps_per_epoch=500 // hvd.size(),
callbacks=callbacks,
epochs=18,
verbose=verbose,
)
Is this an issue for Horovod or MPI Operator?
@tgaddair Hi Travis, would you have some thoughts/insights on this?
Hey @asahalyft, are you sure you're getting a GPU? Can you check the value of gpus = tf.config.experimental.list_physical_devices("GPU")?
Yes @tgaddair. When I specify only the GPU, e.g.
resources:
limits:
nvidia.com/gpu: 1
I am definitely getting the GPUs, and each epoch takes only 3 secs. I have checked nvidia-smi, which also shows the GPUs are in use.
But if I specify both CPU and GPU, then the CPU takes precedence, e.g.
resources:
limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
I see the same job running very slowly, and each epoch takes about 15 secs.
When you run nvidia-smi in the second case, does it also show GPU availability and utilization?
You mentioned that it is running slow, but did you check the value of tf.config.experimental.list_physical_devices("GPU") as I suggested? Is it empty when you specify a cpu, or does it still show the GPU device as available?
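For reference, a minimal check along those lines (a sketch of mine, assuming TF 2.x) could look like the following; set_log_device_placement prints the device each op lands on, which makes a silent fallback to CPU obvious:
import tensorflow as tf

# List the GPUs TensorFlow can see inside the container.
print("Visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))

# Log the device each op is placed on, to catch a silent fallback to CPU.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.matmul(a, a)  # the placement of this matmul is logged above
print("Result tensor lives on:", b.device)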
@tgaddair sure, let me do that experiment and come back.
Hi @tgaddair, I conducted two experiments. Here are the detailed experimental results:
TL;DR: GPUs are detected in both cases, and I tested the same TF2 script in both cases. However:
- When we specify only GPU in the resources section of the worker pods, nvidia-smi shows reasonable utilization of 50% for the GPU in each worker, and each epoch takes 3 secs to complete.
- When we specify both CPU and GPU in the resources section of the worker pods, nvidia-smi shows low utilization of <=27% for the GPU in each worker, and each epoch takes 7-8 secs to complete.
This doubling (or more) of the training time in the second case is consistent; I have seen it before and again a couple of times today.
Details: I have added the following to the TensorFlow script:
print("DEBUG GPUS INFO START=================================================")
print(tf.config.experimental.list_physical_devices('GPU'))
print("DEBUG GPUS INFO END=================================================")
Experiment 1: In this experiment the workers have only GPUs specified in their resources section:
# ------- Create GPU Distributed MPI Job ------ #
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
name: tf2-keras-mnist-mpi-gpu
labels:
lyft.com/ml-platform: ""
spec:
slotsPerWorker: 1
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-launcher
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
command:
- mpirun
args:
- "-np"
- "2"
- "--allow-run-as-root"
- "-bind-to"
- none
- "-map-by"
- slot
- "-x"
- NCCL_DEBUG=INFO
- "-x"
- LD_LIBRARY_PATH
- "-x"
- PATH
- "-x"
- NCCL_SOCKET_IFNAME=eth0
- "-mca"
- pml
- ob1
- "-mca"
- btl
- ^openib
- python
- /mnt/user-home/distributed-training-exploration/tf2_keras_horovod_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
Worker:
replicas: 2
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-worker
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
resources:
limits:
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
GPUs are detected, as the log shows:
2021-03-03 22:07:29.871363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
DEBUG GPUS INFO START=================================================
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
DEBUG GPUS INFO END=================================================
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
nvidia-smi shows higher utilization of 50%+
Worker 0:
(base) asaha-mbp151:exploration asaha$ kubectl exec -it tf2-keras-mnist-mpi-gpu-worker-0 -n asaha -- /bin/sh
# nvidia-smi
Wed Mar 3 22:19:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1A.0 Off | 0 |
| N/A 43C P0 104W / 300W | 1950MiB / 16160MiB | 51% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Worker 1:
(base) asaha-mbp151:exploration asaha$ kubectl exec -it tf2-keras-mnist-mpi-gpu-worker-1 -n asaha -- /bin/sh
# nvidia-smi
Wed Mar 3 22:20:35 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 43C P0 97W / 300W | 1950MiB / 16160MiB | 52% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Logs: 3 secs per epoch
tf2-keras-mnist-mpi-gpu-worker-1:30:160 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tf2-keras-mnist-mpi-gpu-worker-0:33:163 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
tf2-keras-mnist-mpi-gpu-worker-1:30:160 [0] NCCL INFO comm 0x7f068c30fa80 rank 1 nranks 2 cudaDev 0 busId 1a0 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:33:163 [0] NCCL INFO comm 0x7f79a03131e0 rank 0 nranks 2 cudaDev 0 busId 180 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:33:163 [0] NCCL INFO Launch mode Parallel
250/250 [==============================] - 3s 13ms/step - loss: 0.2983 - accuracy: 0.9078
Epoch 2/24
250/250 [==============================] - 3s 12ms/step - loss: 0.0930 - accuracy: 0.9717
Epoch 3/24
250/250 [==============================] - ETA: 0s - loss: 0.0738 - accuracy: 0.9785
Epoch 3: finished gradual learning rate warmup to 0.002.
Epoch 3: finished gradual learning rate warmup to 0.002.
250/250 [==============================] - 3s 14ms/step - loss: 0.0738 - accuracy: 0.9785
Epoch 4/24
250/250 [==============================] - 3s 13ms/step - loss: 0.0575 - accuracy: 0.9828
Epoch 5/24
250/250 [==============================] - 3s 13ms/step - loss: 0.0501 - accuracy: 0.9844
Epoch 6/24
250/250 [==============================] - 4s 17ms/step - loss: 0.0432 - accuracy: 0.9870
Epoch 7/24
250/250 [==============================] - 3s 12ms/step - loss: 0.0385 - accuracy: 0.9874
Experiment 2: In this experiment the workers have both CPUs and GPUs specified in their resources section:
# ------- Create GPU Distributed MPI Job ------ #
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
name: tf2-keras-mnist-mpi-gpu
labels:
lyft.com/ml-platform: ""
spec:
slotsPerWorker: 1
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-launcher
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
command:
- mpirun
args:
- "-np"
- "2"
- "--allow-run-as-root"
- "-bind-to"
- none
- "-map-by"
- slot
- "-x"
- NCCL_DEBUG=INFO
- "-x"
- LD_LIBRARY_PATH
- "-x"
- PATH
- "-x"
- NCCL_SOCKET_IFNAME=eth0
- "-mca"
- pml
- ob1
- "-mca"
- btl
- ^openib
- python
- /mnt/user-home/distributed-training-exploration/tf2_keras_horovod_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
Worker:
replicas: 2
template:
metadata:
labels:
lyft.com/ml-platform: ""
spec:
containers:
- name: keras-mnist-mpi-worker
image: "XXXX/pythonlyftdistributed:lyftlearn.19b2b1cfb7cd917841d2f06fc575e43bdfb3ff9f"
resources:
limits:
cpu: 1
memory: 8Gi
nvidia.com/gpu: 1
volumeMounts:
- mountPath: /mnt/user-home
name: nfs
volumes:
- name: nfs
persistentVolumeClaim:
claimName: asaha
tolerations:
- key: lyft.net/gpu
operator: Equal
value: dedicated
effect: NoSchedule
GPUs are detected, as the log shows:
DEBUG GPUS INFO START=================================================
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
DEBUG GPUS INFO END=================================================
nvidia-smi shows low utilization of <=27%
Worker 0:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 50C P0 88W / 300W | 1950MiB / 16160MiB | 27% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Worker 1:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:18.0 Off | 0 |
| N/A 46C P0 77W / 300W | 1950MiB / 16160MiB | 13% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Logs: 8 secs per epoch
tf2-keras-mnist-mpi-gpu-worker-1:31:161 [0] NCCL INFO comm 0x7efe3c30e950 rank 1 nranks 2 cudaDev 0 busId 180 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:32:162 [0] NCCL INFO comm 0x7f9cfc310200 rank 0 nranks 2 cudaDev 0 busId 1c0 - Init COMPLETE
tf2-keras-mnist-mpi-gpu-worker-0:32:162 [0] NCCL INFO Launch mode Parallel
250/250 [==============================] - 8s 31ms/step - loss: 0.3136 - accuracy: 0.9041
Epoch 2/24
250/250 [==============================] - 7s 28ms/step - loss: 0.0979 - accuracy: 0.9708
Epoch 3/24
249/250 [============================>.] - ETA: 0s - loss: 0.0702 - accuracy: 0.9786
Epoch 3: finished gradual learning rate warmup to 0.002.
Epoch 3: finished gradual learning rate warmup to 0.002.
250/250 [==============================] - 8s 32ms/step - loss: 0.0703 - accuracy: 0.9785
Epoch 4/24
250/250 [==============================] - 8s 30ms/step - loss: 0.0587 - accuracy: 0.9825
Epoch 5/24
250/250 [==============================] - 7s 29ms/step - loss: 0.0473 - accuracy: 0.9856
Epoch 6/24
250/250 [==============================] - 7s 28ms/step - loss: 0.0411 - accuracy: 0.9876
Epoch 7/24
250/250 [==============================] - 7s 28ms/step - loss: 0.0357 - accuracy: 0.9877
Epoch 8/24
Thanks for the detailed experiments, @asahalyft. My best guess, based on these results, is that Horovod is correctly running with GPU support enabled, but when you request cpu: 1 in your resource limits, you become bottlenecked by the CPU. Can you try increasing this to, say, 8, and see if it has an effect on performance?
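As a side note, one way to confirm the limit the worker is actually operating under is to read the container's CFS quota. The following is a rough sketch of mine (it assumes cgroup v1, which exposes the quota under /sys/fs/cgroup/cpu):
import os

def effective_cpus():
    # With a Kubernetes limit of cpu: 1, quota/period comes out to ~1.0,
    # even though os.cpu_count() still reports every core on the host.
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0:
            return quota / period
    except OSError:
        pass
    return os.cpu_count()  # no quota found; fall back to the host core count

print("host cores reported:", os.cpu_count())
print("effective CPU limit:", effective_cpus())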
Great suggestion, @tgaddair. I bumped the limit up to cpu: 8, and GPU usage is back to 50% per worker and the epoch time is back to 3s, as in case 1. Could you please share your intuition for how you came up with the number 8 for the CPUs, and which part of the training is being bottlenecked by the CPU?
Hey @asahalyft, TensorFlow internally uses multi-threading to process ops in parallel, and Horovod also maintains a separate thread for processing tensors. As such, if you only have a single core to work with, you're not going to be able to overlap as much computation with the communication. 8 was chosen somewhat arbitrarily, but should be a decent starting point.
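To illustrate the host-side work involved, here is a rough sketch (mine, not taken from the script above; it assumes TF 2.4+ for tf.data.AUTOTUNE). TensorFlow keeps two thread pools for running ops, and the tf.data input pipeline also runs on CPU threads, so with cpu: 1 all of this contends for a single core:
import tensorflow as tf

# Thread pools TensorFlow uses to run independent ops in parallel and to
# parallelize work inside individual ops; both default to the visible core count.
tf.config.threading.set_inter_op_parallelism_threads(8)
tf.config.threading.set_intra_op_parallelism_threads(8)

# The input pipeline (normalization, shuffling, batching, prefetching) is CPU work too.
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1024, 28, 28]))
dataset = (
    dataset.map(lambda x: x / 255.0, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1024)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)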