Don't see any transfers on NVLINK with NCCL all_sum on p3.8xlarge
With the following code, nvidia-smi nvlink -g 0 -i 0 reports zero bytes transmitted/received.
The same happens if I kick off the benchmarks with --all_reduce_spec=nccl --variable_update=replicated.
import argparse

import tensorflow as tf
from tensorflow.contrib.nccl import all_sum

# args.dim comes from the command line; a minimal parser is included here so
# the snippet runs standalone (the default value is arbitrary).
parser = argparse.ArgumentParser()
parser.add_argument('--dim', type=int, default=4096)
args = parser.parse_args()

# One variable per GPU, so the all-reduce has to move data between devices.
with tf.device('/gpu:0'):
    a = tf.get_variable(
        "a", initializer=tf.constant(1.0, shape=(args.dim, args.dim)))
with tf.device('/gpu:1'):
    b = tf.get_variable(
        "b", initializer=tf.constant(2.0, shape=(args.dim, args.dim)))
with tf.device('/gpu:0'):
    summed_node = all_sum([a, b])

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True))
init = tf.global_variables_initializer()
sess.run(init)

with tf.device('/gpu:0'):
    summed = sess.run(summed_node)
My machine is an AWS p3.8xlarge instance, and my understanding is that this configuration supports NVLink.
The execution itself is fine, but when I check with nvidia-smi nvlink -g 0 -i 0 the link Tx/Rx counters are zero.
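In case it matters, here is how I read the counters around the run. This is just a minimal sketch that shells out to the same nvidia-smi nvlink -g 0 -i 0 command shown above; the read_nvlink_counters helper is only for this snippet, and it assumes sess and summed_node from the code above.

import subprocess

def read_nvlink_counters(gpu_index=0):
    # Same query as above: NVLink counter group 0 for the given GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "nvlink", "-g", "0", "-i", str(gpu_index)])
    return out.decode()

before = read_nvlink_counters(0)
summed = sess.run(summed_node)  # the all_sum built in the snippet above
after = read_nvlink_counters(0)
print("Tx/Rx counters before run:\n" + before)
print("Tx/Rx counters after run:\n" + after)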
Here's the relevant configuration info (topology and link status):
(tensorflow_p36) ubuntu@ip-172-31-22-42:~$ nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV1     NV1     NV2     0-31
GPU1    NV1      X      NV2     NV1     0-31
GPU2    NV1     NV2      X      NV2     0-31
GPU3    NV2     NV1     NV2      X      0-31
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
(tensorflow_p36) ubuntu@ip-172-31-22-42:~$ nvidia-smi nvlink --status -i 0
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-1a2670a5-1fdc-24df-2a79-ec6645f0d511)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
Link 2: 25.781 GB/s
Link 3: 25.781 GB/s