
~8X drop in speed when upgrading base image.

Open mkserge opened this issue 1 year ago • 1 comment

Hello,

I have a SageMaker training job that uses 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04 as its base image. With this image I observe processing speeds of approximately 2,340,707 samples/s.

Upgrading this image to anything else, for example 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker, results in an ~8X slowdown, with speeds dropping to approximately 330,473 samples/s.

If I build a virtual environment on an EC2 instance with the same version of PyTorch (1.9.1) as in the fast Docker image, I still get slow speeds, training at approximately 330K samples/second.

However, if I pull the first Docker image above onto the same instance and run the exact same training code from inside the container, I get fast speeds.

If I pull the second Docker image above onto the same instance and run the exact same code from inside that container, I get slow speeds. The only difference in the logs is some NCCL-related output (see below).

I can't figure out what is going on. What is the 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04 image doing that makes training so fast, and that I can't reproduce anywhere else?

For reference, I am training a simple one-hidden-layer NN (word2vec style) using PyTorch's DataParallel approach on a single ml.p3.16xlarge instance with 8 V100 GPUs, so all communication is intra-node.
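
For concreteness, a minimal sketch of the kind of model and DataParallel step I mean (the sizes below are placeholders, not my actual training code):

import torch
import torch.nn as nn

# Placeholder sizes; the real vocabulary/hidden dimensions are not part of this report.
vocab_size, hidden_dim, batch_size = 100_000, 128, 8192

# word2vec-style model: an embedding (the single hidden layer) plus an output projection.
model = nn.Sequential(
    nn.Embedding(vocab_size, hidden_dim),
    nn.Linear(hidden_dim, vocab_size),
)

# DataParallel replicates the model on every visible GPU and splits each batch across them.
model = nn.DataParallel(model.cuda())
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

inputs = torch.randint(0, vocab_size, (batch_size,), device="cuda")
targets = torch.randint(0, vocab_size, (batch_size,), device="cuda")

logits = model(inputs)             # forward pass is scattered across the 8 GPUs, outputs gathered on GPU 0
loss = criterion(logits, targets)
loss.backward()                    # gradients are reduced back onto the primary replica
optimizer.step()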

Any pointers on how to tackle this?
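
A quick way to compare the two containers is a snippet along these lines (nothing image-specific, just the versions that collective and kernel performance depend on, printed inside each image):

import torch

# Compare these between the fast (1.9.1) and the slow (1.13.1) container.
print("torch :", torch.__version__)
print("CUDA  :", torch.version.cuda)
print("cuDNN :", torch.backends.cudnn.version())
print("NCCL  :", torch.cuda.nccl.version())   # an int in older builds, a (major, minor, patch) tuple in newer ones
print("GPUs  :", torch.cuda.device_count(), torch.cuda.get_device_name(0))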

Here's the part of the log that is present in the fast container but not in the slow one.

005b3f0f299a:27:27 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>

005b3f0f299a:27:27 [0] ofi_init:1134 NCCL WARN NET/OFI Only EFA provider is supported
005b3f0f299a:27:27 [0] NCCL INFO NET/IB : No device found.
005b3f0f299a:27:27 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
005b3f0f299a:27:27 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
005b3f0f299a:27:146 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:146 [7] NCCL INFO Trees [0] 4/-1/-1->7->6|6->7->4/-1/-1 [1] 4/-1/-1->7->6|6->7->4/-1/-1 [2] 6/-1/-1->7->4|4->7->6/-1/-1 [3] 6/-1/-1->7->4|4->7->6/-1/-1 [4] 5/-1/-1->7->3|3->7->5/-1/-1 [5] 3/-1/-1->7->5|5->7->3/-1/-1 [6] 4/-1/-1->7->6|6->7->4/-1/-1 [7] 4/-1/-1->7->6|6->7->4/-1/-1 [8] 6/-1/-1->7->4|4->7->6/-1/-1 [9] 6/-1/-1->7->4|4->7->6/-1/-1 [10] 5/-1/-1->7->3|3->7->5/-1/-1 [11] 3/-1/-1->7->5|5->7->3/-1/-1
005b3f0f299a:27:139 [0] NCCL INFO Channel 00/12 :    0   3   2   1   5   6   7   4
005b3f0f299a:27:141 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:140 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:139 [0] NCCL INFO Channel 01/12 :    0   3   2   1   5   6   7   4
005b3f0f299a:27:140 [1] NCCL INFO Trees [0] 5/-1/-1->1->2|2->1->5/-1/-1 [1] 5/-1/-1->1->2|2->1->5/-1/-1 [2] 2/-1/-1->1->5|5->1->2/-1/-1 [3] 2/-1/-1->1->5|5->1->2/-1/-1 [4] 3/-1/-1->1->0|0->1->3/-1/-1 [5] -1/-1/-1->1->3|3->1->-1/-1/-1 [6] 5/-1/-1->1->2|2->1->5/-1/-1 [7] 5/-1/-1->1->2|2->1->5/-1/-1 [8] 2/-1/-1->1->5|5->1->2/-1/-1 [9] 2/-1/-1->1->5|5->1->2/-1/-1 [10] 3/-1/-1->1->0|0->1->3/-1/-1 [11] -1/-1/-1->1->3|3->1->-1/-1/-1
005b3f0f299a:27:139 [0] NCCL INFO Channel 02/12 :    0   4   7   6   5   1   2   3
005b3f0f299a:27:142 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:141 [2] NCCL INFO Trees [0] 1/-1/-1->2->3|3->2->1/-1/-1 [1] 1/-1/-1->2->3|3->2->1/-1/-1 [2] 3/-1/-1->2->1|1->2->3/-1/-1 [3] 3/-1/-1->2->1|1->2->3/-1/-1 [4] -1/-1/-1->2->6|6->2->-1/-1/-1 [5] 6/-1/-1->2->0|0->2->6/-1/-1 [6] 1/-1/-1->2->3|3->2->1/-1/-1 [7] 1/-1/-1->2->3|3->2->1/-1/-1 [8] 3/-1/-1->2->1|1->2->3/-1/-1 [9] 3/-1/-1->2->1|1->2->3/-1/-1 [10] -1/-1/-1->2->6|6->2->-1/-1/-1 [11] 6/-1/-1->2->0|0->2->6/-1/-1
005b3f0f299a:27:143 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:142 [3] NCCL INFO Trees [0] 2/-1/-1->3->0|0->3->2/-1/-1 [1] 2/-1/-1->3->0|0->3->2/-1/-1 [2] -1/-1/-1->3->2|2->3->-1/-1/-1 [3] -1/-1/-1->3->2|2->3->-1/-1/-1 [4] 7/-1/-1->3->1|1->3->7/-1/-1 [5] 1/-1/-1->3->7|7->3->1/-1/-1 [6] 2/-1/-1->3->0|0->3->2/-1/-1 [7] 2/-1/-1->3->0|0->3->2/-1/-1 [8] -1/-1/-1->3->2|2->3->-1/-1/-1 [9] -1/-1/-1->3->2|2->3->-1/-1/-1 [10] 7/-1/-1->3->1|1->3->7/-1/-1 [11] 1/-1/-1->3->7|7->3->1/-1/-1
005b3f0f299a:27:139 [0] NCCL INFO Channel 03/12 :    0   4   7   6   5   1   2   3
005b3f0f299a:27:139 [0] NCCL INFO Channel 04/12 :    0   1   3   7   5   4   6   2
005b3f0f299a:27:144 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:139 [0] NCCL INFO Channel 05/12 :    0   2   6   4   5   7   3   1
005b3f0f299a:27:144 [5] NCCL INFO Trees [0] 6/-1/-1->5->1|1->5->6/-1/-1 [1] 6/-1/-1->5->1|1->5->6/-1/-1 [2] 1/-1/-1->5->6|6->5->1/-1/-1 [3] 1/-1/-1->5->6|6->5->1/-1/-1 [4] 4/-1/-1->5->7|7->5->4/-1/-1 [5] 7/-1/-1->5->4|4->5->7/-1/-1 [6] 6/-1/-1->5->1|1->5->6/-1/-1 [7] 6/-1/-1->5->1|1->5->6/-1/-1 [8] 1/-1/-1->5->6|6->5->1/-1/-1 [9] 1/-1/-1->5->6|6->5->1/-1/-1 [10] 4/-1/-1->5->7|7->5->4/-1/-1 [11] 7/-1/-1->5->4|4->5->7/-1/-1
005b3f0f299a:27:139 [0] NCCL INFO Channel 06/12 :    0   3   2   1   5   6   7   4
005b3f0f299a:27:139 [0] NCCL INFO Channel 07/12 :    0   3   2   1   5   6   7   4
005b3f0f299a:27:139 [0] NCCL INFO Channel 08/12 :    0   4   7   6   5   1   2   3
005b3f0f299a:27:139 [0] NCCL INFO Channel 09/12 :    0   4   7   6   5   1   2   3
005b3f0f299a:27:139 [0] NCCL INFO Channel 10/12 :    0   1   3   7   5   4   6   2
005b3f0f299a:27:143 [4] NCCL INFO Trees [0] -1/-1/-1->4->7|7->4->-1/-1/-1 [1] -1/-1/-1->4->7|7->4->-1/-1/-1 [2] 7/-1/-1->4->0|0->4->7/-1/-1 [3] 7/-1/-1->4->0|0->4->7/-1/-1 [4] 6/-1/-1->4->5|5->4->6/-1/-1 [5] 5/-1/-1->4->6|6->4->5/-1/-1 [6] -1/-1/-1->4->7|7->4->-1/-1/-1 [7] -1/-1/-1->4->7|7->4->-1/-1/-1 [8] 7/-1/-1->4->0|0->4->7/-1/-1 [9] 7/-1/-1->4->0|0->4->7/-1/-1 [10] 6/-1/-1->4->5|5->4->6/-1/-1 [11] 5/-1/-1->4->6|6->4->5/-1/-1
005b3f0f299a:27:139 [0] NCCL INFO Channel 11/12 :    0   2   6   4   5   7   3   1
005b3f0f299a:27:145 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:145 [6] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1 [2] 5/-1/-1->6->7|7->6->5/-1/-1 [3] 5/-1/-1->6->7|7->6->5/-1/-1 [4] 2/-1/-1->6->4|4->6->2/-1/-1 [5] 4/-1/-1->6->2|2->6->4/-1/-1 [6] 7/-1/-1->6->5|5->6->7/-1/-1 [7] 7/-1/-1->6->5|5->6->7/-1/-1 [8] 5/-1/-1->6->7|7->6->5/-1/-1 [9] 5/-1/-1->6->7|7->6->5/-1/-1 [10] 2/-1/-1->6->4|4->6->2/-1/-1 [11] 4/-1/-1->6->2|2->6->4/-1/-1
005b3f0f299a:27:139 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
005b3f0f299a:27:139 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1|-1->0->3/-1/-1 [1] 3/-1/-1->0->-1|-1->0->3/-1/-1 [2] 4/-1/-1->0->-1|-1->0->4/-1/-1 [3] 4/-1/-1->0->-1|-1->0->4/-1/-1 [4] 1/-1/-1->0->-1|-1->0->1/-1/-1 [5] 2/-1/-1->0->-1|-1->0->2/-1/-1 [6] 3/-1/-1->0->-1|-1->0->3/-1/-1 [7] 3/-1/-1->0->-1|-1->0->3/-1/-1 [8] 4/-1/-1->0->-1|-1->0->4/-1/-1 [9] 4/-1/-1->0->-1|-1->0->4/-1/-1 [10] 1/-1/-1->0->-1|-1->0->1/-1/-1 [11] 2/-1/-1->0->-1|-1->0->2/-1/-1
005b3f0f299a:27:146 [7] NCCL INFO Channel 00 : 7[1e0] -> 4[1b0] via P2P/direct pointer
[OMITTING BUNCH FOR BREVITY]
005b3f0f299a:27:141 [2] NCCL INFO Channel 11 : 2[190] -> 0[170] via P2P/direct pointer
005b3f0f299a:27:142 [3] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:143 [4] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:146 [7] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:140 [1] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:144 [5] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:141 [2] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:145 [6] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:139 [0] NCCL INFO 12 coll channels, 16 p2p channels, 2 p2p channels per peer
005b3f0f299a:27:144 [5] NCCL INFO comm 0x7efd8c002e40 rank 5 nranks 8 cudaDev 5 busId 1c0 - Init COMPLETE
005b3f0f299a:27:143 [4] NCCL INFO comm 0x7efd98002e40 rank 4 nranks 8 cudaDev 4 busId 1b0 - Init COMPLETE
005b3f0f299a:27:141 [2] NCCL INFO comm 0x7efda0002e40 rank 2 nranks 8 cudaDev 2 busId 190 - Init COMPLETE
005b3f0f299a:27:140 [1] NCCL INFO comm 0x7efd9c002e40 rank 1 nranks 8 cudaDev 1 busId 180 - Init COMPLETE
005b3f0f299a:27:142 [3] NCCL INFO comm 0x7efd94002e40 rank 3 nranks 8 cudaDev 3 busId 1a0 - Init COMPLETE
005b3f0f299a:27:139 [0] NCCL INFO comm 0x7efda8002e40 rank 0 nranks 8 cudaDev 0 busId 170 - Init COMPLETE
005b3f0f299a:27:146 [7] NCCL INFO comm 0x7efd84002e40 rank 7 nranks 8 cudaDev 7 busId 1e0 - Init COMPLETE
005b3f0f299a:27:145 [6] NCCL INFO comm 0x7efd90002e40 rank 6 nranks 8 cudaDev 6 busId 1d0 - Init COMPLETE
005b3f0f299a:27:27 [0] NCCL INFO Launch mode Group/CGMD
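
Since the fast container's log shows every channel going over "P2P/direct pointer", one thing worth checking in each container is whether CUDA still reports peer access between the GPUs, something along these lines (NCCL_DEBUG=INFO is the variable that produces the init log above and must already be in the environment before the first collective):

import os
import torch

# NCCL reads NCCL_DEBUG at init time; setting it before the first collective call
# reproduces the kind of init log shown above in either container.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# The fast container reports "via P2P/direct pointer" for every channel, so check
# whether CUDA still reports peer access between each pair of the 8 GPUs.
n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n) if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU {i} can access peers: {peers}")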

mkserge · Mar 08 '23 21:03

Did you figure this out? I am seeing slower speeds after upgrading as well, and perhaps it is other deps that were upgraded too (MMDetection 2.x -> 3.3), but I happen to be going from the 1.9.1 Docker image to the latest (2.1)...

matthost · Feb 23 '24 23:02