PiPPy
ResNet example always underfits when training with pippy
I'm experimenting with the pipelined ResNet training example (pippy_resnet.py) from https://github.com/pytorch/PiPPy/tree/main/examples/resnet. Specifically, I want to compare the loss when running locally on a single GPU versus running with pippy. I added some basic WandB monitoring to both runs.
When running locally, everything is fine: the loss clearly drops and the accuracy increases with each epoch. But as soon as I switch to the pippy example, everything changes dramatically: the model simply does not train, the loss does not fall, and the accuracy stays around 0.1-0.2.
I would be very grateful if someone could explain why this is happening or what I'm doing wrong.
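For context, the monitoring is very basic; below is a minimal sketch of what I log each epoch. run_epoch is a hypothetical stand-in for the actual train/eval loop in pippy_resnet.py, and the project and metric names are illustrative, not the exact code in my script:

import wandb

# Minimal sketch of the per-epoch logging; run_epoch stands in for the real
# train/eval loop in pippy_resnet.py and just returns (mean loss, accuracy).
def run_epoch(split, train):
    # ... forward/backward over the loader for this split ...
    return 0.0, 0.0

wandb.init(project="pippy-resnet", name="local-1gpu")  # or e.g. "pippy-2x4xT4"

for epoch in range(1, 4):
    train_loss, train_acc = run_epoch("train", train=True)
    valid_loss, valid_acc = run_epoch("valid", train=False)
    wandb.log(
        {"train/loss": train_loss, "train/accuracy": train_acc,
         "valid/loss": valid_loss, "valid/accuracy": valid_acc},
        step=epoch,
    )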
A snippet of the terminal logs from local training:
Using device: cuda
Files already downloaded and verified
Epoch: 1
100%|█████████████████████████████████████| 500/500 [02:36<00:00, 3.20it/s]
Loader: train. Accuracy: 0.44814
100%|█████████████████████████████████████| 100/100 [00:10<00:00, 9.73it/s]
Loader: valid. Accuracy: 0.5681
Epoch: 2
100%|█████████████████████████████████████| 500/500 [02:36<00:00, 3.19it/s]
Loader: train. Accuracy: 0.61832
100%|█████████████████████████████████████| 100/100 [00:10<00:00, 9.20it/s]
Loader: valid. Accuracy: 0.622
Epoch: 3
100%|█████████████████████████████████████| 500/500 [02:44<00:00, 3.04it/s]
Loader: train. Accuracy: 0.69844
100%|█████████████████████████████████████| 100/100 [00:10<00:00, 9.20it/s]
Loader: valid. Accuracy: 0.6618
A snippet of the terminal logs from pippy training:
Epoch: 1
100%|███████████████████████████████████| 1250/1250 [02:34<00:00, 8.07it/s]
Loader: train. Accuracy: 0.1428
100%|█████████████████████████████████████| 250/250 [00:14<00:00, 17.49it/s]
Loader: valid. Accuracy: 0.106
Epoch: 2
100%|███████████████████████████████████| 1250/1250 [02:50<00:00, 7.32it/s]
Loader: train. Accuracy: 0.14936
100%|█████████████████████████████████████| 250/250 [00:19<00:00, 13.12it/s]
Loader: valid. Accuracy: 0.1333
Epoch: 3
100%|███████████████████████████████████| 1250/1250 [03:03<00:00, 6.83it/s]
Loader: train. Accuracy: 0.13552
100%|█████████████████████████████████████| 250/250 [00:19<00:00, 12.51it/s]
Loader: valid. Accuracy: 0.1225
How are you running the pippy training example?
Hey @hpc-unex, I am using torchrun, namely:
# On the first node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py
# On the second node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py
P.S.: I have two machines, each with 4 NVIDIA T4 GPUs
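For reference, this is roughly how those torchrun flags translate into per-process state inside the script (a minimal sketch using the standard torchrun environment variables; the actual pipeline construction is omitted):

import os
import torch

# With --nproc_per_node=4 and --nnodes=2, torchrun starts 8 processes in total
# and exposes each process's identity via environment variables.
rank = int(os.environ["RANK"])              # global rank: 0..7 across both nodes
local_rank = int(os.environ["LOCAL_RANK"])  # 0..3 on each node
world_size = int(os.environ["WORLD_SIZE"])  # 8

# Each process drives the GPU matching its local rank.
torch.cuda.set_device(local_rank)
print(f"rank {rank}/{world_size} on cuda:{local_rank}")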
I'm running it as in the example, calling it via sbatch -> srun -> python, and I'm not hitting this issue on CPU, where training works correctly. With GPU training, however, I get RPC problems:
1: [W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
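For what it's worth, my understanding is that a GPU run needs the RPC layer to be initialized with an explicit device map so TensorPipe can move CUDA tensors between workers. A minimal sketch of that setup follows; the worker names and the 0 -> 0 mapping are assumptions for illustration, not code taken from the example:

import os
import torch.distributed.rpc as rpc

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# TensorPipe backend options; the default init_method="env://" picks up
# MASTER_ADDR / MASTER_PORT set by the launcher.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
for peer in range(world_size):
    if peer != rank:
        # Map this worker's cuda:0 to the peer's cuda:0 (illustrative only;
        # a multi-GPU-per-node setup would map local devices instead).
        options.set_device_map(f"worker{peer}", {0: 0})

rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size,
             rpc_backend_options=options)
rpc.shutdown()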
Also, I have the same problem as you with pippy training. Did you find a solution?
Hi @hpc-unex, unfortunately not yet.