PiPPy
ResNet example always underfits when training with pippy
I'm experimenting with the pipelined ResNet training example (pippy_resnet.py) from https://github.com/pytorch/PiPPy/tree/main/examples/resnet. Specifically, I want to compare the loss when running locally on a single GPU versus running with pippy. I added some basic WandB monitoring to both runs.
When running locally, everything is fine: the loss clearly drops and the accuracy increases with each epoch. But as soon as I switch to the pippy example, everything changes dramatically: the model simply does not train, the loss does not fall, and the accuracy stays around 0.1-0.2.
I would be very grateful if someone could explain why this is happening or what I'm doing wrong.
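For context, the monitoring is very basic; below is a minimal sketch of what I log each epoch. run_epoch is a hypothetical stand-in for the actual train/eval loop in pippy_resnet.py, and the project and metric names are illustrative, not the exact code in my script:

import wandb

# Minimal sketch of the per-epoch logging; run_epoch stands in for the real
# train/eval loop in pippy_resnet.py and just returns (mean loss, accuracy).
def run_epoch(split, train):
    # ... forward/backward over the loader for this split ...
    return 0.0, 0.0

wandb.init(project="pippy-resnet", name="local-1gpu")  # or e.g. "pippy-2x4xT4"

for epoch in range(1, 4):
    train_loss, train_acc = run_epoch("train", train=True)
    valid_loss, valid_acc = run_epoch("valid", train=False)
    wandb.log(
        {"train/loss": train_loss, "train/accuracy": train_acc,
         "valid/loss": valid_loss, "valid/accuracy": valid_acc},
        step=epoch,
    )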
A snippet of the terminal logs from local training:
Using device: cuda
Files already downloaded and verified
Epoch: 1
100%|█████████████████████████████████████| 500/500 [02:36<00:00, 3.20it/s]
Loader: train. Accuracy: 0.44814
100%|█████████████████████████████████████| 100/100 [00:10<00:00, 9.73it/s]
Loader: valid. Accuracy: 0.5681
Epoch: 2
100%|█████████████████████████████████████| 500/500 [02:36<00:00, 3.19it/s]
Loader: train. Accuracy: 0.61832
100%|█████████████████████████████████████| 100/100 [00:10<00:00, 9.20it/s]
Loader: valid. Accuracy: 0.622
Epoch: 3
100%|█████████████████████████████████████| 500/500 [02:44<00:00, 3.04it/s]
Loader: train. Accuracy: 0.69844
100%|█████████████████████████████████████| 100/100 [00:10<00:00, 9.20it/s]
Loader: valid. Accuracy: 0.6618
A snippet of the terminal logs from pippy training:
Epoch: 1
100%|███████████████████████████████████| 1250/1250 [02:34<00:00, 8.07it/s]
Loader: train. Accuracy: 0.1428
100%|█████████████████████████████████████| 250/250 [00:14<00:00, 17.49it/s]
Loader: valid. Accuracy: 0.106
Epoch: 2
100%|███████████████████████████████████| 1250/1250 [02:50<00:00, 7.32it/s]
Loader: train. Accuracy: 0.14936
100%|█████████████████████████████████████| 250/250 [00:19<00:00, 13.12it/s]
Loader: valid. Accuracy: 0.1333
Epoch: 3
100%|███████████████████████████████████| 1250/1250 [03:03<00:00, 6.83it/s]
Loader: train. Accuracy: 0.13552
100%|█████████████████████████████████████| 250/250 [00:19<00:00, 12.51it/s]
Loader: valid. Accuracy: 0.1225
How are you running the pippy training example?
Hey @hpc-unex, I am using torchrun, namely:
# On the first node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py
# On the second node:
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=MASTER_IP:MASTER_PORT pippy_resnet.py
P.S.: I have two machines, each with 4 NVIDIA T4 GPUs
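For reference, this is roughly how those torchrun flags translate into per-process state inside the script (a minimal sketch using the standard torchrun environment variables; the actual pipeline construction is omitted):

import os
import torch

# With --nproc_per_node=4 and --nnodes=2, torchrun starts 8 processes in total
# and exposes each process's identity via environment variables.
rank = int(os.environ["RANK"])              # global rank: 0..7 across both nodes
local_rank = int(os.environ["LOCAL_RANK"])  # 0..3 on each node
world_size = int(os.environ["WORLD_SIZE"])  # 8

# Each process drives the GPU matching its local rank.
torch.cuda.set_device(local_rank)
print(f"rank {rank}/{world_size} on cuda:{local_rank}")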
I'm running it as in the example, calling it via sbatch -> srun -> python, and I'm not hitting this issue on CPU, where training works correctly. With GPU training, however, I get RPC problems:
1: [W tensorpipe_agent.cpp:726] RPC agent for worker1 encountered error when reading incoming request from worker0: eof (this error originated at tensorpipe/transport/shm/connection_impl.cc:259)
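For what it's worth, my understanding is that a GPU run needs the RPC layer to be initialized with an explicit device map so TensorPipe can move CUDA tensors between workers. A minimal sketch of that setup follows; the worker names and the 0 -> 0 mapping are assumptions for illustration, not code taken from the example:

import os
import torch.distributed.rpc as rpc

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# TensorPipe backend options; the default init_method="env://" picks up
# MASTER_ADDR / MASTER_PORT set by the launcher.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
for peer in range(world_size):
    if peer != rank:
        # Map this worker's cuda:0 to the peer's cuda:0 (illustrative only;
        # a multi-GPU-per-node setup would map local devices instead).
        options.set_device_map(f"worker{peer}", {0: 0})

rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size,
             rpc_backend_options=options)
rpc.shutdown()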
Also, I have the same problem as you with pippy training. Did you find a solution?
Hi @hpc-unex, unfortunately not yet.