pytorch-transformer-distributed
Distributed training (multi-node) of a Transformer model
Distributed training of an attention model. Forked from: hkproj/pytorch-transformer
Instructions for Paperspace
Machines
Make sure to create everything in the same region. I used East Coast (NY2).
- Create 1x Private network. Assign both computers to the private network when creating the machines.
- Create 2x nodes of P4000x2 (multi-GPU) with ML-in-a-Box as the operating system
- Create 1 Network drive (250 GB)
Setup
Log in to each machine and perform the following operations:
- sudo apt-get update
- sudo apt-get install net-tools
- If you get an error about seahorse while installing net-tools, do the following:
  - sudo rm /var/lib/dpkg/info/seahorse.list
  - sudo apt-get install seahorse --reinstall
- Get each machine's private IP address using ifconfig
- Add the IP and hostname mapping of all the slave nodes to the /etc/hosts file of the master node
- Mount the network drive
  - sudo apt-get install smbclient
  - sudo apt-get install cifs-utils
  - sudo mkdir /mnt/training-data
  - Replace the following values in the command below:
    - NETWORK_DRIVE_IP with the IP address of the network drive
    - NETWORK_SHARE_NAME with the name of the network share
    - NETWORK_DRIVE_USERNAME with the username of the network drive
  - sudo mount -t cifs //NETWORK_DRIVE_IP/NETWORK_SHARE_NAME /mnt/training-data -o uid=1000,gid=1000,rw,user,username=NETWORK_DRIVE_USERNAME
  - Type the drive's password when prompted
- git clone https://github.com/hkproj/pytorch-transformer-distributed
- cd pytorch-transformer-distributed
- pip install -r requirements.txt
- Log in to Weights & Biases
  - wandb login
  - Copy the API key from the browser and paste it into the terminal
- Run the training command from below
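Before launching training, it can help to confirm on each machine that the network drive is really mounted and writable, since the training commands below write checkpoints under /mnt/training-data. The following is a minimal Python sketch of such a check using only the standard library. The function name check_shared_folder is hypothetical and is not part of this repository.

```python
import os
import tempfile
from typing import List


def check_shared_folder(path: str, require_mount: bool = True) -> List[str]:
    """Return a list of problems found with `path` (an empty list means OK).

    Hypothetical helper for illustration; not part of the repository.
    """
    problems: List[str] = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    if require_mount and not os.path.ismount(path):
        # The folder exists but nothing is mounted on it, so checkpoints
        # would land on the local disk instead of the network drive.
        problems.append(f"{path} is a plain directory, not a mount point")
    try:
        # Create and automatically delete a scratch file to prove write access.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {path}: {exc}")
    return problems


if __name__ == "__main__":
    for problem in check_shared_folder("/mnt/training-data"):
        print("WARNING:", problem)
```

Running the script prints nothing when the share is mounted and writable, and one warning line per problem otherwise.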
Local training
Run the following command on one machine only. Do not run it on both at the same time: both would write to the same --model_folder on the network drive and overwrite each other's checkpoints.
torchrun --nproc_per_node=2 --nnodes=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"
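torchrun starts one process per GPU and tells each process its position through the environment variables RANK, LOCAL_RANK and WORLD_SIZE. The sketch below shows one common way a training script can read them and let a single process write checkpoints. It illustrates the pattern only and is not necessarily how train.py in this repository is written. DistEnv and read_dist_env are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process (and its GPU) on the current node
    world_size: int  # total number of processes across all nodes

    @property
    def is_main_process(self) -> bool:
        # By convention the process with global rank 0 writes checkpoints.
        return self.rank == 0


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Read the variables set by torchrun, falling back to single-process values."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    dist = read_dist_env()
    print(f"rank {dist.rank}/{dist.world_size}, local GPU {dist.local_rank}")
    if dist.is_main_process:
        print("this process would be the one saving checkpoints")
```

The fallback values mean the same script also runs without torchrun, as a single process with rank 0.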
Distributed training
Run the following command on each machine (replace IP_ADDR_MASTER_NODE with the IP address of the master node):
torchrun --nproc_per_node=2 --nnodes=2 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=IP_ADDR_MASTER_NODE:48123 train.py --batch_size 8 --model_folder "/mnt/training-data/weights"
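As a rough guide to what these flags imply, the total number of training processes is nnodes multiplied by nproc_per_node. Assuming --batch_size applies to each process separately (an assumption; check train.py to confirm), the number of samples consumed per optimization step scales the same way. A small sketch of the arithmetic, with hypothetical helper names:

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of processes launched by torchrun across all nodes."""
    return nnodes * nproc_per_node


def global_batch_size(nnodes: int, nproc_per_node: int, per_process_batch: int) -> int:
    """Samples per step, ASSUMING --batch_size is a per-process value."""
    return world_size(nnodes, nproc_per_node) * per_process_batch


if __name__ == "__main__":
    # Local training command: 1 node x 2 GPUs, --batch_size 8
    print(world_size(1, 2), global_batch_size(1, 2, 8))  # 2 16
    # Distributed training command: 2 nodes x 2 GPUs, --batch_size 8
    print(world_size(2, 2), global_batch_size(2, 2, 8))  # 4 32
```

Under that assumption, moving from the local to the distributed command doubles the samples per step, which may matter when comparing loss curves between the two runs.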
Monitoring
Log in to Weights & Biases to monitor the training progress: https://app.wandb.ai/