nanoGPT
DistributedSampler
Hello,
Could someone explain to me how the dataset is divided between all the GPUs? I know that PyTorch has something like DistributedSampler for that, but I don't understand how it is done here in train.py. Thank you!
I have the same problem as you. Do you have any ideas?
I believe I understand it now. When we call init_process_group, all processes become aware of each other. Each process gets its own random seed (seed_offset = ddp_rank # each process gets a different seed), and the device variable is also set per process to device = f'cuda:{ddp_local_rank}'.
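Roughly, the relevant setup looks like this (a condensed sketch, not the exact code in the repo, so details may differ a bit from the current train.py):

```python
import os
import torch
from torch.distributed import init_process_group

# Condensed sketch of the DDP setup: each process figures out its rank,
# pins itself to one GPU, and derives its own random seed from its rank.
ddp = int(os.environ.get('RANK', -1)) != -1   # are we running under torchrun?
if ddp:
    init_process_group(backend='nccl')
    ddp_rank = int(os.environ['RANK'])              # global rank across all processes
    ddp_local_rank = int(os.environ['LOCAL_RANK'])  # rank on this node
    device = f'cuda:{ddp_local_rank}'               # each process gets its own GPU
    torch.cuda.set_device(device)
    seed_offset = ddp_rank                          # each process gets a different seed
else:
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    seed_offset = 0

# different seed per rank -> different random batch offsets per rank
torch.manual_seed(1337 + seed_offset)
```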
In get_batch() each process transfers the data to its own GPU when .to(device) is called. Since each process has a different random seed, they mostly train on different data. Finally, in the training loop the gradient synchronization is set up right before the backpropagation. It's also worth mentioning that calling pin_memory() on the data allows us to move it to the GPU asynchronously with non_blocking=True, and the transfer itself is also faster thanks to the pinned memory. That asynchronous transfer happens in the training loop, where Karpathy calls get_batch() right after launching the forward pass; even though this is Python, PyTorch lets this part of the code run in parallel.
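As a rough illustration (a toy sketch rather than the repo's exact code; the data array and sizes below are just placeholders for the memmapped token file), each rank samples its own random offsets, pins the host tensors, and copies them to its GPU asynchronously:

```python
import numpy as np
import torch

block_size, batch_size = 8, 4                 # toy sizes for illustration
data = np.arange(10_000, dtype=np.uint16)     # placeholder for the train.bin tokens
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

def get_batch():
    # each process draws its own random offsets; because the ranks have
    # different seeds, they mostly end up seeing different slices of the data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
    if device.startswith('cuda'):
        # pinned (page-locked) host memory allows a faster, asynchronous copy to the GPU
        x = x.pin_memory().to(device, non_blocking=True)
        y = y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y
```

The overlap then comes from the ordering in the training loop: the forward pass is launched (queuing work on the GPU), get_batch() is called right afterwards to prepare and start copying the next batch while the GPU is still busy, and only then does backward() run, during which DDP synchronizes the gradients across processes.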
I am trying to train this on an NVIDIA AGX Orin dev kit but I'm getting an error that module 'torch.distributed' has no attribute 'init_process_group'. Any help would be greatly appreciated.
Thank you @caiodataopshouse, that was helpful!