distributed_tutorial
Thanks for the great tutorial. One thing I still don't understand: how are the master address and port determined? Are these set by my machine, i.e. if I have a...
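For reference, a minimal sketch of how the rendezvous address is usually set up: you choose the values yourself, and rank 0 listens there so all processes can find each other. `localhost` and `12355` below are placeholders, not values mandated by PyTorch or the tutorial:

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Rank 0 listens at MASTER_ADDR:MASTER_PORT and every other process
    # connects there to rendezvous; the values are your choice.
    os.environ['MASTER_ADDR'] = 'localhost'  # IP/hostname of the rank-0 node
    os.environ['MASTER_PORT'] = '12355'      # any free port on that node
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
```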
Where does dist.destroy_process_group() go in your DDP MNIST example (https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py)?
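A minimal runnable sketch of one common placement (my assumption, not taken from mnist-mixed.py): teardown happens as the very last step of each spawned worker, after the final epoch, here shown with the CPU-friendly `gloo` backend:

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    try:
        pass  # model setup and the epoch loop go here, as in mnist-mixed.py
    finally:
        # Last statement each worker runs: tear down the process group so
        # every spawned process releases its resources and exits cleanly.
        dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(train, nprocs=2, args=(2,))
```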
I saw the tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints):

```python
def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = ...
```
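For context, the pattern that tutorial section continues with is: rank 0 saves, everyone waits at a barrier, then every rank loads with a `map_location` onto its own device. A condensed sketch continuing the quoted snippet (so `rank`, `ddp_model`, `dist`, and `torch` are assumed in scope; the checkpoint path is a placeholder):

```python
CHECKPOINT_PATH = '/tmp/model.checkpoint'  # placeholder path

if rank == 0:
    # Only one process writes, so the ranks don't race on the file.
    torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

# Make sure the file exists before anyone tries to read it.
dist.barrier()

# Remap tensors saved from cuda:0 onto this rank's own device.
map_location = {'cuda:0': f'cuda:{rank}'}
ddp_model.load_state_dict(
    torch.load(CHECKPOINT_PATH, map_location=map_location))
```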
Hi, thanks for the easy-to-follow tutorial on distributed processing. I followed your example, and it works fine on a single multi-GPU system. On running it on multiple nodes with 2...
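For anyone else debugging multi-node runs, the two details that usually matter are: every node must point MASTER_ADDR/MASTER_PORT at the same node-0 address, and each process's global rank must be computed from its node index. A sketch of that arithmetic; the argument names (`nodes`, `gpus`, `nr` for node rank) follow the blog's style but are assumptions here, and the IP and port are placeholders:

```python
import os
import torch.distributed as dist

def train(gpu, args):
    # Global rank = node index * GPUs per node + local GPU index,
    # so every process across all nodes gets a unique rank.
    rank = args.nr * args.gpus + gpu
    # Every node must point at the same rendezvous address: node 0's IP.
    os.environ['MASTER_ADDR'] = '192.168.1.1'  # placeholder: node 0's IP
    os.environ['MASTER_PORT'] = '8888'         # placeholder: a free port
    dist.init_process_group(
        backend='nccl', init_method='env://',
        world_size=args.nodes * args.gpus, rank=rank)
```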
You write in your blog (https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html): "It's also possible to have multiple worker processes that fetch data for each GPU." How can I enable this? I am running...
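That sentence most likely refers to the `num_workers` argument of `DataLoader`, which forks that many data-fetching processes for each training process (i.e. per GPU). A sketch continuing the tutorial's training function, where `train_dataset`, `world_size`, and `rank` are already defined; `num_workers=4` is an arbitrary choice:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    shuffle=False,   # the sampler already shuffles per epoch
    num_workers=4,   # four fetch processes serving this GPU's process
    pin_memory=True,
    sampler=sampler)
```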
I noticed that the PyTorch tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) uses "Save and Load Checkpoints" to synchronize the models across processes. So, I want to know if there...
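As far as I understand, the checkpoint round-trip is not what keeps the replicas synchronized: DDP broadcasts rank 0's parameters when the wrapper is constructed and all-reduces gradients on every step, so the copies never diverge. A small sanity check illustrating this (an illustration of mine, not code from either tutorial; `model` and `rank` are assumed from the surrounding setup):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, device_ids=[rank])  # broadcasts rank 0's weights to all ranks

# Sanity check: every rank's parameters should now equal rank 0's.
for p in ddp_model.parameters():
    reference = p.detach().clone()
    dist.broadcast(reference, src=0)           # fetch rank 0's copy
    assert torch.equal(reference, p.detach())  # identical on this rank too
```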
How do I add validation evaluation to DDP? Is it the same as for training? @yangkky
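One workable pattern (my assumption; the tutorial itself does not cover validation): give the val set its own DistributedSampler so each rank scores a shard, then all-reduce the counts to get global metrics. A sketch assuming one GPU per process, an initialized process group, and a `val_loader` built with a DistributedSampler:

```python
import torch
import torch.distributed as dist

def validate(ddp_model, val_loader, rank):
    ddp_model.eval()
    correct = torch.zeros(1, device=rank)
    total = torch.zeros(1, device=rank)
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(rank), labels.to(rank)
            preds = ddp_model(images).argmax(dim=1)
            correct += (preds == labels).sum()
            total += labels.size(0)
    # Sum the per-shard counts across all ranks (default op is SUM).
    dist.all_reduce(correct)
    dist.all_reduce(total)
    if rank == 0:
        print(f'val acc: {(correct / total).item():.4f}')
```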
https://github.com/yangkky/distributed_tutorial/blob/24467967c1c719110c33fccca69353ad8e5ae2e4/src/mnist-mixed.py#L108-L114

Could you add the model-saving line to the example to make it more complete? Thanks!

```python
torch.save(model.state_dict(), CHECKPOINT_PATH)
```
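One detail worth adding alongside that line (an assumption on my part, not something the linked lines do): guard the save so only one process writes, since under DDP every rank holds identical weights and concurrent writes to the same path can race. Here `rank` is assumed to be the process's global rank from the example's setup:

```python
if rank == 0:
    # All replicas are identical under DDP, so one writer is enough,
    # and it avoids several processes racing on the same file.
    torch.save(model.state_dict(), CHECKPOINT_PATH)
```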
Hi, thanks for the excellent example of using DistributedDataParallel in PyTorch; it is very easy to understand and much better than the PyTorch docs. One important bit that is missing...
Hi, I tried running my code like your example, and I got this error:

```
  File "artGAN512_impre_v8.py", line 286, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
...
```
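Without the full traceback it is hard to say, but a frequent cause of `mp.spawn` errors is the worker's signature: `mp.spawn(fn, nprocs=N, args=extra)` invokes `fn(i, *extra)`, so the function must accept the process index as its first positional argument. A minimal runnable check (the function and argument names are illustrative):

```python
import torch.multiprocessing as mp

def train(gpu, args):
    # mp.spawn supplies the process index (0..nprocs-1) as the first
    # argument automatically; the contents of `args` arrive after it.
    print(f'worker {gpu} started with args={args}')

if __name__ == '__main__':
    mp.spawn(train, nprocs=2, args=({'lr': 0.1},))
```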