distributed_tutorial
Thanks for the great tutorial. One thing I still don't understand: how are the master address and port determined? Are these set by my machine, i.e. if I have a...
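For reference, a minimal sketch of how the rendezvous address is usually set up: you choose the values yourself, and rank 0 listens there so all processes can find each other. `localhost` and `12355` below are placeholders, not values mandated by PyTorch or the tutorial:

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Rank 0 listens at MASTER_ADDR:MASTER_PORT and every other process
    # connects there to rendezvous; the values are your choice.
    os.environ['MASTER_ADDR'] = 'localhost'  # IP/hostname of the rank-0 node
    os.environ['MASTER_PORT'] = '12355'      # any free port on that node
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
```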
Where does dist.destroy_process_group() go in your DDP MNIST example (https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py)?
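A minimal runnable sketch of one common placement (my assumption, not taken from mnist-mixed.py): teardown happens as the very last step of each spawned worker, after the final epoch, here shown with the CPU-friendly `gloo` backend:

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def train(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    try:
        pass  # model setup and the epoch loop go here, as in mnist-mixed.py
    finally:
        # Last statement each worker runs: tear down the process group so
        # every spawned process releases its resources and exits cleanly.
        dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(train, nprocs=2, args=(2,))
```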
I saw the tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#save-and-load-checkpoints):

```python
def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = ...
```
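For context, the pattern that tutorial section continues with is: rank 0 saves, everyone waits at a barrier, then every rank loads with a `map_location` onto its own device. A condensed sketch continuing the quoted snippet (so `rank`, `ddp_model`, `dist`, and `torch` are assumed in scope; the checkpoint path is a placeholder):

```python
CHECKPOINT_PATH = '/tmp/model.checkpoint'  # placeholder path

if rank == 0:
    # Only one process writes, so the ranks don't race on the file.
    torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

# Make sure the file exists before anyone tries to read it.
dist.barrier()

# Remap tensors saved from cuda:0 onto this rank's own device.
map_location = {'cuda:0': f'cuda:{rank}'}
ddp_model.load_state_dict(
    torch.load(CHECKPOINT_PATH, map_location=map_location))
```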
Hi, thanks for the easy-to-follow tutorial on distributed processing. I followed your example, and it works fine on a single multi-GPU system. On running it on multiple nodes with 2...
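For anyone else debugging multi-node runs, the two details that usually matter are: every node must point MASTER_ADDR/MASTER_PORT at the same node-0 address, and each process's global rank must be computed from its node index. A sketch of that arithmetic; the argument names (`nodes`, `gpus`, `nr` for node rank) follow the blog's style but are assumptions here, and the IP and port are placeholders:

```python
import os
import torch.distributed as dist

def train(gpu, args):
    # Global rank = node index * GPUs per node + local GPU index,
    # so every process across all nodes gets a unique rank.
    rank = args.nr * args.gpus + gpu
    # Every node must point at the same rendezvous address: node 0's IP.
    os.environ['MASTER_ADDR'] = '192.168.1.1'  # placeholder: node 0's IP
    os.environ['MASTER_PORT'] = '8888'         # placeholder: a free port
    dist.init_process_group(
        backend='nccl', init_method='env://',
        world_size=args.nodes * args.gpus, rank=rank)
```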
You write in your blog (https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html): "It's also possible to have multiple worker processes that fetch data for each GPU." How can I enable this? I am running...
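That sentence most likely refers to the `num_workers` argument of `DataLoader`, which forks that many data-fetching processes for each training process (i.e. per GPU). A sketch continuing the tutorial's training function, where `train_dataset`, `world_size`, and `rank` are already defined; `num_workers=4` is an arbitrary choice:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, num_replicas=world_size, rank=rank)
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    shuffle=False,   # the sampler already shuffles per epoch
    num_workers=4,   # four fetch processes serving this GPU's process
    pin_memory=True,
    sampler=sampler)
```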
I noticed that the PyTorch tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) uses "Save and Load Checkpoints" to synchronize the models across processes. So, I want to know if there...
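As far as I understand, the checkpoint round-trip is not what keeps the replicas synchronized: DDP broadcasts rank 0's parameters when the wrapper is constructed and all-reduces gradients on every step, so the copies never diverge. A small sanity check illustrating this (an illustration of mine, not code from either tutorial; `model` and `rank` are assumed from the surrounding setup):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, device_ids=[rank])  # broadcasts rank 0's weights to all ranks

# Sanity check: every rank's parameters should now equal rank 0's.
for p in ddp_model.parameters():
    reference = p.detach().clone()
    dist.broadcast(reference, src=0)           # fetch rank 0's copy
    assert torch.equal(reference, p.detach())  # identical on this rank too
```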
How do I add validation evaluation to DDP? Is it the same as for training? @yangkky
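One workable pattern (my assumption; the tutorial itself does not cover validation): give the val set its own DistributedSampler so each rank scores a shard, then all-reduce the counts to get global metrics. A sketch assuming one GPU per process, an initialized process group, and a `val_loader` built with a DistributedSampler:

```python
import torch
import torch.distributed as dist

def validate(ddp_model, val_loader, rank):
    ddp_model.eval()
    correct = torch.zeros(1, device=rank)
    total = torch.zeros(1, device=rank)
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(rank), labels.to(rank)
            preds = ddp_model(images).argmax(dim=1)
            correct += (preds == labels).sum()
            total += labels.size(0)
    # Sum the per-shard counts across all ranks (default op is SUM).
    dist.all_reduce(correct)
    dist.all_reduce(total)
    if rank == 0:
        print(f'val acc: {(correct / total).item():.4f}')
```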
https://github.com/yangkky/distributed_tutorial/blob/24467967c1c719110c33fccca69353ad8e5ae2e4/src/mnist-mixed.py#L108-L114

Could you add the model-saving line to the example to make it more complete? Thanks!

```python
torch.save(model.state_dict(), CHECKPOINT_PATH)
```
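One detail worth adding alongside that line (an assumption on my part, not something the linked lines do): guard the save so only one process writes, since under DDP every rank holds identical weights and concurrent writes to the same path can race. Here `rank` is assumed to be the process's global rank from the example's setup:

```python
if rank == 0:
    # All replicas are identical under DDP, so one writer is enough,
    # and it avoids several processes racing on the same file.
    torch.save(model.state_dict(), CHECKPOINT_PATH)
```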
Hi, thanks for the excellent example of using DistributedDataParallel in PyTorch; it is very easy to understand and much better than the PyTorch docs. One important bit that is missing...
Hi, I tried running my code like your example, and I got this error:

```
  File "artGAN512_impre_v8.py", line 286, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
...
```
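Without the full traceback it is hard to say, but a frequent cause of `mp.spawn` errors is the worker's signature: `mp.spawn(fn, nprocs=N, args=extra)` invokes `fn(i, *extra)`, so the function must accept the process index as its first positional argument. A minimal runnable check (the function and argument names are illustrative):

```python
import torch.multiprocessing as mp

def train(gpu, args):
    # mp.spawn supplies the process index (0..nprocs-1) as the first
    # argument automatically; the contents of `args` arrive after it.
    print(f'worker {gpu} started with args={args}')

if __name__ == '__main__':
    mp.spawn(train, nprocs=2, args=({'lr': 0.1},))
```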