Ananth Subramaniam
I believe this will be a prerequisite for further consolidation efforts like https://github.com/PyTorchLightning/pytorch-lightning/pull/11021 and https://github.com/PyTorchLightning/pytorch-lightning/pull/11020, since it moves the `setup_environment` logic of DDP and TPU spawn up.
Merging through https://github.com/pytorch/tnt/pull/179
You might find this library useful for such primitives, especially to support distributed checkpointing: https://github.com/pytorch/torchsnapshot @yifuwang
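For anyone landing here, a minimal sketch of what checkpointing with torchsnapshot looks like, based on its `Snapshot.take`/`restore` API (the model, optimizer, and path below are illustrative, not from this thread):

```python
import torch
from torchsnapshot import Snapshot

# Stateful objects to checkpoint; torchsnapshot handles nn.Module and
# optimizer state dicts for you (toy model/optimizer for illustration).
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
app_state = {"model": model, "optimizer": optimizer}

# Persist the application state to a directory.
snapshot = Snapshot.take(path="/tmp/my_snapshot", app_state=app_state)

# Later (or in another process), restore the same state in place.
snapshot = Snapshot(path="/tmp/my_snapshot")
snapshot.restore(app_state=app_state)
```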
I discussed this more with @rohan-varma. From the DDP `join` docs (https://pytorch.org/docs/stable/_modules/torch/nn/parallel/distributed.html#DistributedDataParallel.join):

> This module currently does not support custom distributed collective operations in the forward pass, such as SyncBatchNorm or...
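For reference, a minimal sketch of the `join()` context manager those docs describe, which lets ranks with uneven numbers of batches finish a DDP epoch without hanging (the model, data, and process-group setup here are illustrative):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train(rank: int, world_size: int) -> None:
    # Assumes MASTER_ADDR/MASTER_PORT are already set in the environment.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Give each rank a different number of batches to simulate uneven inputs.
    num_batches = 5 + rank

    # join() shadows the gradient allreduces for ranks that run out of data,
    # so the remaining ranks' backward() collectives do not hang. Per the
    # quoted docs, it does not cover custom forward-pass collectives such as
    # SyncBatchNorm.
    with model.join():
        for _ in range(num_batches):
            optimizer.zero_grad()
            loss = model(torch.randn(2, 4)).sum()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()
```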
cc @kandluis @aazolini @yifuwang, who were also curious whether there's a serialization format we stick to for the state dict, or if the contents of the state dict are considered...
Hi @ZhiyuanChen, thanks for creating this issue! Are you referring to memory increases when using multiple metrics like AUROC and AUPRC? If so, as pointed out, this implementation be...
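For context, AUROC and AUPRC are not decomposable, so their metric state caches every `preds`/`target` tensor passed to `update()` until `compute()` is called. A rough sketch of where that memory goes, assuming a recent torchmetrics release with the `BinaryAUROC`/`BinaryAveragePrecision` classes (data and sizes are illustrative):

```python
import torch
from torchmetrics.classification import BinaryAUROC, BinaryAveragePrecision

# Each update() appends the full preds/target tensors to the metric state,
# because AUROC and average precision need all scores to build the curve.
auroc = BinaryAUROC()
auprc = BinaryAveragePrecision()

for _ in range(100):
    preds = torch.rand(1024)                 # predicted probabilities
    target = torch.randint(0, 2, (1024,))    # binary labels
    auroc.update(preds, target)
    auprc.update(preds, target)

# Each metric now holds ~100 * 1024 cached predictions; tracking several such
# metrics multiplies that cached state, which is the memory growth in question.
print(auroc.compute(), auprc.compute())
```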
@yifuwang was this fixed by https://github.com/pytorch/torchsnapshot/pull/104 ?
@carmocca I cannot edit this issue. Would you please remove me, @ninginthecloud, @edward-io, and @jjenniferdai from being tagged? Thanks!
> A question I have about the usage, in DDP user should call Snapshot.take by all ranks ?

Yes, Snapshot.take should always be called on all ranks in a distributed...
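A hedged sketch of that call pattern: `Snapshot.take` acts as a collective, so every rank in the process group must reach it (the spawn and process-group setup below are illustrative, not from this thread):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torchsnapshot import Snapshot


def worker(rank: int, world_size: int) -> None:
    # Illustrative rendezvous settings for a single-host run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.parallel.DistributedDataParallel(torch.nn.Linear(4, 1))

    # Collective call: every rank must reach this line, or the job hangs.
    # Each rank contributes its shard of the state to the same snapshot path.
    Snapshot.take(path="/tmp/ddp_snapshot", app_state={"model": model})

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```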
Prior issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/3337