Ricky Das

13 comments by Ricky Das

@JackCaoG Do you have any insights on this issue?

PyTorch can do it because they are not doing distributed training using the trick XLA uses. For them it's native to torch itself; they use DDP modules...
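
For reference, a minimal sketch of what native DDP usage looks like in plain PyTorch (the nccl backend, the toy model, and the assumption that a launcher like torchrun sets LOCAL_RANK are all illustrative, not from the original thread):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch only: assumes one process per GPU, started by a launcher
# (e.g. torchrun) that sets LOCAL_RANK for each worker.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# DDP wraps the module and handles the gradient all-reduce natively,
# which is why PyTorch does not need the XLA-style device trick.
model = torch.nn.Linear(10, 10).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
```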

Excellent! That should solve the problem for now. I will try it out and post an update here, but it should work for sure. It is true that the trick of CUDA_VISIBLE_DEVICES...
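
In case it helps others, this is roughly what the CUDA_VISIBLE_DEVICES trick amounts to, as I understand it (the LOCAL_RANK variable is an assumption about how the launcher identifies each worker):

```python
import os

# Assumed: the launcher sets LOCAL_RANK per worker process.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Mask all but one GPU *before* torch is imported, so each worker
# only ever sees its own device as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank)

import torch

print(torch.cuda.device_count())  # 1
```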

I am also facing the same issue. Essentially, I get this problem when I try to train the model in a distributed manner. I tried every adjustment of the regularization parameters,...

Update on my issue: there was a problem with one of my GPUs in my multi-node, multi-GPU setup. Some gate must have been broken.

Closing this issue since there has been no activity.

I think we can use xla::norm directly. It supports p=0 and p=inf, so it should address all your comments.
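
To illustrate the semantics only (this is the Python-side torch.linalg.vector_norm, not xla::norm itself; it just shows what p=0 and p=inf compute):

```python
import torch

x = torch.tensor([3.0, 0.0, -4.0])

# ord=0 counts the nonzero entries; ord=inf takes the max absolute value.
print(torch.linalg.vector_norm(x, ord=0))             # tensor(2.)
print(torch.linalg.vector_norm(x, ord=float("inf")))  # tensor(4.)
```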

Same here; I always comment out the Normalization while training.