
I rebased my branch to pick up the latest code on the main branch, and here is the new [error](https://gist.github.com/vanbasten23/fa98d1fbd46bf32471c4677ca93bc754) that I got. It fails at https://github.com/huggingface/accelerate/blob/2ad42e77c3a1993dbfb9bc299c21bae2005c0572/src/accelerate/test_utils/scripts/test_script.py#L751. It looks like it failed...

I reverted the change locally in https://github.com/huggingface/accelerate/pull/2542/files#diff-d9858283a2ced902233727f6fddde0a00831ad9a66a069e57231a5057d550bf6 and I still got the same error.

@ysiraichi The regression you saw might be due to https://github.com/pytorch/xla/pull/6677 (the OpenXLA pin update). Our team is looking into this issue.

> Do we have bandwidth to test this one? Otherwise we can merge and see if DDP test started to fail tmr....

I'm running the tests in https://github.com/pytorch/xla/pull/6624#issuecomment-1984717508.

@ysiraichi sorry for the delayed response. I tested on my v3-8. Before this PR (master branch 6ac32233a238cfb351f9aa87dfd0308ecf547a96):

```
root@67df528db184:/ansible# PJRT_DEVICE=TPU python pytorch/xla/test/test_train_mp_imagenet.py --model=resnet50 --log_steps=200 --ddp --pjrt_distributed --fake_data --batch_size=256
Epoch 1...
```
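For reference, here is a minimal sketch of the DDP-on-XLA setup that command exercises (a toy model rather than the actual imagenet script, and it assumes a torch_xla build where the `xla` distributed backend and the `xla://` init method are available):

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401 -- registers the 'xla' backend
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.nn.parallel import DistributedDataParallel as DDP

def _mp_fn(index):
    # One process per TPU core; rank/world size come from the XLA runtime.
    dist.init_process_group('xla', init_method='xla://')
    device = xm.xla_device()
    model = DDP(torch.nn.Linear(128, 10).to(device), gradient_as_bucket_view=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(10):
        optimizer.zero_grad()
        x = torch.randn(256, 128, device=device)
        loss = model(x).sum()
        loss.backward()   # DDP all-reduces gradients in its backward hooks
        optimizer.step()
        xm.mark_step()    # cut and execute the lazy graph for this step

if __name__ == '__main__':
    xmp.spawn(_mp_fn)
```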

I'm using CUDA 12.1 and I didn't see the error. I got the trace this way:

```
# in my container
root@xiowei-gpu:/ansible# PJRT_DEVICE=CUDA python pytorch/xla/test/spmd/test_train_spmd_imagenet.py --fake_data --batch_size 16 --model=resnet50 --sharding=batch...
```
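For context, a trace like this is only capturable because the training process runs a profiler server; a minimal sketch of that piece, assuming the standard torch_xla profiler API (the port 9012 is arbitrary):

```python
import torch_xla.debug.profiler as xp

# Keep the handle alive for the whole run; a separate process can then
# connect to localhost:9012 and capture a trace while training runs.
server = xp.start_server(9012)
```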

With the profile I captured, I can see a very nice memory viewer:

![image](https://github.com/pytorch/xla/assets/5279639/06e3bb0c-ce93-4d7c-9f9b-e86b5b91446a)

It's probably not as helpful as the torch memory viewer in terms of debugging OOM issues....

You can find tutorials here: https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm#profiling. Note that you may need to use the profiler from torch_xla instead of the one from PyTorch: `import torch_xla.debug.profiler as xp`.
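As a rough sketch of what the tutorial sets up (the names, port, and loop below are illustrative, not the tutorial's exact code):

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

server = xp.start_server(9012)  # the capture step in the tutorial connects here

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)

for step in range(100):
    # StepTrace annotates the step for the trace viewer and calls
    # mark_step() on exit, so the graph is cut at step boundaries.
    with xp.StepTrace('train_step', step_num=step):
        with xp.Trace('forward'):
            out = model(torch.randn(4, 8, device=device))
        out.sum().backward()
```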

Hey @mars1248, I did see the same warning as yours with `--duration_ms 2000`, so I followed the warning's advice ("you may try to profile longer") and used a larger duration,...
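If you're capturing from Python rather than the helper script, I believe that flag maps onto `duration_ms` in `xp.trace`; a hedged sketch (the address, logdir, and values below are illustrative):

```python
import torch_xla.debug.profiler as xp

xp.trace(
    'localhost:9012',        # profiler server started by the training process
    '/tmp/xla_profile',      # TensorBoard logdir the trace is written into
    duration_ms=20000,       # profile for 20s instead of 2s, per the warning
    num_tracing_attempts=3,  # retry if an attempt captures no data
)
```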