amithrm

Results 16 comments of amithrm

@JackCaoG sure, will add tests

@JackCaoG I changed the initialization a bit to take into account how SLURM configures the devices. Please take a look at it, and also at the test cases. All of these...
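
For context, here is a rough, hypothetical sketch of the kind of SLURM-driven setup I mean. The variables below (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID) are standard SLURM environment variables, but the actual initialization code in the PR may differ:

```
# Hypothetical sketch only -- not the PR code. It shows how process placement
# can be derived from the environment variables SLURM sets for each srun task.
import os


def slurm_process_info():
    global_rank = int(os.environ.get('SLURM_PROCID', 0))  # global task index
    world_size = int(os.environ.get('SLURM_NTASKS', 1))   # total number of tasks
    local_rank = int(os.environ.get('SLURM_LOCALID', 0))  # task index on this node
    return global_rank, world_size, local_rank


if __name__ == '__main__':
    rank, world, local = slurm_process_info()
    print('rank={} world_size={} local_rank={}'.format(rank, world, local))
```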

We did some internal testing. It appears that at scale we see issues with the setup of gRPC channels. We should understand if you see similar issues at your...

@JackCaoG A simple test that you can run on GPU-XLA:

```
import sys
import torch
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
import os


def _mp_fn(index):
    print('XRT_LOCAL_WORKER:{}'.format(os.environ['XRT_LOCAL_WORKER']))
    print('XRT_DEVICE_MAP:{}'.format(os.environ['XRT_DEVICE_MAP']))
    print('XRT_WORKERS:{}'.format(os.environ['XRT_WORKERS']))
    print('XRT_HOST_WORLD_SIZE:{}'.format(os.environ['XRT_HOST_WORLD_SIZE']))
    ...
```

Run command:

```
GPU_NUM_DEVICES=2 python3 allreduce_xla.py
```

This will output:

```
XRT_LOCAL_WORKER:localservice:0
XRT_DEVICE_MAP:GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|GPU:1;/job:localservice/replica:0/task:1/device:XLA_GPU:0
XRT_WORKERS:localservice:0;grpc://dfda805bbe4b:49887|localservice:1;grpc://dfda805bbe4b:33097
XRT_LOCAL_WORKER:localservice:1
XRT_DEVICE_MAP:GPU:0;/job:localservice/replica:0/task:0/device:XLA_GPU:0|GPU:1;/job:localservice/replica:0/task:1/device:XLA_GPU:0
XRT_WORKERS:localservice:0;grpc://dfda805bbe4b:49887|localservice:1;grpc://dfda805bbe4b:33097
```

If you look at XRT_WORKERS, it has the gRPC string for each worker. This won't scale...
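
For completeness, here is a minimal, hypothetical sketch of what allreduce_xla.py could look like. The env-var prints match the snippet above; everything past them (the all-reduce body and the spawn call) is my assumption, not the exact test file:

```
# Hypothetical allreduce_xla.py sketch; only the env-var prints are taken from
# the snippet above, the rest is an assumed minimal all-reduce exercise.
import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Dump the XRT topology torch_xla derived for this process.
    print('XRT_LOCAL_WORKER:{}'.format(os.environ['XRT_LOCAL_WORKER']))
    print('XRT_DEVICE_MAP:{}'.format(os.environ['XRT_DEVICE_MAP']))
    print('XRT_WORKERS:{}'.format(os.environ['XRT_WORKERS']))

    # Trivial all-reduce to exercise the collective path across workers.
    device = xm.xla_device()
    t = torch.ones(4, device=device) * (index + 1)
    t = xm.all_reduce(xm.REDUCE_SUM, t)
    xm.mark_step()
    print('rank {} all_reduce result: {}'.format(index, t.cpu()))


if __name__ == '__main__':
    xmp.spawn(_mp_fn, args=())
```

Running it with GPU_NUM_DEVICES=2 as above should produce the env-var dump shown, plus the reduced tensors.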

Hi Jack, thanks for the pointers! I went over the code flow. The xmp.spawn() code pasted above takes the same code path as the GPU_NUM_DEVICES flow. In my understanding (I will...

Looks like the file needed for the test (allreduce_torchrun.py) is not getting picked up. Checking with @will-cromar on how to fix this. Some yapf fixes are also pending in one file.

@will-cromar I see a build failure: `NameError: name 'sympy' is not defined`