Will Cromar

Results: 22 comments of Will Cromar

Simple test case to sanity-check that collectives work as expected: ``` $ gcloud compute tpus tpu-vm ssh --project=tpu-pytorch --zone=us-central2-b wcromar-v4-32 --internal-ip --worker=all --command 'PJRT_DEVICE=TPU python3 -c " import torch_xla.core.xla_model as...
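
Since the command above is cut off, here is a minimal sketch of the kind of sanity check I mean, assuming `xmp.spawn` with `PJRT_DEVICE=TPU` set on each TPU VM worker; the shapes and spawn arguments are illustrative, not the exact snippet:

```python
# Hypothetical sanity check (not the truncated snippet above): each replica
# all-reduces its ordinal, so every device should print the sum of all
# ordinals if the collective works.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  device = xm.xla_device()
  value = torch.tensor([float(xm.get_ordinal())], device=device)
  result = xm.all_reduce(xm.REDUCE_SUM, value)
  xm.mark_step()
  print(f'ordinal {xm.get_ordinal()}: {result.cpu().item()}')


if __name__ == '__main__':
  # Run with PJRT_DEVICE=TPU in the environment on each worker.
  xmp.spawn(_mp_fn, args=())
```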

@ronghanghu Thanks for flagging the issue with the other collectives. I did check `all_gather` as well, but I didn't think to try with `pin_layout=False`. This snippet gives the expected results:...
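
For reference, since the snippet above is truncated, this is roughly what I mean by checking `all_gather` with `pin_layout=False`; it's a sketch, not the exact code from the comment:

```python
# Hypothetical check: gather every replica's ordinal with pin_layout=False.
# Expected result on each device: [0, 1, ..., world_size - 1].
import torch
import torch_xla.core.xla_model as xm


def gather_ordinals():
  device = xm.xla_device()
  value = torch.tensor([xm.get_ordinal()], device=device)
  gathered = xm.all_gather(value, dim=0, pin_layout=False)
  xm.mark_step()
  return gathered.cpu()
```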

Also, `xm.rendezvous` doesn't work yet, but another early tester told us that they were able to work around it by creating a `gloo` process group and using `dist.barrier`.
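
A rough sketch of that workaround, assuming the usual `torch.distributed` environment variables (`MASTER_ADDR`, `MASTER_PORT`) are set; the helper names are mine, not the tester's code:

```python
# CPU-only synchronization while xm.rendezvous is unavailable under PJRT:
# create a gloo process group and use dist.barrier on it.
import torch.distributed as dist
import torch_xla.core.xla_model as xm


def init_cpu_barrier():
  dist.init_process_group(
      backend='gloo',
      rank=xm.get_ordinal(),
      world_size=xm.xrt_world_size())


def cpu_barrier():
  # Blocks until every process in the gloo group has reached this point.
  dist.barrier()
```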

`barrier` will almost certainly not work with threads if you use the global default process group (i.e., the one created by `init_process_group`), because each thread will use the same PG. It might work...
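
To make that concrete, here is a minimal sketch of where the default group comes in; passing an explicit group (e.g. one from `dist.new_group`) only changes which group you synchronize on and does not give each thread its own PG, which is the core issue:

```python
import torch.distributed as dist


def sync(group=None):
  # group=None means the global default process group created by
  # init_process_group, which every thread in this process shares. An
  # explicit group synchronizes only on that group, but the threads in a
  # process still share it.
  dist.barrier(group=group)
```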

Thanks @JackCaoG. I'll work on a README showing how to port from XRT to PjRt and how to run models without `xla_dist`.

Thanks @ronghanghu for the awesome detail on this issue! You touched on a few issues here. For now, I'll focus on implementing `xm.get_ordinal`, `xm.xla_device`, etc. to have the right default...
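
As a concrete reference for the defaults in question, here is a small sketch of what those APIs should report in each process; the helper is mine, and the exact values depend on the TPU topology:

```python
# Print the device/ordinal view each process should see once the defaults
# are correct (values here depend on the TPU topology).
import torch_xla.core.xla_model as xm


def report_topology():
  device = xm.xla_device()        # device assigned to this process
  ordinal = xm.get_ordinal()      # global ordinal in [0, world_size)
  local = xm.get_local_ordinal()  # ordinal within this host
  world = xm.xrt_world_size()     # total number of devices
  print(f'{device}: ordinal={ordinal} local_ordinal={local} world_size={world}')
```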

The new test passed on CPU but flaked on GPU: ``` 2022-08-10 23:44:46.343597: W 562435 tensorflow/core/profiler/lib/profiler_session.cc:107] Profiling is late by 1594605 nanoseconds and will start immediately. 2022-08-10 23:44:48.736212: W 562436...

The GPU test has `Profiling is late by 1594605 nanoseconds` vs CPU with `Profiling is late by 760266 nanoseconds`. Maybe the XLA execution had finished by the time tracing had...

There is some lag between when the tracer thread starts and when it actually starts tracing, long enough that XLA execution can finish before it starts. This test was only...
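
For illustration, a sketch of the timing involved using `torch_xla.debug.profiler`; the sleep and the workload are illustrative guesses, not the actual test code:

```python
# Request a trace from a background thread, then wait briefly so the tracer
# has actually started capturing before the (short) XLA execution runs.
import threading
import time

import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.profiler as xp

server = xp.start_server(9012)

tracer = threading.Thread(
    target=xp.trace, args=('localhost:9012', '/tmp/profile'),
    kwargs={'duration_ms': 2000})
tracer.start()
time.sleep(1)  # illustrative delay so tracing begins before execution does

device = xm.xla_device()
a = torch.ones(1000, 1000, device=device)
result = a @ a
xm.mark_step()
tracer.join()
```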

The test still flaked on GPU (even though it only uses the CPU device), and I can't reproduce the error locally. Removing it from the CI tests since I'll have to add a TPU...