Will Cromar
This test specifically checks for the PJRT traces I added in this PR. It's also more isolated (no dependency on another test) and uses a much smaller graph,...
1. @JackCaoG is correct. Each process only has two cores, so this is expected. I also expect starting multiple profiler servers and tracing all of them to show more cores,...
I'm working on a broader refactoring of our build process in #4044, especially with an eye toward reducing the size of the final release image. I don't see anything in...
We'll keep the current build scripts through 1.13. I believe the current build process uses `buster` like my WIP script (except for the current GPU build), so I don't expect...
Yeah, I expect the issue is that the PyTorch version in the notebook is not compatible with the latest versions of `diffusers` etc.
Crashing on this operation is one thing. Falling back to CPU to avoid the crash seems reasonable to me in this case. The _way_ we crash is the real issue IMO...
I reproduced the error to get the line number where we terminate the process: https://github.com/pytorch/xla/blob/6cc5c3819c09c7b1b4ca4927d7fa65133f95b41c/torch_xla/csrc/runtime/debug_macros.h#L20 So this has to do with our implementation of `XLA_CHECK_OK`. Ideally we want to make...
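For illustration, here's a minimal sketch of the abort-on-failure pattern (assuming an `absl::Status`-like type; `Status` and `MY_CHECK_OK` are hypothetical names, not the real macros in `debug_macros.h`):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-in for absl::Status.
struct Status {
  bool ok;
  std::string message;
};

// On failure, log and terminate. There is no way for Python code to
// recover, because std::abort() takes down the whole interpreter process.
#define MY_CHECK_OK(expr)                                        \
  do {                                                           \
    const Status _s = (expr);                                    \
    if (!_s.ok) {                                                \
      std::cerr << "Check failed: " << _s.message << std::endl;  \
      std::abort(); /* kills the embedding Python process */     \
    }                                                            \
  } while (0)

Status Ok() { return {true, ""}; }
Status Fail() { return {false, "bad op"}; }

int main() {
  MY_CHECK_OK(Ok());    // passes silently
  MY_CHECK_OK(Fail());  // logs and aborts here
}
```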
Here's what I found:

- Absl's `CHECK_OK` and friends [terminate the program by design](https://github.com/abseil/abseil-cpp/blob/e968256406fd7898d7fde880e31e54b041d32a7e/absl/log/check.h#L153-L155) because [Google bans exceptions internally](https://abseil.io/tips/76).
- Exceptions are what we want in this case because [pybind ...
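A sketch of the throwing alternative (again with hypothetical names, not a concrete API proposal): a macro that throws `std::runtime_error` lets pybind11's built-in exception translation surface the failure as a Python `RuntimeError` that user code can catch, instead of killing the interpreter:

```cpp
#include <stdexcept>
#include <string>

#include <pybind11/pybind11.h>

// Hypothetical stand-in for absl::Status.
struct Status {
  bool ok;
  std::string message;
};

// Throw instead of aborting; pybind11 translates std::runtime_error thrown
// in bound code into a Python RuntimeError by default.
#define MY_THROW_IF_ERROR(expr)               \
  do {                                        \
    const Status _s = (expr);                 \
    if (!_s.ok) {                             \
      throw std::runtime_error(_s.message);   \
    }                                         \
  } while (0)

Status DoSomething() { return {false, "unsupported op"}; }

void RunOp() { MY_THROW_IF_ERROR(DoSomething()); }

PYBIND11_MODULE(example, m) {
  // From Python, `example.run_op()` now raises
  //   RuntimeError: unsupported op
  // rather than terminating the process.
  m.def("run_op", &RunOp);
}
```

That keeps the error catchable from Python while still failing fast at the C++ call site.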
@alanwaketan do you normally use the HuggingFace `Trainer`? I remember people have had issues using it with XLA before. I ran through two of the [example](https://huggingface.co/docs/transformers/en/training) [tutorials](https://huggingface.co/docs/transformers/en/run_scripts) last week while...
`xla/stream_executor/tpu/tpu_library_init_fns.inc` looks like it's from a very outdated `libtpu` to me. We dropped support for StreamExecutor in PJRT about a year ago IIRC. Is this on the current Kaggle TPU VM environment?