Will Cromar
This test specifically checks for the PJRT traces I added in this PR. It's also more isolated (no dependency on another test) and uses a much smaller graph,...
1. @JackCaoG is correct. Each process only has two cores, so this is expected. I also expect starting multiple profiler servers and tracing all of them to show more cores,...
I'm working on a broader refactoring of our build process in #4044, especially with an eye toward reducing the size of the final release image. I don't see anything in...
We'll keep the current build scripts through 1.13. I believe the current build process uses `buster` like my WIP script (except for the current GPU build), so I don't expect...
Yeah, I expect the issue is that the PyTorch version in the notebook is not compatible with the latest versions of `diffusers` etc.
Crashing on this operation is one thing. Falling back to CPU to avoid the crash seems reasonable to me in this case. The _way_ we crash is the real issue IMO...
I reproduced the error to get the line number where we terminate the process: https://github.com/pytorch/xla/blob/6cc5c3819c09c7b1b4ca4927d7fa65133f95b41c/torch_xla/csrc/runtime/debug_macros.h#L20 So this has to do with our implementation of `XLA_CHECK_OK`. Ideally we want to make...
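For illustration, here's a minimal sketch of the abort-on-failure pattern (assuming an `absl::Status`-like type; `Status` and `MY_CHECK_OK` are hypothetical names, not the real macros in `debug_macros.h`):

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-in for absl::Status.
struct Status {
  bool ok;
  std::string message;
};

// On failure, log and terminate. There is no way for Python code to
// recover, because std::abort() takes down the whole interpreter process.
#define MY_CHECK_OK(expr)                                        \
  do {                                                           \
    const Status _s = (expr);                                    \
    if (!_s.ok) {                                                \
      std::cerr << "Check failed: " << _s.message << std::endl;  \
      std::abort(); /* kills the embedding Python process */     \
    }                                                            \
  } while (0)

Status Ok() { return {true, ""}; }
Status Fail() { return {false, "bad op"}; }

int main() {
  MY_CHECK_OK(Ok());    // passes silently
  MY_CHECK_OK(Fail());  // logs and aborts here
}
```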
Here's what I found:

- Absl's `CHECK_OK` and friends [terminate the program by design](https://github.com/abseil/abseil-cpp/blob/e968256406fd7898d7fde880e31e54b041d32a7e/absl/log/check.h#L153-L155) because [Google bans exceptions internally](https://abseil.io/tips/76).
- Exceptions are what we want in this case because [pybind ...
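A sketch of the throwing alternative (again with hypothetical names, not a concrete API proposal): a macro that throws `std::runtime_error` lets pybind11's built-in exception translation surface the failure as a Python `RuntimeError` that user code can catch, instead of killing the interpreter:

```cpp
#include <stdexcept>
#include <string>

#include <pybind11/pybind11.h>

// Hypothetical stand-in for absl::Status.
struct Status {
  bool ok;
  std::string message;
};

// Throw instead of aborting; pybind11 translates std::runtime_error thrown
// in bound code into a Python RuntimeError by default.
#define MY_THROW_IF_ERROR(expr)               \
  do {                                        \
    const Status _s = (expr);                 \
    if (!_s.ok) {                             \
      throw std::runtime_error(_s.message);   \
    }                                         \
  } while (0)

Status DoSomething() { return {false, "unsupported op"}; }

void RunOp() { MY_THROW_IF_ERROR(DoSomething()); }

PYBIND11_MODULE(example, m) {
  // From Python, `example.run_op()` now raises
  //   RuntimeError: unsupported op
  // rather than terminating the process.
  m.def("run_op", &RunOp);
}
```

That keeps the error catchable from Python while still failing fast at the C++ call site.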
@alanwaketan do you normally use the HuggingFace `Trainer`? I remember people have had issues using it with XLA before. I ran through two of the [example](https://huggingface.co/docs/transformers/en/training) [tutorials](https://huggingface.co/docs/transformers/en/run_scripts) last week while...
`xla/stream_executor/tpu/tpu_library_init_fns.inc` looks like it's from a very outdated `libtpu` to me. We dropped support for StreamExecutor in PJRT about a year ago IIRC. Is this on the current Kaggle TPU VM environment?