jeffhataws

Results: 9 issues by jeffhataws

This PR fixes the "RuntimeError: No CUDA GPUs are available" error when running with the --bf16 option on Neuron. Related PRs: https://github.com/huggingface/transformers/pull/20684 https://github.com/huggingface/transformers/pull/22300 # What does this PR do? While PR #22300...

This PR updates the XLA ZeRO1 implementation to use [all-gather coalesced](https://github.com/pytorch/xla/pull/5950) and [reduce-scatter coalesced](https://github.com/pytorch/xla/pull/5956).

backport_2.2
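
Below is a minimal, hypothetical sketch of the per-tensor collectives that ZeRO1 sharding relies on, written with torch_xla's public `xm.reduce_scatter`/`xm.all_gather` calls; it is not code from the PR. The coalesced variants introduced in the linked PRs batch several tensors into a single collective, which this sketch does not show.

```python
# Sketch only: the per-tensor collectives that ZeRO1 gradient/parameter sharding uses.
# Assumes a multi-device torch_xla job launched with torch_xla's multiprocessing runner,
# and that grad's dim 0 is divisible by the world size.
import torch
import torch_xla.core.xla_model as xm

def zero1_style_step(grad: torch.Tensor) -> torch.Tensor:
    device = xm.xla_device()
    world_size = xm.xrt_world_size()
    grad = grad.to(device)

    # reduce-scatter: each rank keeps one 1/world_size shard of the summed gradient.
    shard = xm.reduce_scatter(
        xm.REDUCE_SUM, grad, scale=1.0 / world_size,
        scatter_dim=0, shard_count=world_size)

    # ... the optimizer would update only its local shard here ...

    # all-gather: reassemble the full updated tensor from the per-rank shards.
    return xm.all_gather(shard, dim=0)
```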

## 🐛 Bug The ZeRO1 test test/test_zero1.py has been disabled for GPU since version 2.1 (https://github.com/pytorch/xla/pull/4912). We should re-enable it for GPU to restore coverage for reduce-scatter/all-gather. When I tried with torch/xla...
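
For context, here is a hedged sketch of what test/test_zero1.py exercises: wrapping a model's parameters in torch_xla's ZeRO1 optimizer. The constructor arguments shown (parameters, the wrapped optimizer class, and its keyword defaults) are assumed from the test's usual usage pattern, not verified against a specific release.

```python
# Hedged sketch of ZeRO1 optimizer usage in torch_xla (argument order/names assumed).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.zero_redundancy_optimizer import ZeroRedundancyOptimizer

device = xm.xla_device()
model = nn.Linear(16, 16).to(device)

# Wrap a base optimizer; ZeRO1 shards optimizer state across ranks and uses
# reduce-scatter/all-gather under the hood, which is the coverage the disabled
# GPU test would restore.
optimizer = ZeroRedundancyOptimizer(model.parameters(), torch.optim.SGD, lr=0.01)

loss = model(torch.randn(4, 16, device=device)).sum()
loss.backward()
optimizer.step()
xm.mark_step()
```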

## 🐛 Bug We use XLA_DISABLE_FUNCTIONALIZATION=1 in torch-xla 2.1 to work around the trace slowdown issue (https://github.com/pytorch/xla/issues/6294). However, we are encountering a strange issue with the reproduction code in the next...
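
For illustration, a minimal sketch of applying the workaround mentioned above; it assumes the flag is read from the environment when torch_xla initializes, so it is set before the import.

```python
# Sketch of the workaround: disable functionalization via the environment flag
# before torch_xla is imported/initialized (assumption: the flag is read at init).
import os
os.environ["XLA_DISABLE_FUNCTIONALIZATION"] = "1"

import torch
import torch_xla.core.xla_model as xm

x = torch.randn(4, 4, device=xm.xla_device())
y = x * 2 + 1      # traced on the lazy-tensor path without the functionalization layer
xm.mark_step()     # cut and execute the pending graph
print(y.cpu())
```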

## 🐛 Bug When functionalization is on (XLA_DISABLE_FUNCTIONALIZATION=0), I see that there are fewer aliased tensors. Jack has a patch to increase the number of aliased tensors https://github.com/pytorch/xla/commit/e3fc03314dab5f44e3ed9ccbba6c15fbca3285cd . However,...

## 🐛 Bug With torch-xla v2.8, the Neuron team is getting "Check failed: state Expected an array shape." errors when running many training tests that use reduce-scatter. These errors were...

bug
xla:cpu

## 🐛 Bug We have multiple unit tests (Neuron inference trace analyzer/bucketing) that failed with ``` #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 #1 0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so...

bug
xla:neuron

## 🐛 Bug When running a small example that splits a 2D array along the second dimension, the resulting tensors don't have the expected data. The results are different between CPU...

pytorch divergence
functionalization-disabled
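
A hypothetical minimal reproduction in the spirit of the report above (not the reporter's exact script): split a 2D tensor along dim 1 on CPU and on the XLA device, then compare the chunks.

```python
# Hypothetical repro sketch: split a 2D tensor along the second dimension and
# compare CPU vs. XLA results (the report says they diverge).
import torch
import torch_xla.core.xla_model as xm

cpu = torch.arange(12, dtype=torch.float32).reshape(3, 4)
xla = cpu.to(xm.xla_device())

cpu_parts = torch.split(cpu, 2, dim=1)   # two (3, 2) chunks
xla_parts = torch.split(xla, 2, dim=1)
xm.mark_step()

for c, x in zip(cpu_parts, xla_parts):
    print(torch.allclose(c, x.cpu()))    # expected: True for every chunk
```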

## 🐛 Bug When running the new DTensor placement test test/spmd/test_dtensor_integration3.py with functionalization on (default), I get the following error: ``` ====================================================================== ERROR: test_xla_placement (__main__.DTensorIntegrationTest3) ---------------------------------------------------------------------- Traceback (most recent call...

bug
distributed