jeffhataws

Results: 9 issues by jeffhataws

This PR fixes the "RuntimeError: No CUDA GPUs are available" error when running with the --bf16 option on Neuron. Related PRs: https://github.com/huggingface/transformers/pull/20684 https://github.com/huggingface/transformers/pull/22300 # What does this PR do? While PR #22300...

This PR updates the XLA ZeRO1 implementation to use [all-gather coalesced](https://github.com/pytorch/xla/pull/5950) and [reduce-scatter coalesced](https://github.com/pytorch/xla/pull/5956).

backport_2.2
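
Below is a minimal, hypothetical sketch of the per-tensor collectives that ZeRO1 sharding relies on, written with torch_xla's public `xm.reduce_scatter`/`xm.all_gather` calls; it is not code from the PR. The coalesced variants introduced in the linked PRs batch several tensors into a single collective, which this sketch does not show.

```python
# Sketch only: the per-tensor collectives that ZeRO1 gradient/parameter sharding uses.
# Assumes a multi-device torch_xla job launched with torch_xla's multiprocessing runner,
# and that grad's dim 0 is divisible by the world size.
import torch
import torch_xla.core.xla_model as xm

def zero1_style_step(grad: torch.Tensor) -> torch.Tensor:
    device = xm.xla_device()
    world_size = xm.xrt_world_size()
    grad = grad.to(device)

    # reduce-scatter: each rank keeps one 1/world_size shard of the summed gradient.
    shard = xm.reduce_scatter(
        xm.REDUCE_SUM, grad, scale=1.0 / world_size,
        scatter_dim=0, shard_count=world_size)

    # ... the optimizer would update only its local shard here ...

    # all-gather: reassemble the full updated tensor from the per-rank shards.
    return xm.all_gather(shard, dim=0)
```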

## 🐛 Bug The ZeRO1 test test/test_zero1.py has been disabled for GPU since version 2.1 (https://github.com/pytorch/xla/pull/4912). We should re-enable it for GPU to restore coverage for reduce-scatter/all-gather. When I tried with torch/xla...
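
For context, here is a hedged sketch of what test/test_zero1.py exercises: wrapping a model's parameters in torch_xla's ZeRO1 optimizer. The constructor arguments shown (parameters, the wrapped optimizer class, and its keyword defaults) are assumed from the test's usual usage pattern, not verified against a specific release.

```python
# Hedged sketch of ZeRO1 optimizer usage in torch_xla (argument order/names assumed).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.zero_redundancy_optimizer import ZeroRedundancyOptimizer

device = xm.xla_device()
model = nn.Linear(16, 16).to(device)

# Wrap a base optimizer; ZeRO1 shards optimizer state across ranks and uses
# reduce-scatter/all-gather under the hood, which is the coverage the disabled
# GPU test would restore.
optimizer = ZeroRedundancyOptimizer(model.parameters(), torch.optim.SGD, lr=0.01)

loss = model(torch.randn(4, 16, device=device)).sum()
loss.backward()
optimizer.step()
xm.mark_step()
```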

## 🐛 Bug We use XLA_DISABLE_FUNCTIONALIZATION=1 in torch-xla 2.1 to work around the trace slowdown issue (https://github.com/pytorch/xla/issues/6294). However, we are encountering a strange issue with the reproduction code in the next...
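
For illustration, a minimal sketch of applying the workaround mentioned above; it assumes the flag is read from the environment when torch_xla initializes, so it is set before the import.

```python
# Sketch of the workaround: disable functionalization via the environment flag
# before torch_xla is imported/initialized (assumption: the flag is read at init).
import os
os.environ["XLA_DISABLE_FUNCTIONALIZATION"] = "1"

import torch
import torch_xla.core.xla_model as xm

x = torch.randn(4, 4, device=xm.xla_device())
y = x * 2 + 1      # traced on the lazy-tensor path without the functionalization layer
xm.mark_step()     # cut and execute the pending graph
print(y.cpu())
```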

## 🐛 Bug When functionalization is on (XLA_DISABLE_FUNCTIONALIZATION=0), I see that there are fewer aliased tensors. Jack has a patch to increase the number of aliased tensors https://github.com/pytorch/xla/commit/e3fc03314dab5f44e3ed9ccbba6c15fbca3285cd . However,...

## 🐛 Bug With torch-xla v2.8, the Neuron team is getting "Check failed: state Expected an array shape." errors when running many training tests that use reduce-scatter. These errors were...

bug
xla:cpu

## 🐛 Bug We have multiple unit tests (Neuron inference trace analyzer/bucketing) that failed with ``` #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 #1 0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so...

bug
xla:neuron

## 🐛 Bug When running a small example that splits a 2D array along the second dimension, the resulting tensors don't have the expected data. The results are different between CPU...

pytorch divergence
functionalization-disabled
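
A hypothetical minimal reproduction in the spirit of the report above (not the reporter's exact script): split a 2D tensor along dim 1 on CPU and on the XLA device, then compare the chunks.

```python
# Hypothetical repro sketch: split a 2D tensor along the second dimension and
# compare CPU vs. XLA results (the report says they diverge).
import torch
import torch_xla.core.xla_model as xm

cpu = torch.arange(12, dtype=torch.float32).reshape(3, 4)
xla = cpu.to(xm.xla_device())

cpu_parts = torch.split(cpu, 2, dim=1)   # two (3, 2) chunks
xla_parts = torch.split(xla, 2, dim=1)
xm.mark_step()

for c, x in zip(cpu_parts, xla_parts):
    print(torch.allclose(c, x.cpu()))    # expected: True for every chunk
```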

## 🐛 Bug When running the new DTensor placement test test/spmd/test_dtensor_integration3.py with functionalization on (default), I get the following error: ``` ====================================================================== ERROR: test_xla_placement (__main__.DTensorIntegrationTest3) ---------------------------------------------------------------------- Traceback (most recent call...

bug
distributed