[v2.8] Neuron inference trace analyzer/bucketing unit tests hanging at GetParameterIdTensorMapping/TransferFromDevice
🐛 Bug
We have multiple unit tests (Neuron inference trace analyzer/bucketing) that hang with the following backtrace:
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#2 0x0000764de3921122 in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#3 0x0000764de3921343 in AbslInternalPerThreadSemWait_lts_20230802 ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#4 0x0000764de3923053 in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#5 0x0000764dd8822cf7 in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) [clone .cold] ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#6 0x0000764dd8822d0c in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#7 0x0000764de3924552 in absl::lts_20230802::Notification::WaitForNotification() const ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#8 0x0000764de2067508 in tsl::BlockUntilReady(tsl::AsyncValue*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#9 0x0000764dd8f7b20e in torch_xla::runtime::PjRtComputationClient::TransferFromDevice(absl::lts_20230802::Span<std::shared_ptr<torch_xla::runtime::ComputationClient::Data> const>) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#10 0x0000764dd898494c in torch_xla::(anonymous namespace)::PyLoweringContext::GetParameterIdTensorMapping() ()
After bisecting the torch-xla nightlies, I narrowed the regression down to commit https://github.com/pytorch/xla/commit/8dc5b496b05e7a25dc721fd23851480850ae3935. Reverting this commit resolves the hang.
To Reproduce
Will work on a self-contained unit test to demonstrate the hang, since the failing unit tests depend on torch-neuronx. A rough sketch of what such a repro might look like is below.
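This sketch is derived from the backtrace above (PyLoweringContext::GetParameterIdTensorMapping -> TransferFromDevice). The Python binding names used here (torch_xla._XLAC.lowering.LoweringContext, build, parameter_id_tensor_mapping) are assumptions taken from the upstream torch_xla test suite, and the snippet has not yet been confirmed to reproduce the hang:

```python
# Hypothetical repro sketch (assumed binding names, not yet verified to hang).
import torch
import torch_xla

device = torch_xla.device()

# Build a small lazy graph with device-resident inputs.
a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)
out = torch.matmul(a, b) + a

# Lower the graph and request the parameter-id -> tensor mapping. Per the
# backtrace, this path ends up in PjRtComputationClient::TransferFromDevice,
# which is where the hang is observed on the Neuron backend.
ctx = torch_xla._XLAC.lowering.LoweringContext("repro")
ctx.build([out])
mapping = ctx.parameter_id_tensor_mapping()
print(list(mapping.keys()))
```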
Expected behavior
No hang
Environment
- Reproducible on XLA backend: Neuron
- torch_xla version: 2.8
Additional context
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-analyze.html#torch-neuronx-analyze-api
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html#torch-neuronx-trace-api
When did these tests start failing? https://github.com/pytorch/xla/pull/8849 was merged March 26th.
Also, could you explain a little more about what is causing the error? My understanding from the original PR (https://github.com/pytorch/xla/pull/8849) is that tensors would be copied when needed.
@ysiraichi might be able to comment on the impact of rolling this back, as you might have more context on the issue that motivated the original PR.
We suspect it could be a PJRT async issue, similar to this note in the original PR: https://github.com/pytorch/xla/pull/8849#issuecomment-2749554914
The impact here would be that we could see performance regressions. Basically, we were skipping creating another tensor (i.e. copying) whenever the tensor was already a contiguous tensor on CPU. Now, we won't skip it anymore, and tensor data will be copied every time.
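To make the trade-off concrete, here is a minimal Python sketch of the decision being described; the real logic is C++ inside torch_xla's runtime, and the names prepare_buffer and skip_copy are hypothetical:

```python
import torch

def prepare_buffer(tensor: torch.Tensor, skip_copy: bool) -> torch.Tensor:
    """Illustrative only: return the host tensor whose storage gets handed on."""
    if skip_copy and tensor.device.type == "cpu" and tensor.is_contiguous():
        # With https://github.com/pytorch/xla/pull/8849: reuse the existing
        # contiguous CPU storage and skip the extra copy.
        return tensor
    # With the revert: always materialize a fresh contiguous copy, i.e.
    # tensor data is copied every time.
    return tensor.clone(memory_format=torch.contiguous_format)
```

Under the revert, the copy branch always runs, which is the expected source of the performance regression.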
Thanks, agreed. Will continue to debug this; the revert is just a mitigation.