[v2.8] Neuron inference trace analyzer/bucketing unit tests hanging at GetParameterIdTensorMapping/TransferFromDevice
🐛 Bug
We have multiple unit tests (Neuron inference trace analyzer/bucketing) that hang with the following backtrace:
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000764de392105c in absl::lts_20230802::synchronization_internal::FutexWaiter::WaitUntil(std::atomic<int>*, int, absl::lts_20230802::synchronization_internal::KernelTimeout) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#2 0x0000764de3921122 in absl::lts_20230802::synchronization_internal::FutexWaiter::Wait(absl::lts_20230802::synchronization_internal::KernelTimeout) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#3 0x0000764de3921343 in AbslInternalPerThreadSemWait_lts_20230802 ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#4 0x0000764de3923053 in absl::lts_20230802::Mutex::Block(absl::lts_20230802::base_internal::PerThreadSynch*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#5 0x0000764dd8822cf7 in absl::lts_20230802::Mutex::LockSlowWithDeadline(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, absl::lts_20230802::synchronization_internal::KernelTimeout, int) [clone .cold] ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#6 0x0000764dd8822d0c in absl::lts_20230802::Mutex::LockSlow(absl::lts_20230802::MuHowS const*, absl::lts_20230802::Condition const*, int) () from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#7 0x0000764de3924552 in absl::lts_20230802::Notification::WaitForNotification() const ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#8 0x0000764de2067508 in tsl::BlockUntilReady(tsl::AsyncValue*) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#9 0x0000764dd8f7b20e in torch_xla::runtime::PjRtComputationClient::TransferFromDevice(absl::lts_20230802::Span<std::shared_ptr<torch_xla::runtime::ComputationClient::Data> const>) ()
from /home/ubuntu/aws_neuron_venv/lib/python3.10/site-packages/_XLAC.cpython-310-x86_64-linux-gnu.so
#10 0x0000764dd898494c in torch_xla::(anonymous namespace)::PyLoweringContext::GetParameterIdTensorMapping() ()
After bisecting the torch-xla nightlies, I narrowed the regression down to commit https://github.com/pytorch/xla/commit/8dc5b496b05e7a25dc721fd23851480850ae3935. Reverting this commit resolves the hang.
To Reproduce
Will work on a self-contained unit test to demonstrate the hang, since the failing unit tests depend on torch-neuronx. A rough sketch of what such a repro might look like is below.
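This sketch is derived from the backtrace above (PyLoweringContext::GetParameterIdTensorMapping -> TransferFromDevice). The Python binding names used here (torch_xla._XLAC.lowering.LoweringContext, build, parameter_id_tensor_mapping) are assumptions taken from the upstream torch_xla test suite, and the snippet has not yet been confirmed to reproduce the hang:

```python
# Hypothetical repro sketch (assumed binding names, not yet verified to hang).
import torch
import torch_xla

device = torch_xla.device()

# Build a small lazy graph with device-resident inputs.
a = torch.randn(4, 4, device=device)
b = torch.randn(4, 4, device=device)
out = torch.matmul(a, b) + a

# Lower the graph and request the parameter-id -> tensor mapping. Per the
# backtrace, this path ends up in PjRtComputationClient::TransferFromDevice,
# which is where the hang is observed on the Neuron backend.
ctx = torch_xla._XLAC.lowering.LoweringContext("repro")
ctx.build([out])
mapping = ctx.parameter_id_tensor_mapping()
print(list(mapping.keys()))
```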
Expected behavior
No hang
Environment
- Reproducible on XLA backend: Neuron
- torch_xla version: 2.8
Additional context
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-analyze.html#torch-neuronx-analyze-api
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/api-reference-guide/inference/api-torch-neuronx-trace.html#torch-neuronx-trace-api
When did these tests start failing? https://github.com/pytorch/xla/pull/8849 was merged March 26th.
Also, could you explain a little more about what is causing the error? My understanding from the original PR (https://github.com/pytorch/xla/pull/8849) is that tensors would be copied when needed.
@ysiraichi might be able to comment on the impact of rolling this back, as you might have more context on the issue that motivated the original PR.
We suspect it could be a PJRT async issue, similar to this note in the original PR: https://github.com/pytorch/xla/pull/8849#issuecomment-2749554914
The impact here would be that we could see performance regressions. Basically, we were skipping creating another tensor (i.e. copying) whenever the tensor was already a contiguous tensor on CPU. Now, we won't skip it anymore, and tensor data will be copied every time.
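To make the trade-off concrete, here is a minimal Python sketch of the decision being described; the real logic is C++ inside torch_xla's runtime, and the names prepare_buffer and skip_copy are hypothetical:

```python
import torch

def prepare_buffer(tensor: torch.Tensor, skip_copy: bool) -> torch.Tensor:
    """Illustrative only: return the host tensor whose storage gets handed on."""
    if skip_copy and tensor.device.type == "cpu" and tensor.is_contiguous():
        # With https://github.com/pytorch/xla/pull/8849: reuse the existing
        # contiguous CPU storage and skip the extra copy.
        return tensor
    # With the revert: always materialize a fresh contiguous copy, i.e.
    # tensor data is copied every time.
    return tensor.clone(memory_format=torch.contiguous_format)
```

Under the revert, the copy branch always runs, which is the expected source of the performance regression.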
Thanks, agreed. Will continue to debug this; the revert is just a mitigation.