
Can't print XLA tensors or call `cpu()`.

Open · ysiraichi opened this issue 8 months ago · 5 comments

🐛 Bug

Recently, I've been seeing the error below whenever I run the following with PJRT_DEVICE=CPU (it works fine if I use CUDA).

>>> import torch
>>> import torch_xla
>>> x = torch.rand(5, device="xla")
>>> x.cpu()

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1745001283.205982     796 concurrent_vector.h:70] Check failed: index < state.size (65534 vs. 10)
*** Check failure stack trace: ***
    @     0x76d22da0e78d  absl::lts_20230802::log_internal::LogMessage::PrepareToDie()
    @     0x76d22da0e7fd  absl::lts_20230802::log_internal::LogMessage::SendToLog()
    @     0x76d22da0e280  absl::lts_20230802::log_internal::LogMessage::Flush()
    @     0x76d22da0eacc  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x76d218baadf4  tsl::internal::ConcurrentVector<>::operator[]()
    @     0x76d218baa1c8  tsl::AsyncValue::GetTypeInfo()
    @     0x76d218baa703  tsl::AsyncValue::Destroy()
    @     0x76d218baa3e7  tsl::AsyncValue::DropRef()
    @     0x76d218baa085  tsl::AsyncValue::DropRef()
    @     0x76d218baaeeb  tsl::RCReference<>::~RCReference()
    @     0x76d2194eb69a  tsl::AsyncValueRef<>::~AsyncValueRef()
    @     0x76d21c3e3f81  xla::cpu::ThunkExecutor::ExecuteSequential()
    @     0x76d21c3e3737  xla::cpu::ThunkExecutor::Execute()
    @     0x76d2194da10b  xla::TfrtCpuExecutable::ExecuteHelper()
    @     0x76d2194dd607  xla::TfrtCpuExecutable::ExecuteSharded()
    @     0x76d2194a5751  xla::PjRtLoadedExecutable::ExecuteSharded()
    @     0x76d21949e58b  torch_xla::runtime::PjRtComputationClient::ExecuteComputation()
    @     0x76d218eeeda6  torch_xla::XLAGraphExecutor::ScheduleSyncTensorsGraph()::{lambda()#1}::operator()()
    @     0x76d218ef6ab2  std::__invoke_impl<>()
    @     0x76d218ef6321  std::__invoke_r<>()
    @     0x76d218ef5b8d  std::_Function_handler<>::_M_invoke()
    @     0x76d3e7711c1c  std::function<>::operator()()
    @     0x76d3d5c8a0c9  torch::lazy::MultiWait::Complete()
    @     0x76d3d5c89e26  torch::lazy::MultiWait::Completer()::{lambda()#1}::operator()()
    @     0x76d3d5c8a85e  std::__invoke_impl<>()
    @     0x76d3d5c8a615  std::__invoke_r<>()
    @     0x76d3d5c8a41b  std::_Function_handler<>::_M_invoke()
    @     0x76d218a3c70e  std::function<>::operator()()
    @     0x76d22d71c0ca  tsl::thread::EigenEnvironment::ExecuteTask()
    @     0x76d22d71cef2  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x76d22d71c378  Eigen::ThreadPoolTempl<>::ThreadPoolTempl()::{lambda()#1}::operator()()
    @     0x76d22d71f06a  std::__invoke_impl<>()
    @     0x76d22d71ea02  std::__invoke_r<>()
    @     0x76d22d71da7d  std::_Function_handler<>::_M_invoke()
    @     0x76d218a3c70e  std::function<>::operator()()
    @     0x76d22d71be7b  tsl::thread::EigenEnvironment::CreateThread()::{lambda()#1}::operator()()
    @     0x76d22d71f42b  std::__invoke_impl<>()
    @     0x76d22d71f408  std::__invoke<>()
    @     0x76d22d71f3e5  std::invoke<>()
    @     0x76d22d71f3a6  absl::lts_20230802::internal_any_invocable::InvokeR<>()
    @     0x76d22d71f1ad  absl::lts_20230802::internal_any_invocable::RemoteInvoker<>()
    @     0x76d2194ef8ed  absl::lts_20230802::internal_any_invocable::Impl<>::operator()()
    @     0x76d22d6fa5ae  tsl::(anonymous namespace)::PThread::ThreadFn()
    @     0x76d3e9cfcea7  start_thread
Aborted (core dumped)
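
Reading the trace: the abort happens while dropping references to async values right after xla::cpu::ThunkExecutor::ExecuteSequential() returns, i.e. while torch_xla::runtime::PjRtComputationClient::ExecuteComputation() is executing the compiled graph for the device-to-host transfer.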

Environment

  • Reproducible on XLA backend [CPU/TPU/CUDA]: CPU
  • torch_xla version: 0bb4f6f01931fba78b18505d0414a85ae51b8171
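
For convenience, here is the same repro as a standalone script (a minimal sketch; printing the tensor instead of calling .cpu() aborts the same way, since both force a device-to-host transfer):

# repro.py -- run with: PJRT_DEVICE=CPU python repro.py
import torch
import torch_xla  # registers the "xla" device

x = torch.rand(5, device="xla")  # lazy tensor on the XLA:CPU device
x.cpu()  # forces graph execution and the transfer; aborts here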

ysiraichi · Apr 18 '25 18:04

@tengyifei @lsy323 @qihqi @bhavya01 Any ideas what might be happening here?

ysiraichi · Apr 23 '25 19:04

I am also seeing this now:

#8  0x00007ffcf27eca28 in tsl::AsyncValue::GetTypeInfo (this=0x55555bf9e9c0) at external/xla/xla/tsl/concurrency/async_value.h:475
(gdb) p *this
$1 = {static kUnknownTypeId = 0, refcount_ = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 1}, static is_always_lock_free = true}, kind_ = tsl::AsyncValue::Kind::kConcrete, has_vtable_ = false, is_refcounted_ = false, type_id_ = 65535, waiters_and_state_ = {static _S_min_alignment = 8, static _S_alignment = 8, _M_i = {static kStateMask = 3, static kPointerMask = 18446744073709551612, value = 2}, static is_always_lock_free = <optimized out>}, static kDataOffset = 64, static total_allocated_async_values_ = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 12}, static is_always_lock_free = true}}

It may be one of the async values being dereferenced at https://github.com/openxla/xla/blob/86b2f51f8000326813fd9742aaac6bd1868cc19b/xla/pjrt/cpu/cpu_client.cc#L1446. Note type_id_ = 65535 in the dump above: if GetTypeInfo() looks up the global type-info table at type_id_ - 1, that is exactly the failed check (index 65534 vs. size 10), which suggests the AsyncValue's type id is uninitialized or corrupted.

This seems like a relatively serious issue: any tensor print can trigger it, which hinders CPU development.

rpsilva-aws · Apr 28 '25 21:04

Hey @ysiraichi, do we have a path forward on this one? It would be great to be able to use the CPU backend locally in the container.

rpsilva-aws · May 13 '25 07:05

Not really. I could not reproduce it on CI (#9048), and I haven't had much time to investigate it myself.

ysiraichi · May 13 '25 18:05

It seems the problem occurs when torch_xla is built with DEBUG=1.
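
(Assuming DEBUG=1 here refers to torch_xla's debug build flag, e.g. DEBUG=1 python setup.py develop, that would also line up with the CI run in #9048 not reproducing it, if CI builds without debug assertions.)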

jeffhataws · Jun 11 '25 20:06