[Test] Fix zero1 gpu test
Addresses https://github.com/pytorch/xla/issues/6260
@jeffhataws @JackCaoG can you trigger the CI?
Starting.
The failure seems to be real on both CPU and GPU:
======================================================================
ERROR: test_zero1 (__main__.XlaZeRO1Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2739, in wrapper
method(*args, **kwargs)
File "/tmp/pytorch/xla/test/test_zero1.py", line 55, in test_zero1
xm.mark_step()
File "/opt/conda/lib/python3.8/site-packages/torch_xla-2.3.0+git6530adc-py3.8-linux-x86_64.egg/torch_xla/core/xla_model.py", line 907, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: ./torch_xla/csrc/runtime/pjrt_computation_client.h:153 : Check failed: HasValue()
*** Begin stack trace ***
tsl::CurrentStackTrace[abi:cxx11]()
torch_xla::runtime::PjRtComputationClient::PjRtData::GetHandle()
torch::lazy::LazyGraphExecutor::RunPostOrder(std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&, torch::lazy::LazyGraphExecutor::SyncTensorCollection*)
torch_xla::XLAGraphExecutor::RunPostOrder(std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&, torch::lazy::LazyGraphExecutor::SyncTensorCollection*)
torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, bool, bool, bool)
torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool)
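For reference, here is a minimal sketch of roughly what the failing test exercises (not the exact test; the model, shapes, and optimizer arguments are illustrative, and the ZeRO-1 wrapper's signature may differ slightly across versions):

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.zero_redundancy_optimizer import ZeroRedundancyOptimizer

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)

# Wrap a base optimizer class with the ZeRO-1 sharded optimizer.
opt = ZeroRedundancyOptimizer(model.parameters(), torch.optim.SGD, lr=0.01)

loss = model(torch.randn(8, 8, device=device)).sum()
loss.backward()
opt.step()

# This is where the test trips: materializing the lazy graph hits
# "Check failed: HasValue()" because an input buffer has already been deleted.
xm.mark_step()
```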
Any idea what the actual error means:
buffer with shape f32[8] on device CPU:0 is deleted
The CPU test passed in the nightly docker image.
Could that buffer be donated?
If it were donated it would show as deleted, but the debug log I saw showed it as null.
Hmm, "buffer with shape f32[8] on device CPU:0 is deleted" means the buffer has been aliased to an output. Can you try running it with XLA_ENABLE_PARAM_ALIASING=0? You can also try rebasing this PR, since I recently changed how aliasing works.
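If it helps, one way to take aliasing out of the picture for a single run is to set the environment variable before torch_xla initializes the runtime (a sketch, not the exact test harness):

```python
import os

# Disable XLA input/output buffer aliasing before torch_xla is imported,
# so donated-buffer behavior cannot delete the optimizer's input buffers.
os.environ["XLA_ENABLE_PARAM_ALIASING"] = "0"

import torch_xla.core.xla_model as xm  # noqa: E402
```

Equivalently, the variable can be set on the command line when invoking the test, e.g. XLA_ENABLE_PARAM_ALIASING=0 python test/test_zero1.py.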