[Test] Fix zero1 gpu test
Addresses https://github.com/pytorch/xla/issues/6260
@jeffhataws @JackCaoG can you trigger the CI?
Starting.
The failure seems to be real on both CPU and GPU:
======================================================================
ERROR: test_zero1 (__main__.XlaZeRO1Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2739, in wrapper
method(*args, **kwargs)
File "/tmp/pytorch/xla/test/test_zero1.py", line 55, in test_zero1
xm.mark_step()
File "/opt/conda/lib/python3.8/site-packages/torch_xla-2.3.0+git6530adc-py3.8-linux-x86_64.egg/torch_xla/core/xla_model.py", line 907, in mark_step
torch_xla._XLAC._xla_step_marker(
RuntimeError: ./torch_xla/csrc/runtime/pjrt_computation_client.h:153 : Check failed: HasValue()
*** Begin stack trace ***
tsl::CurrentStackTrace[abi:cxx11]()
torch_xla::runtime::PjRtComputationClient::PjRtData::GetHandle()
torch::lazy::LazyGraphExecutor::RunPostOrder(std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&, torch::lazy::LazyGraphExecutor::SyncTensorCollection*)
torch_xla::XLAGraphExecutor::RunPostOrder(std::vector<torch::lazy::Value, std::allocator<torch::lazy::Value> > const&, torch::lazy::LazyGraphExecutor::SyncTensorCollection*)
torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, torch::lazy::LazyGraphExecutor::SyncTensorsConfig const&, bool)
torch_xla::XLAGraphExecutor::SyncTensorsGraph(std::vector<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> >, std::allocator<c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > > >*, absl::lts_20230802::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, bool, bool, bool)
torch_xla::XLAGraphExecutor::SyncLiveTensorsGraph(torch::lazy::BackendDevice const*, c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, bool)
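For reference, here is a minimal sketch of roughly what the failing test exercises (not the exact test; the model, shapes, and optimizer arguments are illustrative, and the ZeRO-1 wrapper's signature may differ slightly across versions):

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.distributed.zero_redundancy_optimizer import ZeroRedundancyOptimizer

device = xm.xla_device()
model = torch.nn.Linear(8, 8).to(device)

# Wrap a base optimizer class with the ZeRO-1 sharded optimizer.
opt = ZeroRedundancyOptimizer(model.parameters(), torch.optim.SGD, lr=0.01)

loss = model(torch.randn(8, 8, device=device)).sum()
loss.backward()
opt.step()

# This is where the test trips: materializing the lazy graph hits
# "Check failed: HasValue()" because an input buffer has already been deleted.
xm.mark_step()
```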
Any idea what the actual error means:
buffer with shape f32[8] on device CPU:0 is deleted
The CPU test passed in the nightly docker image.
Could that buffer be donated?
If it were donated it would show as deleted, but the debug log I saw showed it as null.
Hmm, "buffer with shape f32[8] on device CPU:0 is deleted" means the buffer has been aliased to an output. Can you try running it with XLA_ENABLE_PARAM_ALIASING=0? You can also try rebasing this PR, since I recently changed how aliasing works.
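If it helps, one way to take aliasing out of the picture for a single run is to set the environment variable before torch_xla initializes the runtime (a sketch, not the exact test harness):

```python
import os

# Disable XLA input/output buffer aliasing before torch_xla is imported,
# so donated-buffer behavior cannot delete the optimizer's input buffers.
os.environ["XLA_ENABLE_PARAM_ALIASING"] = "0"

import torch_xla.core.xla_model as xm  # noqa: E402
```

Equivalently, the variable can be set on the command line when invoking the test, e.g. XLA_ENABLE_PARAM_ALIASING=0 python test/test_zero1.py.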