xla Need to reenable ZeRO1 for GPU to enable coverage for reduce-scatter/all-gather

🐛 Bug

The ZeRO1 test (test/test_zero1.py) has been disabled for GPU since version 2.1 (https://github.com/pytorch/xla/pull/4912). We should reenable it on GPU to restore test coverage for reduce-scatter/all-gather.

When I tried with torch/xla version 2.2 (sha 7c46e4c5), the test itself passed but the process hit a segmentation fault on exit:

----------------------------------------------------------------------
Ran 1 test in 1.428s

OK
Segmentation fault (core dumped)

To Reproduce

Steps to reproduce the behavior:

Build torch/xla as in https://github.com/pytorch/xla/blob/master/docs/gpu.md
Edit test/test_zero1.py and remove or comment out the decorator that starts with

@unittest.skipIf(pjrt.device_type() == 'GPU',
                   "TODO(alanwaketan): Fix it for the token change.")

Run the test

GPU_NUM_DEVICES=1 PJRT_DEVICE=CUDA python test/test_zero1.py
GPU_NUM_DEVICES=2 PJRT_DEVICE=CUDA python test/test_zero1.py
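
Because unittest prints OK before the crash, the failure is only visible in the process exit status. The following is a minimal sketch, not part of the torch_xla test suite, of a wrapper that runs both configurations above and reports whether the interpreter was killed by a signal. The helper names `describe_exit` and `run_zero1` are hypothetical, and the sketch assumes a POSIX system, where `subprocess` reports death by signal as a negative return code.

```python
import os
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


def run_zero1(num_devices, test_path="test/test_zero1.py"):
    """Run the test with the environment variables from the repro steps."""
    env = dict(os.environ, GPU_NUM_DEVICES=str(num_devices), PJRT_DEVICE="CUDA")
    proc = subprocess.run([sys.executable, test_path], env=env)
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    for n in (1, 2):
        print(f"GPU_NUM_DEVICES={n}: {run_zero1(n)}")
```

Run it from the root of the pytorch/xla checkout so that the relative test path resolves.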

Expected behavior

Test runs and passes on GPUs without segfault

Environment

Reproducible on XLA backend [CPU/TPU]: GPU/CUDA
torch_xla version: 2.2 (sha 7c46e4c5)

Additional context

Jan 04 '24 18:01 jeffhataws

Thanks Jeff for filing the issue.

Jan 04 '24 18:01 alanwaketan

Back-trace for GPU_NUM_DEVICES=2 case:

#0  0x00005605c2e3e550 in ?? ()
#1  0x00007f3bf28153cb in xla::TrackedDeviceBuffer::~TrackedDeviceBuffer() ()                                                             
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#2  0x00007f3bf27e53d0 in absl::lts_20230802::internal_statusor::StatusOrData<std::shared_ptr<xla::TrackedDeviceBuffer> >::~StatusOrData() () from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so    
#3  0x00007f3bf27f6607 in xla::PjRtStreamExecutorBuffer::Delete() ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#4  0x00007f3bf27f6773 in xla::PjRtStreamExecutorBuffer::~PjRtStreamExecutorBuffer() ()                                                   
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#5  0x00007f3bf27f6942 in xla::PjRtStreamExecutorBuffer::~PjRtStreamExecutorBuffer() ()                                                   
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#6  0x00007f3bf2235cfa in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()                                              
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#7  0x00007f3bf26b0efb in torch_xla::runtime::PjRtComputationClient::PjRtData::~PjRtData() ()                                             
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#8  0x00007f3bf25efd5a in torch_xla::DeviceData::~DeviceData() ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#9  0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#10 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#11 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#12 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#13 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#14 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#15 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#16 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#17 0x00007f3f5044b66e in torch::lazy::Node::~Node() () from /usr/local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so             
#18 0x00007f3bf222d4b2 in std::_Sp_counted_ptr_inplace<torch::lazy::Value, std::allocator<torch::lazy::Value>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()                                                    
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#19 0x00007f3bf231b011 in std::unordered_map<long, std::shared_ptr<torch::lazy::Value>, std::hash<long>, std::equal_to<long>, std::allocator<std::pair<long const, std::shared_ptr<torch::lazy::Value> > > >::~unordered_map() ()                                                   
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git3dce325-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so     
#20 0x00007f3f58b114d7 in __run_exit_handlers (status=0, listp=0x7f3f58ca4718 <__exit_funcs>,                                             
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108                                             
#21 0x00007f3f58b1167a in __GI_exit (status=<optimized out>) at exit.c:139                                                                
#22 0x00007f3f58db90fb in Py_Exit (sts=0) at Python/pylifecycle.c:2295                                                                    
#23 0x00007f3f58dbb300 in handle_system_exit () at Python/pythonrun.c:684                                                                 
#24 0x00007f3f58dbb499 in _PyErr_PrintEx (tstate=0x5605bf08f710, set_sys_last_vars=set_sys_last_vars@entry=1) at Python/pythonrun.c:694   
#25 0x00007f3f58dbb79d in PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at Python/pythonrun.c:789                           
#26 0x00007f3f58dbb7a7 in PyErr_Print () at Python/pythonrun.c:795
#27 0x00007f3f58dba443 in pyrun_simple_file (flags=0x7ffd34b760a8, closeit=<optimized out>, filename='test_zero1.py', fp=<optimized out>) 
    at Python/pythonrun.c:445
#28 PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7ffd34b760a8)   
    at Python/pythonrun.c:472
#29 0x00007f3f58e7d9f0 in pymain_run_file (cf=0x7ffd34b760a8, config=0x5605bf08ea20) at Modules/main.c:385
#30 pymain_run_python (exitcode=0x7ffd34b760a0) at Modules/main.c:610
#31 Py_RunMain () at Modules/main.c:689
#32 0x00007f3f58e7d609 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:743
#33 0x00007f3f58af9d0a in __libc_start_main (main=0x5605be94c050 <main>, argc=2, argv=0x7ffd34b762b8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffd34b762a8) at ../csu/libc-start.c:308
#34 0x00005605be94c08a in _start ()

Jan 05 '24 04:01 jeffhataws

From the backtrace, it seems like a double release...

Jan 05 '24 18:01 alanwaketan

Backtrace with pytorch also built with DEBUG=1:

#0  0x00007fa57f42363a in xla::LocalDeviceState::allocation_model() const ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git7c46e4c-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so
#1  0x00007fa57f402d9f in xla::PjRtStreamExecutorBuffer::Release(bool) ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git7c46e4c-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so
#2  0x00007fa57f4032ac in xla::PjRtStreamExecutorBuffer::Delete() ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git7c46e4c-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so
#3  0x00007fa57f4026bb in xla::PjRtStreamExecutorBuffer::~PjRtStreamExecutorBuffer() ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git7c46e4c-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so
#4  0x00007fa57f4027d6 in xla::PjRtStreamExecutorBuffer::~PjRtStreamExecutorBuffer() ()
   from /usr/local/lib/python3.8/site-packages/torch_xla-2.2.0+git7c46e4c-py3.8-linux-x86_64.egg/_XLAC.cpython-38-x86_64-linux-gnu.so
#5  0x00007fa57f11c3f2 in std::default_delete<xla::PjRtBuffer>::operator() (this=0x55b83fe433e0, __ptr=0x55b840790960)
    at /usr/include/c++/10/bits/unique_ptr.h:85
#6  0x00007fa57f130414 in std::_Sp_counted_deleter<xla::PjRtBuffer*, std::default_delete<xla::PjRtBuffer>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b83fe433d0) at /usr/include/c++/10/bits/shared_ptr_base.h:474
#7  0x00007fa57e8c444d in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b83fe433d0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#8  0x00007fa57e8af613 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b840bc6d20, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#9  0x00007fa57f10d40e in std::__shared_ptr<xla::PjRtBuffer, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b840bc6d18, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#10 0x00007fa57f10d42a in std::shared_ptr<xla::PjRtBuffer>::~shared_ptr (this=0x55b840bc6d18, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#11 0x00007fa57f12fb30 in torch_xla::runtime::PjRtComputationClient::PjRtData::~PjRtData (this=0x55b840bc6b00, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/runtime/pjrt_computation_client.h:131
#12 0x00007fa57f130843 in __gnu_cxx::new_allocator<torch_xla::runtime::PjRtComputationClient::PjRtData>::destroy<torch_xla::runtime::PjRtComputationClient::PjRtData> (this=0x55b840bc6b00, __p=0x55b840bc6b00) at /usr/include/c++/10/ext/new_allocator.h:156
#13 0x00007fa57f1305cd in std::allocator_traits<std::allocator<torch_xla::runtime::PjRtComputationClient::PjRtData> >::destroy<torch_xla::runtime::PjRtComputationClient::PjRtData> (__a=..., __p=0x55b840bc6b00) at /usr/include/c++/10/bits/alloc_traits.h:531
#14 0x00007fa57f13016b in std::_Sp_counted_ptr_inplace<torch_xla::runtime::PjRtComputationClient::PjRtData, std::allocator<torch_xla::runtime::PjRtComputationClient::PjRtData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b840bc6af0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:560
#15 0x00007fa57e8c444d in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b840bc6af0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#16 0x00007fa57e8af613 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b84027fd88, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#17 0x00007fa57e8a60ec in std::__shared_ptr<torch::lazy::BackendData, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b84027fd80, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#18 0x00007fa57e8a6108 in std::shared_ptr<torch::lazy::BackendData>::~shared_ptr (this=0x55b84027fd80, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#19 0x00007fa57ef9fb7e in torch_xla::DeviceData::~DeviceData (this=0x55b84027fae0, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/device_data.h:11
#20 0x00007fa57ead4ef5 in __gnu_cxx::new_allocator<torch_xla::DeviceData>::destroy<torch_xla::DeviceData> (this=0x55b84027fae0, 
    __p=0x55b84027fae0) at /usr/include/c++/10/ext/new_allocator.h:156
#21 0x00007fa57ead4e59 in std::allocator_traits<std::allocator<torch_xla::DeviceData> >::destroy<torch_xla::DeviceData> (__a=..., 
    __p=0x55b84027fae0) at /usr/include/c++/10/bits/alloc_traits.h:531
#22 0x00007fa57ead4cc7 in std::_Sp_counted_ptr_inplace<torch_xla::DeviceData, std::allocator<torch_xla::DeviceData>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b84027fad0) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#23 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b84027fad0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#24 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f9a0000b908, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#25 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f9a0000b900, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#26 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x7f9a0000b900, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#27 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x7f9a0000b900)
    at /usr/include/c++/10/bits/stl_construct.h:140
#28 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x7f9a0000b900, 
    __last=0x7f9a0000b910) at /usr/include/c++/10/bits/stl_construct.h:152
#29 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x7f9a0000b900, __last=0x7f9a0000b910)
    at /usr/include/c++/10/bits/stl_construct.h:185
#30 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x7f9a0000b900, __last=0x7f9a0000b910) at /usr/include/c++/10/bits/alloc_traits.h:738
#31 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x7f9a0000c5c8, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#32 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x7f9a0000c550, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#33 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x7f9a0000c550, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#34 0x00007fa57f064520 in torch_xla::Permute::~Permute (this=0x7f9a0000c550, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/permute.h:9
#35 0x00007fa57eb68355 in __gnu_cxx::new_allocator<torch_xla::Permute>::destroy<torch_xla::Permute> (this=0x7f9a0000c550, 
    __p=0x7f9a0000c550) at /usr/include/c++/10/ext/new_allocator.h:156
#36 0x00007fa57eb667dd in std::allocator_traits<std::allocator<torch_xla::Permute> >::destroy<torch_xla::Permute> (__a=..., 
    __p=0x7f9a0000c550) at /usr/include/c++/10/bits/alloc_traits.h:531
#37 0x00007fa57eb5ea21 in std::_Sp_counted_ptr_inplace<torch_xla::Permute, std::allocator<torch_xla::Permute>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7f9a0000c540) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#38 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f9a0000c540)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#39 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f9a0000ba98, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#40 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f9a0000ba90, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#41 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x7f9a0000ba90, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#42 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x7f9a0000ba90)
    at /usr/include/c++/10/bits/stl_construct.h:140
#43 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x7f9a0000ba90, 
    __last=0x7f9a0000bab0) at /usr/include/c++/10/bits/stl_construct.h:152
#44 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x7f9a0000ba90, __last=0x7f9a0000bab0)
    at /usr/include/c++/10/bits/stl_construct.h:185
#45 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x7f9a0000ba90, __last=0x7f9a0000bab0) at /usr/include/c++/10/bits/alloc_traits.h:738
#46 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x7f9a0000cc38, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#47 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x7f9a0000cbc0, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#48 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x7f9a0000cbc0, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#49 0x00007fa57efb9d68 in torch_xla::Generic::~Generic (this=0x7f9a0000cbc0, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/generic.h:13
#50 0x00007fa57ef773f3 in __gnu_cxx::new_allocator<torch_xla::Generic>::destroy<torch_xla::Generic> (this=0x7f9a0000cbc0, 
    __p=0x7f9a0000cbc0) at /usr/include/c++/10/ext/new_allocator.h:156
#51 0x00007fa57ef773bf in std::allocator_traits<std::allocator<torch_xla::Generic> >::destroy<torch_xla::Generic> (__a=..., 
    __p=0x7f9a0000cbc0) at /usr/include/c++/10/bits/alloc_traits.h:531
#52 0x00007fa57ef772ad in std::_Sp_counted_ptr_inplace<torch_xla::Generic, std::allocator<torch_xla::Generic>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7f9a0000cbb0) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#53 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f9a0000cbb0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#54 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7f9a00008548, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#55 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7f9a00008540, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#56 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x7f9a00008540, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#57 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x7f9a00008540)
    at /usr/include/c++/10/bits/stl_construct.h:140
#58 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x7f9a00008540, 
    __last=0x7f9a00008550) at /usr/include/c++/10/bits/stl_construct.h:152
#59 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x7f9a00008540, __last=0x7f9a00008550)
    at /usr/include/c++/10/bits/stl_construct.h:185
#60 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x7f9a00008540, __last=0x7f9a00008550) at /usr/include/c++/10/bits/alloc_traits.h:738
#61 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x7f9a0000dde8, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#62 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x7f9a0000dd70, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#63 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x7f9a0000dd70, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#64 0x00007fa57f064520 in torch_xla::Permute::~Permute (this=0x7f9a0000dd70, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/permute.h:9
#65 0x00007fa57eb68355 in __gnu_cxx::new_allocator<torch_xla::Permute>::destroy<torch_xla::Permute> (this=0x7f9a0000dd70, 
    __p=0x7f9a0000dd70) at /usr/include/c++/10/ext/new_allocator.h:156
#66 0x00007fa57eb667dd in std::allocator_traits<std::allocator<torch_xla::Permute> >::destroy<torch_xla::Permute> (__a=..., 
    __p=0x7f9a0000dd70) at /usr/include/c++/10/bits/alloc_traits.h:531
#67 0x00007fa57eb5ea21 in std::_Sp_counted_ptr_inplace<torch_xla::Permute, std::allocator<torch_xla::Permute>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7f9a0000dd60) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#68 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7f9a0000dd60)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#69 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b840c37dd8, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#70 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b840c37dd0, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#71 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b840c37dd0, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#72 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x55b840c37dd0)
    at /usr/include/c++/10/bits/stl_construct.h:140
#73 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b840c37dd0, 
    __last=0x55b840c37df0) at /usr/include/c++/10/bits/stl_construct.h:152
#74 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b840c37dd0, __last=0x55b840c37df0)
    at /usr/include/c++/10/bits/stl_construct.h:185
#75 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x55b840c37dd0, __last=0x55b840c37df0) at /usr/include/c++/10/bits/alloc_traits.h:738
#76 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x55b840bd3208, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#77 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x55b840bd3190, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#78 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x55b840bd3190, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#79 0x00007fa57f07216e in torch_xla::ReduceScatter::~ReduceScatter (this=0x55b840bd3190, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/reduce_scatter.h:9
#80 0x00007fa57eb692f9 in __gnu_cxx::new_allocator<torch_xla::ReduceScatter>::destroy<torch_xla::ReduceScatter> (this=0x55b840bd3190, 
    __p=0x55b840bd3190) at /usr/include/c++/10/ext/new_allocator.h:156
#81 0x00007fa57eb67b1d in std::allocator_traits<std::allocator<torch_xla::ReduceScatter> >::destroy<torch_xla::ReduceScatter> (__a=..., 
    __p=0x55b840bd3190) at /usr/include/c++/10/bits/alloc_traits.h:531
#82 0x00007fa57eb65a9f in std::_Sp_counted_ptr_inplace<torch_xla::ReduceScatter, std::allocator<torch_xla::ReduceScatter>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b840bd3180) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#83 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b840bd3180)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#84 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b840c37d48, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#85 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b840c37d40, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#86 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b840c37d40, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#87 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x55b840c37d40)
    at /usr/include/c++/10/bits/stl_construct.h:140
#88 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b840c37d40, 
    __last=0x55b840c37d50) at /usr/include/c++/10/bits/stl_construct.h:152
#89 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b840c37d30, __last=0x55b840c37d50)
    at /usr/include/c++/10/bits/stl_construct.h:185
#90 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x55b840c37d30, __last=0x55b840c37d50) at /usr/include/c++/10/bits/alloc_traits.h:738
#91 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x55b840bf3eb8, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#92 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x55b840bf3e40, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#93 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x55b840bf3e40, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#94 0x00007fa57f07216e in torch_xla::ReduceScatter::~ReduceScatter (this=0x55b840bf3e40, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/reduce_scatter.h:9
#95 0x00007fa57eb692f9 in __gnu_cxx::new_allocator<torch_xla::ReduceScatter>::destroy<torch_xla::ReduceScatter> (this=0x55b840bf3e40, 
    __p=0x55b840bf3e40) at /usr/include/c++/10/ext/new_allocator.h:156
#96 0x00007fa57eb67b1d in std::allocator_traits<std::allocator<torch_xla::ReduceScatter> >::destroy<torch_xla::ReduceScatter> (__a=..., 
    __p=0x55b840bf3e40) at /usr/include/c++/10/bits/alloc_traits.h:531
#97 0x00007fa57eb65a9f in std::_Sp_counted_ptr_inplace<torch_xla::ReduceScatter, std::allocator<torch_xla::ReduceScatter>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b840bf3e30) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#98 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b840bf3e30)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#99 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b84156a718, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#100 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b84156a710, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#101 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b84156a710, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#102 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x55b84156a710)
    at /usr/include/c++/10/bits/stl_construct.h:140
#103 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b84156a710, 
    __last=0x55b84156a720) at /usr/include/c++/10/bits/stl_construct.h:152
#104 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b84156a700, __last=0x55b84156a720)
    at /usr/include/c++/10/bits/stl_construct.h:185
#105 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x55b84156a700, __last=0x55b84156a720) at /usr/include/c++/10/bits/alloc_traits.h:738
#106 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x55b8426e9248, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#107 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x55b8426e91d0, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#108 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x55b8426e91d0, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#109 0x00007fa57f07216e in torch_xla::ReduceScatter::~ReduceScatter (this=0x55b8426e91d0, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/reduce_scatter.h:9
#110 0x00007fa57eb692f9 in __gnu_cxx::new_allocator<torch_xla::ReduceScatter>::destroy<torch_xla::ReduceScatter> (this=0x55b8426e91d0, 
    __p=0x55b8426e91d0) at /usr/include/c++/10/ext/new_allocator.h:156
#111 0x00007fa57eb67b1d in std::allocator_traits<std::allocator<torch_xla::ReduceScatter> >::destroy<torch_xla::ReduceScatter> (__a=..., 
    __p=0x55b8426e91d0) at /usr/include/c++/10/bits/alloc_traits.h:531
#112 0x00007fa57eb65a9f in std::_Sp_counted_ptr_inplace<torch_xla::ReduceScatter, std::allocator<torch_xla::ReduceScatter>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b8426e91c0) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#113 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b8426e91c0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#114 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b84155a618, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#115 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b84155a610, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#116 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b84155a610, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#117 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x55b84155a610)
    at /usr/include/c++/10/bits/stl_construct.h:140
#118 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b84155a610, 
    __last=0x55b84155a620) at /usr/include/c++/10/bits/stl_construct.h:152
#119 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b84155a600, __last=0x55b84155a620)
    at /usr/include/c++/10/bits/stl_construct.h:185
#120 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x55b84155a600, __last=0x55b84155a620) at /usr/include/c++/10/bits/alloc_traits.h:738
#121 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x55b840c4eb88, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#122 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x55b840c4eb10, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#123 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x55b840c4eb10, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#124 0x00007fa57f07216e in torch_xla::ReduceScatter::~ReduceScatter (this=0x55b840c4eb10, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/reduce_scatter.h:9
#125 0x00007fa57eb692f9 in __gnu_cxx::new_allocator<torch_xla::ReduceScatter>::destroy<torch_xla::ReduceScatter> (this=0x55b840c4eb10, 
    __p=0x55b840c4eb10) at /usr/include/c++/10/ext/new_allocator.h:156
#126 0x00007fa57eb67b1d in std::allocator_traits<std::allocator<torch_xla::ReduceScatter> >::destroy<torch_xla::ReduceScatter> (__a=..., 
    __p=0x55b840c4eb10) at /usr/include/c++/10/bits/alloc_traits.h:531
#127 0x00007fa57eb65a9f in std::_Sp_counted_ptr_inplace<torch_xla::ReduceScatter, std::allocator<torch_xla::ReduceScatter>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b840c4eb00) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#128 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b840c4eb00)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#129 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b8426ac5d8, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#130 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b8426ac5d0, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#131 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b8426ac5d0, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#132 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x55b8426ac5d0)
    at /usr/include/c++/10/bits/stl_construct.h:140
#133 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b8426ac5d0, 
    __last=0x55b8426ac5e0) at /usr/include/c++/10/bits/stl_construct.h:152
#134 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b8426ac5c0, __last=0x55b8426ac5e0)
    at /usr/include/c++/10/bits/stl_construct.h:185
#135 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x55b8426ac5c0, __last=0x55b8426ac5e0) at /usr/include/c++/10/bits/alloc_traits.h:738
#136 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x55b842a63598, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#137 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x55b842a63520, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#138 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x55b842a63520, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#139 0x00007fa57f07216e in torch_xla::ReduceScatter::~ReduceScatter (this=0x55b842a63520, __in_chrg=<optimized out>)
    at ./torch_xla/csrc/ops/reduce_scatter.h:9
#140 0x00007fa57eb692f9 in __gnu_cxx::new_allocator<torch_xla::ReduceScatter>::destroy<torch_xla::ReduceScatter> (this=0x55b842a63520, 
    __p=0x55b842a63520) at /usr/include/c++/10/ext/new_allocator.h:156
#141 0x00007fa57eb67b1d in std::allocator_traits<std::allocator<torch_xla::ReduceScatter> >::destroy<torch_xla::ReduceScatter> (__a=..., 
    __p=0x55b842a63520) at /usr/include/c++/10/bits/alloc_traits.h:531
#142 0x00007fa57eb65a9f in std::_Sp_counted_ptr_inplace<torch_xla::ReduceScatter, std::allocator<torch_xla::ReduceScatter>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b842a63510) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#143 0x00007fa8c59aff65 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b842a63510)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#144 0x00007fa8c59ae92b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b841437188, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#145 0x00007fa8ca9efe54 in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b841437180, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#146 0x00007fa8ca9efe70 in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b841437180, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#147 0x00007fa8cb76e10b in std::_Destroy<std::shared_ptr<torch::lazy::Node> > (__pointer=0x55b841437180)
    at /usr/include/c++/10/bits/stl_construct.h:140
#148 0x00007fa8cb76d412 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b841437180, 
    __last=0x55b841437190) at /usr/include/c++/10/bits/stl_construct.h:152
#149 0x00007fa8cb76c1f2 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*> (__first=0x55b841437170, __last=0x55b841437190)
    at /usr/include/c++/10/bits/stl_construct.h:185
#150 0x00007fa8cb76b377 in std::_Destroy<std::shared_ptr<torch::lazy::Node>*, std::shared_ptr<torch::lazy::Node> > (
    __first=0x55b841437170, __last=0x55b841437190) at /usr/include/c++/10/bits/alloc_traits.h:738
#151 0x00007fa8cb76a639 in std::vector<std::shared_ptr<torch::lazy::Node>, std::allocator<std::shared_ptr<torch::lazy::Node> > >::~vector
    (this=0x55b842a239b8, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_vector.h:680
#152 0x00007fa8cb768faa in torch::lazy::Node::~Node (this=0x55b842a23940, __in_chrg=<optimized out>)
    at /home/ubuntu/jthuynh/pytorch/torch/csrc/lazy/core/ir.cpp:108
#153 0x00007fa57f0c0016 in torch_xla::XlaNode::~XlaNode (this=0x55b842a23940, __in_chrg=<optimized out>) at torch_xla/csrc/ir.cpp:111
#154 0x00007fa57f07216e in torch_xla::ReduceScatter::~ReduceScatter (this=0x55b842a23940, __in_chrg=<optimized out>)
#155 0x00007fa57eb692f9 in __gnu_cxx::new_allocator<torch_xla::ReduceScatter>::destroy<torch_xla::ReduceScatter> (this=0x55b842a23940, 
    __p=0x55b842a23940) at /usr/include/c++/10/ext/new_allocator.h:156
#156 0x00007fa57eb67b1d in std::allocator_traits<std::allocator<torch_xla::ReduceScatter> >::destroy<torch_xla::ReduceScatter> (__a=..., 
    __p=0x55b842a23940) at /usr/include/c++/10/bits/alloc_traits.h:531
#157 0x00007fa57eb65a9f in std::_Sp_counted_ptr_inplace<torch_xla::ReduceScatter, std::allocator<torch_xla::ReduceScatter>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b842a23930) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#158 0x00007fa57e8c444d in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b842a23930)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#159 0x00007fa57e8af613 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b842844ef8, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#160 0x00007fa57e8a57de in std::__shared_ptr<torch::lazy::Node, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b842844ef0, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#161 0x00007fa57e8a57fa in std::shared_ptr<torch::lazy::Node>::~shared_ptr (this=0x55b842844ef0, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#162 0x00007fa57e8a6124 in torch::lazy::Value::~Value (this=0x55b842844ef0, __in_chrg=<optimized out>)
    at bazel-out/k8-dbg/bin/external/torch/_virtual_includes/headers/torch/csrc/lazy/core/ir.h:262
#163 0x00007fa57e8fec68 in __gnu_cxx::new_allocator<torch::lazy::Value>::destroy<torch::lazy::Value> (this=0x55b842844ef0, 
    __p=0x55b842844ef0) at /usr/include/c++/10/ext/new_allocator.h:156
#164 0x00007fa57e8e945a in std::allocator_traits<std::allocator<torch::lazy::Value> >::destroy<torch::lazy::Value> (__a=..., 
    __p=0x55b842844ef0) at /usr/include/c++/10/bits/alloc_traits.h:531
#165 0x00007fa57e93c67f in std::_Sp_counted_ptr_inplace<torch::lazy::Value, std::allocator<torch::lazy::Value>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x55b842844ee0) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#166 0x00007fa57e8c444d in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55b842844ee0)
    at /usr/include/c++/10/bits/shared_ptr_base.h:158
#167 0x00007fa57e8af613 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x55b840c37d78, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:733
#168 0x00007fa57e8ad4bc in std::__shared_ptr<torch::lazy::Value, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x55b840c37d70, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1183
#169 0x00007fa57e8ad4d8 in std::shared_ptr<torch::lazy::Value>::~shared_ptr (this=0x55b840c37d70, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/shared_ptr.h:121
#170 0x00007fa57ea2e058 in std::pair<long const, std::shared_ptr<torch::lazy::Value> >::~pair (this=0x55b840c37d68, 
    __in_chrg=<optimized out>) at /usr/include/c++/10/bits/stl_pair.h:211
#171 0x00007fa57ea2e078 in __gnu_cxx::new_allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::lazy::Value> >, false> >::destroy<std::pair<long const, std::shared_ptr<torch::lazy::Value> > > (
    this=0x7fa5a57eea00 <torch_xla::(anonymous namespace)::g_all_reduce_tokens>, __p=0x55b840c37d68)
    at /usr/include/c++/10/ext/new_allocator.h:156
#172 0x00007fa57ea2d4e1 in std::allocator_traits<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::lazy::Value> >, false> > >::destroy<std::pair<long const, std::shared_ptr<torch::lazy::Value> > > (__a=..., __p=0x55b840c37d68)
    at /usr/include/c++/10/bits/alloc_traits.h:531
#173 0x00007fa57ea2c5f1 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::lazy::Value> >, false> > >::_M_deallocate_node (this=0x7fa5a57eea00 <torch_xla::(anonymous namespace)::g_all_reduce_tokens>, 
    __n=0x55b840c37d60) at /usr/include/c++/10/bits/hashtable_policy.h:2053
#174 0x00007fa57ea2b210 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::lazy::Value> >, false> > >::_M_deallocate_nodes (this=0x7fa5a57eea00 <torch_xla::(anonymous namespace)::g_all_reduce_tokens>, 
    __n=0x0) at /usr/include/c++/10/bits/hashtable_policy.h:2075
#175 0x00007fa57ea29cc8 in std::_Hashtable<long, std::pair<long const, std::shared_ptr<torch::lazy::Value> >, std::allocator<std::pair<long const, std::shared_ptr<torch::lazy::Value> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::clear (this=0x7fa5a57eea00 <torch_xla::(anonymous namespace)::g_all_reduce_tokens>) at /usr/include/c++/10/bits/hashtable.h:2030
#176 0x00007fa57ea28c98 in std::_Hashtable<long, std::pair<long const, std::shared_ptr<torch::lazy::Value> >, std::allocator<std::pair<long const, std::shared_ptr<torch::lazy::Value> > >, std::__detail::_Select1st, std::equal_to<long>, std::hash<long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::~_Hashtable (this=0x7fa5a57eea00 <torch_xla::(anonymous namespace)::g_all_reduce_tokens>, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/hashtable.h:1387
#177 0x00007fa57ea2ecfc in std::unordered_map<long, std::shared_ptr<torch::lazy::Value>, std::hash<long>, std::equal_to<long>, std::allocator<std::pair<long const, std::shared_ptr<torch::lazy::Value> > > >::~unordered_map (
    this=0x7fa5a57eea00 <torch_xla::(anonymous namespace)::g_all_reduce_tokens>, __in_chrg=<optimized out>)
    at /usr/include/c++/10/bits/unordered_map.h:102
#178 0x00007fa8dd72d4d7 in __run_exit_handlers (status=0, listp=0x7fa8dd8c0718 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#179 0x00007fa8dd72d67a in __GI_exit (status=<optimized out>) at exit.c:139
#180 0x00007fa8dd9d50fb in Py_Exit (sts=0) at Python/pylifecycle.c:2295
#181 0x00007fa8dd9d7300 in handle_system_exit () at Python/pythonrun.c:684
#182 0x00007fa8dd9d7499 in _PyErr_PrintEx (tstate=0x55b83a9f16f0, set_sys_last_vars=set_sys_last_vars@entry=1) at Python/pythonrun.c:694
#183 0x00007fa8dd9d779d in PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at Python/pythonrun.c:789
#184 0x00007fa8dd9d77a7 in PyErr_Print () at Python/pythonrun.c:795
#185 0x00007fa8dd9d6443 in pyrun_simple_file (flags=0x7ffda18a4cb8, closeit=<optimized out>, filename='test_zero1.py', 
    fp=<optimized out>) at Python/pythonrun.c:445
#186 PyRun_SimpleFileExFlags (fp=<optimized out>, filename=<optimized out>, closeit=<optimized out>, flags=0x7ffda18a4cb8)
    at Python/pythonrun.c:472
#187 0x00007fa8dda999f0 in pymain_run_file (cf=0x7ffda18a4cb8, config=0x55b83a9f0a80) at Modules/main.c:385
#188 pymain_run_python (exitcode=0x7ffda18a4cb0) at Modules/main.c:610
#189 Py_RunMain () at Modules/main.c:689
#190 0x00007fa8dda99609 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:743
#191 0x00007fa8dd715d0a in __libc_start_main (main=0x55b839e0c050 <main>, argc=2, argv=0x7ffda18a4ec8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffda18a4eb8) at ../csu/libc-start.c:308
#192 0x000055b839e0c08a in _start ()

Jan 08 '24 18:01 jeffhataws

@alanwaketan will you take a look at this issue please?

Jan 08 '24 18:01 miladm

> @alanwaketan will you take a look at this issue please?

Jeff verified that the issue is fixed at ToT (top of tree) and is now bisecting to find the commit to backport to 2.2.

Jan 08 '24 21:01 alanwaketan

I narrowed it down to commit f9c12fc11bb487675515a717ef89ecf954fe539f, which allows the updated test_zero1.py to pass on GPU. Let me cherry-pick it to 2.2 to check that the test still passes.

Jan 09 '24 17:01 jeffhataws

Confirmed that cherry-picking this change into 2.2 fixes test_zero1.py on GPU.

Jan 09 '24 17:01 jeffhataws

Thanks @alanwaketan for the debugging tips. @JackCaoG @miladm, shall we cherry-pick/backport https://github.com/pytorch/xla/commit/f9c12fc11bb487675515a717ef89ecf954fe539f and https://github.com/pytorch/xla/commit/a60f8e7c066086af50b677f097e3f1c6559d6918 into 2.2 to fix test_zero1 on GPU?

Jan 09 '24 17:01 jeffhataws

@jeffhataws Sure. Can you follow the same process of replying to https://github.com/pytorch/xla/issues/6036 and create the PR for the backport? You can just run:

git checkout r2.2
git checkout -b your_new_branch
git cherry-pick your_commit
git push origin your_new_branch

Then open the PR for the backport; we will merge it once all tests pass.

Jan 09 '24 18:01 JackCaoG

Hi @jeffhataws, I have a PR (Fix global_device_count(), local_device_count() for single process on CUDA), and the test_zero1 test currently fails with this error: https://gist.github.com/vanbasten23/b65423f2fd9c9859c0d4ecd47e058cfa. I tried to fix it by replacing the line at https://github.com/pytorch/xla/blob/7fae2ab1c68a8eb35fd66768d6619304ef506d54/torch_xla/distributed/zero_redundancy_optimizer.py#L67 with self.global_world_size = xr.global_device_count(). The test still fails, but with a different error:

root@xiowei-gpu-1:/ansible# PJRT_DEVICE=CUDA python pytorch/xla/test/test_zero1.py
F
======================================================================
FAIL: test_zero1 (__main__.XlaZeRO1Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2674, in wrapper
    method(*args, **kwargs)
  File "pytorch/xla/test/test_zero1.py", line 40, in test_zero1
    self.assertEqual(s1['state'], s2['base_state'])
  File "/usr/local/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 3518, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: The values for attribute 'shape' do not match: torch.Size([8, 8]) != torch.Size([2, 8]).

The failure occurred for item [0]['momentum_buffer']

To execute this test, run the following from the base repo dir:
     python test/test_zero1.py -k test_zero1

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.572s

FAILED (failures=1)

Any idea why the test fails? Thanks.

Jan 17 '24 19:01 vanbasten23
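
The shape mismatch reported above (torch.Size([8, 8]) vs. torch.Size([2, 8])) is what one would expect if the optimizer state were sharded four ways along the first dimension, since 8 / 4 = 2. The following is a minimal sketch of that arithmetic in plain Python, not torch_xla code. `shard_shape` is a hypothetical helper, and both the world size of 4 and the sharding along dimension 0 (with padding up to a multiple of the world size) are assumptions inferred from the error message rather than confirmed in this thread.

```python
import math


def shard_shape(full_shape, world_size):
    """Shape of one rank's shard when dim 0 is split evenly across ranks.

    Hypothetical helper for illustration only. Dim 0 is padded up to a
    multiple of world_size before splitting (an assumption).
    """
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    rows_per_rank = math.ceil(full_shape[0] / world_size)
    return (rows_per_rank, *full_shape[1:])


full = (8, 8)  # momentum_buffer shape of the unsharded optimizer state
print(shard_shape(full, 1))  # (8, 8)
print(shard_shape(full, 4))  # (2, 8)
```

If that reading is right, the two states can only have matching shapes when the effective world size seen by the optimizer is 1.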

Thanks. Will take a look. In the meantime, could you check whether CC ops such as all-gather are working properly for this test in your setup?

Jan 17 '24 22:01 jeffhataws

> Thanks. Will take a look. In the meantime, could you check whether CC ops such as all-gather are working properly for this test in your setup?

Yeah, the CC ops tests are passing:

# GPU_NUM_DEVICES=4 PJRT_DEVICE=CUDA python pytorch/xla/test/pjrt/test_runtime_gpu.py

Note that the error seems to be the same as the one on TPU: https://github.com/pytorch/xla/pull/4648#issuecomment-1516731222

Jan 17 '24 23:01 vanbasten23
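
For readers unfamiliar with the two collectives this issue is about, here is a toy single-process simulation of their semantics in plain Python. It is not the torch_xla API and performs no device communication; `reduce_scatter` and `all_gather` are illustrative function names that operate on lists, with one list per simulated rank.

```python
def reduce_scatter(per_rank_inputs):
    """Sum inputs elementwise across ranks, then give rank r the r-th chunk."""
    world_size = len(per_rank_inputs)
    length = len(per_rank_inputs[0])
    if length % world_size != 0:
        raise ValueError("input length must be divisible by world size")
    chunk = length // world_size
    summed = [sum(vals) for vals in zip(*per_rank_inputs)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(per_rank_shards):
    """Concatenate every rank's shard; each rank receives the full result."""
    full = [x for shard in per_rank_shards for x in shard]
    return [list(full) for _ in per_rank_shards]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # one list per simulated rank
shards = reduce_scatter(grads)
print(shards)              # [[11, 22], [33, 44]]
print(all_gather(shards))  # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

In this sketch, a reduce-scatter followed by an all-gather leaves every rank with the full elementwise sum, which is the round trip a sharded optimizer relies on: each rank reduces to its own shard, updates it, and then gathers the updated shards back.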

Can we try to reenable this test?

Feb 29 '24 17:02 JackCaoG

Fixed by https://github.com/pytorch/xla/pull/7132

Jun 24 '24 22:06 jeffhataws
