
`fx.export_and_import` hangs

Open justin-ngo-arm opened this issue 8 months ago • 8 comments

I have a simple program:

import torch
from torch_mlir import fx

class Conv2D(torch.nn.Module):

    def __init__(
        self,
        kernel_size=3,
        in_channels=8,
        out_channels=16,
        stride=1,
        padding=0,
        dilation=1,
        bias=True,
    ):
        super().__init__()
        self.conv = torch.nn.Conv2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            dilation=dilation,
            bias=bias,
        )

    def forward(self, x):
        return self.conv(x)

if __name__ == "__main__":
    model = Conv2D(
        kernel_size=(3, 3),
        in_channels=3,
        out_channels=8,
        stride=(1, 2),
        padding=(1, 1),
        dilation=(1, 1),
        bias=False,
    )
    model.eval()  # Set to evaluation mode
    example_input = torch.randn(2, 3, 5, 32, requires_grad=True, device="cpu")
    prog = torch.export.export(model, (example_input,))
    torch_module = fx.export_and_import(
        prog,
        func_name="temp",
        enable_graph_printing=False,
        import_symbolic_shape_expressions=True,
    )
    print(torch_module)

When I run it, it generates the Torch-MLIR module as expected. However, when the program finishes, it does not exit cleanly and instead hangs; I have to press Ctrl+C to exit. The same thing happens with one of the examples, projects/pt1/examples/fximporter_resnet18.py (I have not checked the other examples). Running my program under a debugger suggests that, at the end, some internal Python cleanup gets stuck in a loop or something similar. I'm not entirely sure what causes it.
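A hang like this can be checked for automatically by running the script in a child process with a timeout. The following is a minimal sketch using only the standard library; the helper name `exits_cleanly` and the script name `repro.py` in the usage note are placeholders, not part of torch-mlir.

```python
import subprocess
import sys


def exits_cleanly(argv, timeout_s=60.0):
    """Return True if the command exits on its own within timeout_s seconds.

    The exit code is ignored; only "did the process terminate" is checked.
    """
    try:
        subprocess.run(argv, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout,
        # so no stray process is left behind.
        return False
    return True
```

For example, `exits_cleanly([sys.executable, "repro.py"], timeout_s=120.0)` would be expected to return False while the hang is present, assuming the script otherwise completes well within the timeout.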

justin-ngo-arm • Apr 24 '25 00:04

@vivekkhandelwal1 Can you take a look at this please? Is this a known issue, and are there any fixes?

cc: @sjarus

justin-ngo-arm • Apr 24 '25 00:04

@justin-ngo-arm I reproduced the same issue while working with a model containing a MultiheadAttention layer. I call fx.export_and_import() as follows; printing 'mlir_model' works, but the process still hangs afterwards.

mlir_model = fx.export_and_import(exported_model, output_type=OutputType.TORCH, experimental_support_mutation=True)
print(mlir_model)

alaa-ali • May 15 '25 18:05

Hi, I can reproduce this as well. I attached GDB to the hung Python process, and it appears to be stuck during Python shutdown while trying to acquire the Python GIL. The stack trace I got is below.

I am not really familiar with this, but it looks like we might be trying to acquire the GIL after Python shutdown has already proceeded past the point where this is safe to do?

CC @mfeliz-cruise

#0  0x00007ffff77be7d1 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffff7bba7cb in PyCOND_TIMEDWAIT (us=<optimized out>, mut=0x7ffff7fbdfb0 <_PyRuntime+432>, cond=0x7ffff7fbdf80 <_PyRuntime+384>) at Python/condvar.h:73
#2  take_gil (tstate=0x555555610b40) at Python/ceval_gil.h:247
#3  PyEval_RestoreThread (tstate=tstate@entry=0x555555610b40) at Python/ceval.c:467
#4  0x00007ffff7c482a1 in PyGILState_Ensure () at Python/pystate.c:1389
#5  0x00007ffe656c32ad in nanobind::gil_scoped_acquire::gil_scoped_acquire (this=<optimized out>) at external/nanobind/include/nanobind/nb_misc.h:15
#6  (anonymous namespace)::PyDenseResourceElementsAttribute::getFromBuffer((anonymous namespace)::nb_buffer, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mlir::python::PyType const&, std::optional<unsigned long>, bool, mlir::python::DefaultingPyMlirContext)::{lambda(void*, void const*, unsigned long, unsigned long)#1}::operator()(void*, void const*, unsigned long, unsigned long) const (userData=0x55555e1ecf50, this=<optimized out>, data=<optimized out>, size=<optimized out>, align=<optimized out>) at external/llvm-project/mlir/lib/Bindings/Python/IRAttributes.cpp:1480
#7  (anonymous namespace)::PyDenseResourceElementsAttribute::getFromBuffer((anonymous namespace)::nb_buffer, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mlir::python::PyType const&, std::optional<unsigned long>, bool, mlir::python::DefaultingPyMlirContext)::{lambda(void*, void const*, unsigned long, unsigned long)#1}::__invoke(void*, void const*, unsigned long, unsigned long) (userData=0x55555e1ecf50, data=<optimized out>, size=<optimized out>, align=<optimized out>) at external/llvm-project/mlir/lib/Bindings/Python/IRAttributes.cpp:1475
#8  0x00007ffe64875acf in llvm::unique_function<void(void*, unsigned long, unsigned long)>::operator() (this=0x55555f09b5f0, Params=140737345480657, Params=140737345480657, Params=140737345480657)
    at external/llvm-project/llvm/include/llvm/ADT/FunctionExtras.h:387
#9  mlir::AsmResourceBlob::~AsmResourceBlob (this=0x55555f09b5d8) at external/llvm-project/mlir/include/mlir/IR/AsmState.h:134
#10 0x00007ffe68868a69 in std::_Optional_payload_base<mlir::AsmResourceBlob>::_M_destroy (this=<optimized out>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/optional:260
#11 std::_Optional_payload_base<mlir::AsmResourceBlob>::_M_reset (this=0x7ffff7fbdfac <_PyRuntime+428>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/optional:280
#12 std::_Optional_payload<mlir::AsmResourceBlob, false, false, false>::~_Optional_payload (this=0x7ffff7fbdfac <_PyRuntime+428>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/optional:401
#13 std::_Optional_base<mlir::AsmResourceBlob, false, false>::~_Optional_base (this=0x7ffff7fbdfac <_PyRuntime+428>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/optional:472
#14 mlir::DialectResourceBlobManager::BlobEntry::~BlobEntry (this=<optimized out>) at external/llvm-project/mlir/include/mlir/IR/DialectResourceBlobManager.h:36
#15 llvm::StringMapEntryStorage<mlir::DialectResourceBlobManager::BlobEntry>::~StringMapEntryStorage (this=0x55555f09b5c0) at external/llvm-project/llvm/include/llvm/ADT/StringMapEntry.h:69
#16 llvm::StringMapEntry<mlir::DialectResourceBlobManager::BlobEntry>::Destroy<llvm::MallocAllocator> (this=0x55555f09b5c0, allocator=...) at external/llvm-project/llvm/include/llvm/ADT/StringMapEntry.h:143
#17 llvm::StringMap<mlir::DialectResourceBlobManager::BlobEntry, llvm::MallocAllocator>::~StringMap (this=0x55555e296b60) at external/llvm-project/llvm/include/llvm/ADT/StringMap.h:203
#18 0x00007ffe68868856 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555e296b10) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:168
#19 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:705
#20 std::__shared_ptr<mlir::DialectResourceBlobManager, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
#21 mlir::ResourceBlobManagerDialectInterface::~ResourceBlobManagerDialectInterface (this=0x55555e108430) at external/llvm-project/mlir/include/mlir/IR/DialectResourceBlobManager.h:112
#22 mlir::ResourceBlobManagerDialectInterfaceBase<mlir::DialectResourceBlobHandle<mlir::BuiltinDialect> >::~ResourceBlobManagerDialectInterfaceBase (this=0x55555e108430)
    at external/llvm-project/mlir/include/mlir/IR/DialectResourceBlobManager.h:141
#23 0x00007ffe6888c991 in std::default_delete<mlir::DialectInterface>::operator() (this=<optimized out>, __ptr=0x7ffff7fbdfac <_PyRuntime+428>)
    at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#24 std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> >::~unique_ptr (this=<optimized out>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#25 llvm::DenseMapBase<llvm::DenseMap<mlir::TypeID, std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> >, llvm::DenseMapInfo<mlir::TypeID, void>, llvm::detail::DenseMapPair<mlir::TypeID, std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> > > >, mlir::TypeID, std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> >, llvm::DenseMapInfo<mlir::TypeID, void>, llvm::detail::DenseMapPair<mlir::TypeID, std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> > > >::destroyAll (this=<optimized out>)
    at external/llvm-project/llvm/include/llvm/ADT/DenseMap.h:385
#26 llvm::DenseMap<mlir::TypeID, std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> >, llvm::DenseMapInfo<mlir::TypeID, void>, llvm::detail::DenseMapPair<mlir::TypeID, std::unique_ptr<mlir::DialectInterface, std::default_delete<mlir::DialectInterface> > > >::~DenseMap (this=<optimized out>) at external/llvm-project/llvm/include/llvm/ADT/DenseMap.h:771
#27 mlir::Dialect::~Dialect (this=0x55555e2bb560) at external/llvm-project/mlir/lib/IR/Dialect.cpp:43
#28 0x00007ffe68861b7e in mlir::BuiltinDialect::~BuiltinDialect (this=0x7ffff7fbdfac <_PyRuntime+428>) at bazel-out/k8-dbg/bin/external/llvm-project/mlir/include/mlir/IR/BuiltinDialect.cpp.inc:21
#29 0x00007ffe688cfea9 in std::default_delete<mlir::Dialect>::operator() (this=<optimized out>, __ptr=0x7ffff7fbdfac <_PyRuntime+428>)
    at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#30 std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> >::~unique_ptr (this=<optimized out>) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#31 llvm::DenseMapBase<llvm::DenseMap<llvm::StringRef, std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> >, llvm::DenseMapInfo<llvm::StringRef, void>, llvm::detail::DenseMapPair<llvm::StringRef, std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> > > >, llvm::StringRef, std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> >, llvm::DenseMapInfo<llvm::StringRef, void>, llvm::detail::DenseMapPair<llvm::StringRef, std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> > > >::destroyAll (this=0x55555e10ccd0) at external/llvm-project/llvm/include/llvm/ADT/DenseMap.h:385
#32 llvm::DenseMap<llvm::StringRef, std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> >, llvm::DenseMapInfo<llvm::StringRef, void>, llvm::detail::DenseMapPair<llvm::StringRef, std::unique_ptr<mlir::Dialect, std::default_delete<mlir::Dialect> > > >::~DenseMap (this=0x55555e10ccd0) at external/llvm-project/llvm/include/llvm/ADT/DenseMap.h:771
#33 mlir::MLIRContextImpl::~MLIRContextImpl (this=0x55555e10cbd0) at external/llvm-project/mlir/lib/IR/MLIRContext.cpp:282
#34 0x00007ffe688c944a in std::default_delete<mlir::MLIRContextImpl>::operator() (this=0x55555e2b0600, __ptr=0x55555e10cbd0) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:85
#35 std::unique_ptr<mlir::MLIRContextImpl, std::default_delete<mlir::MLIRContextImpl> >::~unique_ptr (this=0x55555e2b0600) at external/gcc11_4/x86_64-linux-gnu/include/c++/11.4.0/bits/unique_ptr.h:361
#36 mlir::MLIRContext::~MLIRContext (this=0x55555e2b0600) at external/llvm-project/mlir/lib/IR/MLIRContext.cpp:357
#37 0x00007ffe6718f123 in mlirContextDestroy (context=...) at external/llvm-project/mlir/lib/CAPI/IR/IR.cpp:71
#38 0x00007ffe656cd604 in mlir::python::PyMlirContext::~PyMlirContext (this=0x7ffe500761d8) at external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:667
#39 0x00007ffe6573e764 in nanobind::detail::inst_dealloc (self=0x7ffe500761c0) at external/nanobind/src/nb_type.cpp:241
#40 0x00007ffff7b84efa in subtype_dealloc (self=0x7ffe500761c0) at Objects/typeobject.c:1353
#41 0x00007ffe6572e484 in _Py_DECREF (op=0x7ffe500761c0) at external/python3_x86_64/include/python3.9/object.h:430
#42 nanobind::detail::decref_checked (o=0x7ffe500761c0) at external/nanobind/src/common.cpp:1084
#43 0x00007ffe6573e764 in nanobind::detail::inst_dealloc (self=0x7ffe500583a0) at external/nanobind/src/nb_type.cpp:241
#44 0x00007ffe6572e484 in _Py_DECREF (op=0x7ffe500583a0) at external/python3_x86_64/include/python3.9/object.h:430
#45 nanobind::detail::decref_checked (o=0x7ffe500583a0) at external/nanobind/src/common.cpp:1084
#46 0x00007ffe6573e764 in nanobind::detail::inst_dealloc (self=0x7ffe44763ef0) at external/nanobind/src/nb_type.cpp:241
#47 0x00007ffff7b6aa39 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#48 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497
#49 free_keys_object (keys=0x55555e145bb0) at Objects/dictobject.c:598
#50 dictkeys_decref (dk=0x55555e145bb0) at Objects/dictobject.c:333
#51 dict_dealloc (mp=0x7ffe50086480) at Objects/dictobject.c:2026
#52 0x00007ffff7b875a2 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#53 clear_slots (self=<optimized out>, type=0x55555b5f62e0) at Objects/typeobject.c:1160
#54 subtype_clear (self=0x7ffe54c1e660) at Objects/typeobject.c:1178
#55 0x00007ffff7c5130b in delete_garbage (gcstate=0x55555560d4e8, gcstate=0x55555560d4e8, old=0x55555560d530, collectable=0x7fffffff9170, tstate=0x555555610b40) at Modules/gcmodule.c:1004
#56 collect (tstate=tstate@entry=0x555555610b40, generation=generation@entry=2, n_collected=n_collected@entry=0x7fffffff9248, n_uncollectable=n_uncollectable@entry=0x7fffffff9250, nofail=nofail@entry=0)
    at Modules/gcmodule.c:1273
#57 0x00007ffff7c50d5b in collect_with_callback (tstate=tstate@entry=0x555555610b40, generation=generation@entry=2) at Modules/gcmodule.c:1387
#58 0x00007ffff7c51a8b in PyGC_Collect () at Modules/gcmodule.c:2075
#59 0x00007ffff7c51a15 in _PyGC_CollectIfEnabled () at Modules/gcmodule.c:2086
#60 0x00007ffff7c47676 in Py_FinalizeEx () at Python/pylifecycle.c:1423
#61 0x00007ffff7c5058d in Py_RunMain () at Modules/main.c:683
#62 0x00007ffff7c50299 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:735
#63 0x00007ffff77f6083 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#64 0x000055555540072a in _start ()

srinathava • May 20 '25 20:05

A workaround seems to be to add:

mlir_module = None
import gc; gc.collect()

after you are done extracting the MLIR asm from mlir_module. So something like this works:

import torch
from torchvision.models import resnet18
from torch_mlir import fx

def test_resnet_export() -> None:
    model = resnet18(pretrained=False)
    input = torch.randn(10, 3, 224, 224)
    prog = torch.export.export(
        model,
        (input,),
        dynamic_shapes=None,
        strict=True,
    )
    mlir_module = fx.export_and_import(
        prog,
        import_symbolic_shape_expressions=True,
        func_name="resnet18",
    )
    mlir_asm = mlir_module.operation.get_asm(enable_debug_info=True)
    # Drop the module and collect now, before interpreter shutdown begins.
    mlir_module = None
    import gc; gc.collect()
    assert len(mlir_asm) > 0

Credit to @matthewfl for the workaround.
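The same pattern can be wrapped in a small helper so the module never outlives the function that created it. This is only a sketch of the workaround above: `extract_then_release` is a hypothetical name, and it assumes nothing else holds a reference to the object.

```python
import gc


def extract_then_release(build, extract):
    """Build an object, pull plain data out of it, then drop it.

    build:   zero-argument callable returning the heavyweight object
    extract: callable mapping that object to plain Python data (e.g. a str)
    """
    obj = build()
    try:
        return extract(obj)
    finally:
        # Drop the local reference and force a collection now, while the
        # interpreter is still fully alive, instead of at shutdown.
        del obj
        gc.collect()
```

With the test above, this would presumably be called as `extract_then_release(lambda: fx.export_and_import(prog, func_name="resnet18"), lambda m: m.operation.get_asm(enable_debug_info=True))`, returning only the assembly string.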

srinathava • May 20 '25 21:05

See https://github.com/llvm/llvm-project/pull/124832#issuecomment-3104402482 for an explanation. Essentially, the destructor for the dense resource tries to acquire the GIL during context release, which itself already holds the GIL. However, an interpreter reinitialization in between steals the GIL and causes all kinds of issues.
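As a loose analogy only (plain Python, not the actual MLIR or CPython code), the shape of the problem is a cleanup callback that waits for a lock which, from the lock's point of view, is not available to it. Here a non-reentrant `threading.Lock` stands in for the GIL, and all function names are hypothetical. The real `PyGILState_Ensure` is normally safe to call from a thread that already holds the GIL, so this sketch illustrates the symptom rather than the exact cause described in the linked comment.

```python
import threading

# Stand-in for the GIL: a plain, non-reentrant lock.
lock = threading.Lock()


def release_callback(timeout_s):
    """Stand-in for the resource deleter: needs the lock before it can run."""
    got_it = lock.acquire(timeout=timeout_s)
    if got_it:
        lock.release()
    return got_it


def destroy_while_holding_lock(timeout_s=0.5):
    """Stand-in for context teardown that runs with the lock already held."""
    with lock:
        # The callback waits for a lock its own caller owns, so it cannot
        # succeed; the timeout only keeps this demo from hanging forever.
        return release_callback(timeout_s)
```

Calling `release_callback` on its own succeeds, while `destroy_while_holding_lock` reports failure after the timeout; without a timeout, the second call would block indefinitely, much like the `take_gil` frame at the top of the stack trace.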

asl • Jul 22 '25 19:07

I believe this LLVM commit has fixed this issue. @vivekkhandelwal1 can you help me bump LLVM in Torch-MLIR to get this change in place please? Thank you in advance!

justin-ngo-arm • Jul 28 '25 19:07

Hi @justin-ngo-arm, here's the PR to do the LLVM bump. https://github.com/llvm/torch-mlir/pull/4245

vivekkhandelwal1 • Aug 13 '25 13:08

Thank you @vivekkhandelwal1 !

justin-ngo-arm • Aug 13 '25 15:08