torchdrug
[Note] Deadlock when running `layers.GraphIsomorphismConv` (issue located in `sparse_coo_tensor`)
Hi! I had finished writing this issue but resolved the problem before submitting it. I thought it might still be helpful to others, so I am posting it anyway.
I am using Python 3.8 and PyTorch 1.12.1 and recently encountered an infinite wait when running a forward pass through GIN. This did not happen before, and unfortunately I am unsure which changes in my environment might have led to it. Nevertheless, I was able to track the problem down to lines 28-32 in utils.torch:
cpp_extension.load(self.name, self.sources, self.extra_cflags, self.extra_cuda_cflags,
                   self.extra_ldflags, self.extra_include_paths, self.build_directory,
                   self.verbose, **self.kwargs)
The GIN layer calls utils.sparse_coo_tensor (lines 337-338 in layers.conv), which calls torch_ext.sparse_coo_tensor_unsafe(indices, values, size) on line 185 of utils.torch. That call in turn goes through LazyExtensionLoader, and thus the lines above.
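For readers unfamiliar with the pattern, here is a minimal sketch of how a lazy loader defers an expensive load until the first attribute access, which is why the compilation only starts once the GIN forward pass runs. `LazyModule` and its `loader` argument are hypothetical names for illustration; this is not the actual LazyExtensionLoader code, and the standard `math` module stands in for the compiled extension.

```python
import math


class LazyModule:
    """Illustrative stand-in for a lazy extension loader (hypothetical class)."""

    def __init__(self, loader):
        self._loader = loader   # zero-argument callable that builds/loads the module
        self._module = None     # nothing is loaded at construction time
        self.load_count = 0     # how many times the loader actually ran

    def __getattr__(self, key):
        # __getattr__ is only invoked for attributes that normal lookup does
        # not find, so _loader, _module and load_count never reach this point.
        if self._module is None:
            # The expensive step (JIT compilation, in the real case) happens
            # here, on first use. If it blocks, the caller blocks too.
            self._module = self._loader()
            self.load_count += 1
        return getattr(self._module, key)


# `math` plays the role of the compiled extension in this sketch.
lazy = LazyModule(lambda: math)
print(lazy.load_count)   # loader has not run yet
print(lazy.sqrt(4.0))    # first attribute access triggers the load
print(lazy.load_count)
```

Because the load happens inside the attribute lookup, a hang during compilation shows up as a hang in whatever code first touches the extension, here the sparse tensor construction.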
The cpp_extension.load call then tries to JIT-compile the torch_ext.cpp file. During compilation, however, a deadlock is encountered in jit_compile. At this point the cause is clearer: it must be related to an incomplete compilation that happened earlier by accident, most likely an interrupted build that left a stale lock file in the build directory, which later builds then wait on indefinitely. The solution is therefore straightforward: delete ~/.cache/torch_extensions/.
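If you would rather inspect the cache before deleting it, the cleanup can be sketched in Python as below. This assumes the Linux default location `~/.cache/torch_extensions` (or the `TORCH_EXTENSIONS_DIR` environment variable if set) and that the stale file is named `lock`; the helper names are hypothetical, and the paths may differ on other platforms or PyTorch versions.

```python
import os
import shutil
from pathlib import Path


def extensions_dir():
    """Return the assumed cache directory for JIT-built PyTorch extensions."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    # Assumption: Linux default location.
    return Path.home() / ".cache" / "torch_extensions"


def find_lock_files(root):
    """List files named 'lock' anywhere under root (empty list if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


def clear_cache(root):
    """Delete the whole cache directory; return True if something was removed."""
    root = Path(root)
    if root.is_dir():
        shutil.rmtree(root)
        return True
    return False
```

Calling `find_lock_files(extensions_dir())` first shows whether a leftover lock exists, and `clear_cache(extensions_dir())` then removes everything so that the extension is rebuilt from scratch on the next forward pass. Make sure no other process is compiling an extension at the same time.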
As a final note, it appears that the performance issue with sparse_coo_tensor has been resolved in PyTorch 1.13+ (see this issue). I wonder whether torch_ext is still necessary at all, or whether it could be replaced by a built-in function, potentially from torch_sparse.