
[Note] Deadlock when running `layers.GraphIsomorphismConv` (issue located in `sparse_coo_tensor`)

Open jasperhyp opened this issue 2 years ago • 0 comments

Hi! I had finished writing this issue but managed to resolve the problem before submitting it. I thought the write-up might still be helpful to others, so I am posting it anyway.

I used Python 3.8 and PyTorch 1.12.1 and recently encountered an infinite wait when running a forward pass through GIN. This had not happened before, and I am unfortunately unsure which change in my environment may have led to it. Nevertheless, I was able to track the issue down to lines 28-32 in utils.torch:

cpp_extension.load(self.name, self.sources, self.extra_cflags, self.extra_cuda_cflags,
                   self.extra_ldflags, self.extra_include_paths, self.build_directory,
                   self.verbose, **self.kwargs)

The GIN layer calls utils.sparse_coo_tensor (lines 337-338 in layers.conv), which in turn calls torch_ext.sparse_coo_tensor_unsafe(indices, values, size) at line 185 in utils.torch; this triggers the LazyExtensionLoader and thus the lines above.
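For context, the sparse tensor built along this call chain is essentially the graph adjacency, which lets the layer aggregate neighbor features in a single sparse matmul. A minimal sketch in plain PyTorch (the toy graph and one-hot features below are made up for illustration, not taken from torchdrug):

```python
import torch

# Toy graph: 3 nodes with directed edges 0->1, 1->2, 2->0 (illustrative only).
edge_index = torch.tensor([[0, 1, 2],   # source nodes
                           [1, 2, 0]])  # target nodes
num_nodes = 3

# Adjacency as a sparse COO tensor -- the same kind of tensor the
# sparse_coo_tensor call inside the GIN forward pass constructs.
adj = torch.sparse_coo_tensor(edge_index,
                              torch.ones(edge_index.size(1)),
                              (num_nodes, num_nodes))

# One sparse matmul sums, for each node, the features of its neighbors.
x = torch.eye(num_nodes)            # one-hot node features
aggregated = torch.sparse.mm(adj, x)
```

With one-hot features, row i of `aggregated` simply picks out node i's neighbor, which makes the aggregation easy to verify by hand.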

That function tries to load the torch_ext.cpp file and JIT-compile it; during compilation, however, a deadlock occurs in jit_compile. At this stage the issue became clearer: an earlier compilation must have been interrupted, leaving the build cache in an inconsistent state. The solution is accordingly straightforward: delete ~/.cache/torch_extensions/.
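If you prefer to script the cleanup, a small stdlib helper like the one below works. The function name `clear_torch_extension_cache` is mine, and the default path assumes PyTorch's usual cache location with the `TORCH_EXTENSIONS_DIR` environment variable unset:

```python
import pathlib
import shutil

def clear_torch_extension_cache(cache_dir=None):
    """Remove PyTorch's JIT extension build cache so that stale state
    left behind by an interrupted compilation cannot stall the next build.

    cache_dir defaults to ~/.cache/torch_extensions, the usual location
    when TORCH_EXTENSIONS_DIR is not set.
    """
    if cache_dir is None:
        cache_dir = pathlib.Path.home() / ".cache" / "torch_extensions"
    cache_dir = pathlib.Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
    return cache_dir
```

The next forward pass will then recompile the extension from scratch instead of waiting on leftovers from the aborted build.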

As a final note, it appears that in PyTorch 1.13+ the performance issue with the built-in sparse_coo_tensor has been resolved (see this issue). I wonder whether torch_ext is still necessary at all, and whether it could be replaced by a built-in function, potentially from torch_sparse.
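As a sanity check that the built-in constructor covers the same basic use, here is a plain torch.sparse_coo_tensor call, which needs no C++ extension or JIT compilation at all (the shapes and values are arbitrary):

```python
import torch

# indices: 2 x nnz (row indices on top, column indices below);
# values: one entry per nonzero.
indices = torch.tensor([[0, 1, 1],
                        [1, 0, 2]])
values = torch.tensor([3.0, 4.0, 5.0])

# Built-in sparse COO constructor, available in core PyTorch.
sparse = torch.sparse_coo_tensor(indices, values, (2, 3))
dense = sparse.to_dense()
```

Whether the built-in matches the "unsafe" variant's speed (which skips validation) on large graphs is a separate question from correctness, which the snippet above demonstrates.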

jasperhyp · Nov 28 '23 21:11