pyg-lib
Debug segmentation fault
🐛 Describe the bug
Dear pyg-lib team,
I encountered a segmentation fault when I call:
out = torch.ops.pyg.merge_sampler_outputs(
sampled_nodes_with_dupl,
edge_ids,
cumm_sampled_nbrs_per_node,
partition_ids,
partition_orders,
partitions_num,
one_hop_num,
src_batch,
self.disjoint,
)
The error is:
(TrainerActor pid=15637) *** SIGSEGV received at time=1724074372 ***
(TrainerActor pid=15637) PC: @ 0x110b277b0 (unknown) pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637) @ 0x1080ecde0 (unknown) absl::lts_20230125::WriteFailureInfo()
(TrainerActor pid=15637) @ 0x1080ecb2c (unknown) absl::lts_20230125::AbslFailureSignalHandler()
(TrainerActor pid=15637) @ 0x190087584 (unknown) _sigtramp
(TrainerActor pid=15637) @ 0x110b27778 (unknown) pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637) @ 0x110b29a80 (unknown) c10::impl::call_functor_with_args_from_stack_<>()
(TrainerActor pid=15637) @ 0x110b29958 (unknown) c10::impl::make_boxed_from_unboxed_functor<>::call()
(TrainerActor pid=15637) @ 0x383c8b4fc (unknown) torch::autograd::basicAutogradNotImplementedFallbackImpl()
(TrainerActor pid=15637) @ 0x38002e664 (unknown) c10::Dispatcher::callBoxed()
(TrainerActor pid=15637) @ 0x10bb8af14 (unknown) torch::jit::invokeOperatorFromPython()
(TrainerActor pid=15637) @ 0x10bb8b668 (unknown) torch::jit::_get_operation_for_overload_or_packet()
(TrainerActor pid=15637) @ 0x10bad43a8 (unknown) pybind11::detail::argument_loader<>::call<>()
(TrainerActor pid=15637) @ 0x10bad41f0 (unknown) pybind11::cpp_function::initialize<>()::{lambda()#1}::__invoke()
(TrainerActor pid=15637) @ 0x10b4a9b7c (unknown) pybind11::cpp_function::dispatcher()
(TrainerActor pid=15637) @ 0x104f82d88 (unknown) cfunction_call
(TrainerActor pid=15637) @ 0x104f32060 (unknown) _PyObject_Call
(TrainerActor pid=15637) @ 0x105026f20 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) @ 0x104f3116c (unknown) _PyObject_FastCallDictTstate
(TrainerActor pid=15637) @ 0x104f326a0 (unknown) _PyObject_Call_Prepend
(TrainerActor pid=15637) @ 0x104fa63c0 (unknown) slot_tp_call
(TrainerActor pid=15637) @ 0x104f31348 (unknown) _PyObject_MakeTpCall
(TrainerActor pid=15637) @ 0x105025580 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) @ 0x104f4af60 (unknown) gen_send_ex2
(TrainerActor pid=15637) @ 0x104cbcfac (unknown) task_step_impl
(TrainerActor pid=15637) @ 0x104cbcd84 (unknown) task_step
(TrainerActor pid=15637) @ 0x104f31348 (unknown) _PyObject_MakeTpCall
(TrainerActor pid=15637) @ 0x105044c04 (unknown) context_run
(TrainerActor pid=15637) @ 0x104f824b8 (unknown) cfunction_vectorcall_FASTCALL_KEYWORDS
(TrainerActor pid=15637) @ 0x105026f20 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) @ 0x104f34410 (unknown) method_vectorcall
(TrainerActor pid=15637) @ 0x105026f20 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) @ 0x104f34410 (unknown) method_vectorcall
(TrainerActor pid=15637) @ 0x1050f6558 (unknown) thread_run
(TrainerActor pid=15637) @ ... and at least 3 more frames
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: *** SIGSEGV received at time=1724074372 ***
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: PC: @ 0x110b277b0 (unknown) pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: @ 0x1080ecde0 (unknown) absl::lts_20230125::WriteFailureInfo()
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: @ 0x1080ecb44 (unknown) absl::lts_20230125::AbslFailureSignalHandler()
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: @ 0x190087584 (unknown) _sigtramp
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: @ 0x110b27778 (unknown) pyg::sampler::(anonymous namespace)::merge_outputs<>()
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: @ 0x110b29a80 (unknown) c10::impl::call_functor_with_args_from_stack_<>()
(TrainerActor pid=15637) [2024-08-19 15:32:52,527 E 15637 51205440] logging.cc:440: @ 0x110b29958 (unknown) c10::impl::make_boxed_from_unboxed_functor<>::call()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x383c8b4fc (unknown) torch::autograd::basicAutogradNotImplementedFallbackImpl()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x38002e664 (unknown) c10::Dispatcher::callBoxed()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x10bb8af14 (unknown) torch::jit::invokeOperatorFromPython()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x10bb8b668 (unknown) torch::jit::_get_operation_for_overload_or_packet()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x10bad43a8 (unknown) pybind11::detail::argument_loader<>::call<>()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x10bad41f0 (unknown) pybind11::cpp_function::initialize<>()::{lambda()#1}::__invoke()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x10b4a9b7c (unknown) pybind11::cpp_function::dispatcher()
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f82d88 (unknown) cfunction_call
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f32060 (unknown) _PyObject_Call
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x105026f20 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f3116c (unknown) _PyObject_FastCallDictTstate
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f326a0 (unknown) _PyObject_Call_Prepend
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104fa63c0 (unknown) slot_tp_call
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f31348 (unknown) _PyObject_MakeTpCall
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x105025580 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f4af60 (unknown) gen_send_ex2
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104cbcfac (unknown) task_step_impl
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104cbcd84 (unknown) task_step
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f31348 (unknown) _PyObject_MakeTpCall
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x105044c04 (unknown) context_run
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f824b8 (unknown) cfunction_vectorcall_FASTCALL_KEYWORDS
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x105026f20 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f34410 (unknown) method_vectorcall
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x105026f20 (unknown) _PyEval_EvalFrameDefault
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x104f34410 (unknown) method_vectorcall
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ 0x1050f6558 (unknown) thread_run
(TrainerActor pid=15637) [2024-08-19 15:32:52,528 E 15637 51205440] logging.cc:440: @ ... and at least 3 more frames
(TrainerActor pid=15637) Fatal Python error: Segmentation fault
(TrainerActor pid=15637)
(TrainerActor pid=15637) Stack (most recent call first):
(TrainerActor pid=15637) File "/Users/mdenadai/.local/share/virtualenvs/mine-ec3snymA/lib/python3.12/site-packages/torch/_ops.py", line 854 in __call__
(TrainerActor pid=15637) File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 842 in _merge_sampler_outputs
(TrainerActor pid=15637) File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 948 in sample_one_hop
(TrainerActor pid=15637) File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 315 in node_sample
(TrainerActor pid=15637) File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 618 in edge_sample
(TrainerActor pid=15637) File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/dist_neighbor_sampler.py", line 193 in _sample_from
(TrainerActor pid=15637) File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/events.py", line 88 in _run
(TrainerActor pid=15637) File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 1987 in _run_once
(TrainerActor pid=15637) File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 641 in run_forever
(TrainerActor pid=15637) File "/Users/mdenadai/research/mine/pytorch_geometric-master/torch_geometric/distributed/event_loop.py", line 108 in _run_loop
(TrainerActor pid=15637) File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1010 in run
(TrainerActor pid=15637) File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
(TrainerActor pid=15637) File "/opt/homebrew/Cellar/[email protected]/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1030 in _bootstrap
(TrainerActor pid=15637)
(TrainerActor pid=15637) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_osx, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, ray._raylet, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.cluster._vq, scipy.cluster._hierarchy, scipy.cluster._optimal_leaf_ordering, markupsafe._speedups, pyarrow.lib, pyarrow._json (total: 74)
Do you have a suggestion on how to debug this?
Thanks!
Environment
- pyg-lib version:
- PyTorch version:
- OS:
- Python version:
- CUDA/cuDNN version:
- How you installed PyTorch and pyg-lib (conda, pip, source):
- Any other relevant information:
PS: I wonder whether https://github.com/pyg-team/pyg-lib/blob/95aeaaaccebc2317d2a6de4cbdf903c15a3541a8/pyg_lib/csrc/sampler/cpu/dist_merge_outputs_kernel.cpp#L90 still works correctly when we have empty lists (see https://github.com/pyg-team/pytorch_geometric/blob/f7ed25ded654bc89f3bfc649b6caffead5b49a6b/torch_geometric/distributed/dist_neighbor_sampler.py#L830). @rusty1s
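For instance, one way to check this locally (reusing the variable names from the call above; the check itself is only illustrative) would be to log the per-partition cumsum lists right before the op is invoked and see whether any of them are empty:

# Hypothetical debug check, placed right before the call to
# torch.ops.pyg.merge_sampler_outputs, to spot partitions that
# returned an empty result:
for p_id, cumsum in enumerate(cumm_sampled_nbrs_per_node):
    if len(cumsum) == 0:
        print(f"partition {p_id} returned an empty cumsum list")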
I converted it into a non-optimized, plain Python version and partially fixed it (not unit tested); a small toy check follows the code below:
from typing import List, Optional, Tuple

import torch


def merge_outputs(
    node_ids: List[torch.Tensor],
    edge_ids: List[torch.Tensor],
    cumsum_neighbors_per_node: List[List[int]],
    partition_ids: List[int],
    partition_orders: List[int],
    num_partitions: int,
    num_neighbors: int,
    batch: Optional[torch.Tensor] = None,
    disjoint: bool = False,
) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor], List[int]]:
    if num_neighbors < 0:
        # Unbounded sampling: derive the slot size from the maximum
        # population over all partitions.
        max_populations = [0] * num_partitions
        for p_id in range(num_partitions):
            cumsum1 = cumsum_neighbors_per_node[p_id][1:]
            cumsum2 = cumsum_neighbors_per_node[p_id][:-1]
            population = [abs(a - b) for a, b in zip(cumsum1, cumsum2)]
            # `default=0` guards against partitions that returned no results:
            max_populations[p_id] = max(population, default=0)
        offset = max(max_populations)
    else:
        offset = num_neighbors

    p_size = len(partition_ids)
    sampled_neighbors_per_node = [0] * p_size

    # Pre-allocate one fixed-size slot of `offset` entries per source node,
    # filled with -1 so that unused entries can be dropped afterwards:
    sampled_node_ids = torch.full((p_size * offset, ), -1,
                                  dtype=node_ids[0].dtype)
    sampled_edge_ids = torch.full((p_size * offset, ), -1,
                                  dtype=edge_ids[0].dtype)
    sampled_batch = (torch.full((p_size * offset, ), -1, dtype=batch.dtype)
                     if disjoint else None)

    sampled_node_ids_vec = [n.tolist() for n in node_ids]
    sampled_edge_ids_vec = [e.tolist() for e in edge_ids]

    for j in range(p_size):
        p_id = partition_ids[j]
        p_order = partition_orders[j]

        # Skip partitions that returned no (or too few) results; this is the
        # case the original C++ kernel does not seem to handle:
        cumsum = cumsum_neighbors_per_node[p_id]
        if not cumsum or len(cumsum) <= p_order + 1:
            continue

        begin_node = cumsum[p_order]
        begin_edge = begin_node - cumsum[0]
        end_node = cumsum[p_order + 1]
        end_edge = end_node - cumsum[0]

        sampled_node_ids[j * offset:j * offset + end_node - begin_node] = \
            torch.tensor(sampled_node_ids_vec[p_id][begin_node:end_node])
        sampled_edge_ids[j * offset:j * offset + end_edge - begin_edge] = \
            torch.tensor(sampled_edge_ids_vec[p_id][begin_edge:end_edge])
        if disjoint:
            sampled_batch[j * offset:j * offset + end_node - begin_node] = \
                batch[j]

        sampled_neighbors_per_node[j] = end_node - begin_node

    # Remove the auxiliary -1 entries:
    valid_node_mask = sampled_node_ids != -1
    out_node_id = sampled_node_ids[valid_node_mask]
    valid_edge_mask = sampled_edge_ids != -1
    out_edge_id = sampled_edge_ids[valid_edge_mask]
    out_batch = sampled_batch[valid_node_mask] if disjoint else None

    return out_node_id, out_edge_id, out_batch, sampled_neighbors_per_node
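As a minimal sanity check of this Python port (toy, made-up values, not the real MovieLens data), one could call it with inputs that include a partition returning no neighbors, which is the case I suspect triggers the crash:

# Toy example: two partitions, two source nodes; partition 1 returned
# nothing, so its cumsum list is empty.
node_ids = [torch.tensor([10, 11, 12]), torch.tensor([], dtype=torch.long)]
edge_ids = [torch.tensor([100, 101, 102]), torch.tensor([], dtype=torch.long)]
cumsum = [[0, 3], []]

out_nodes, out_edges, out_batch, nbrs_per_node = merge_outputs(
    node_ids,
    edge_ids,
    cumsum,
    partition_ids=[0, 1],
    partition_orders=[0, 0],
    num_partitions=2,
    num_neighbors=3,
)
print(out_nodes)      # tensor([10, 11, 12])
print(nbrs_per_node)  # [3, 0]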
Sorry for the late reply. This indeed looks wrong. Can you share the inputs that make it crash? Also pinging @kgajdamo for visibility.
No worries! I used MovieLens with 4 partitions and the code that was released, with no modifications.