
maybe a potential bug of neighbor sampling of distributed dgl heterogeneous graph

Open yfismine opened this issue 1 year ago • 7 comments

🐛 Bug

To Reproduce

As I understand it, this may be a bug in distributed DGL. The CSRRowWisePerEtypePick function in rowwise_pick.h contains code that determines the type of an edge during neighbor sampling:

[image]

This works when all edges are inner edges of the partition, but an outer edge can trigger the assertion error below. For example, suppose local_etype_offset is [0, 5, 10] and fanout is [1, 1]. If the node I sample is an inner node of this partition, but the only edge at this node is an outer edge, then that edge's eid is likely to be greater than 10. The computed heterogenized_etype of this outer edge is then 2, and the subsequent assertion fails with the error: et[et_idx[len-1]] < num_etypes (2 vs. 2) etype values exceeding the number of fanouts.
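The arithmetic in this example can be reproduced with a small sketch. It is not the DGL implementation: it assumes the edge type is simply the index of the offset interval that contains the edge ID, and the helper name edge_type_from_offset is hypothetical.

```python
from bisect import bisect_right


def edge_type_from_offset(eid, etype_offset):
    """Return the index of the offset interval that contains eid.

    Assumption for illustration: edge IDs of type t lie in
    [etype_offset[t], etype_offset[t + 1]).
    """
    return bisect_right(etype_offset, eid) - 1


local_etype_offset = [0, 5, 10]  # two edge types: IDs 0-4 and IDs 5-9
fanouts = [1, 1]                 # one fanout per edge type
num_etypes = len(fanouts)

for eid in (3, 7, 12):
    etype = edge_type_from_offset(eid, local_etype_offset)
    # An edge ID at or beyond the last offset maps to type index 2,
    # which is not smaller than num_etypes, so the check fails.
    print(eid, etype, etype < num_etypes)
```

Under this assumption, edge IDs 3 and 7 map to types 0 and 1, while edge ID 12 maps to index 2 and fails the etype < num_etypes check, which matches the "2 vs. 2" in the reported message.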

Environment

  • DGL Version (e.g., 1.0): 2.1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): 2.3.0
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): conda
  • Python version: 3.12.3

yfismine • Jun 23 '24 11:06

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] • Jul 24 '24 01:07

I don't think it's a bug.

During partitioning, we make sure that all edges of an inner_node are assigned to the current partition. These edges are called inner edges.

Rhett-Ying • Jul 25 '24 01:07

> I don't think it's a bug.
>
> During partitioning, we make sure that all edges of an inner_node are assigned to the current partition. These edges are called inner edges.

Does that mean both the in-edges and the out-edges of an inner_node are on the same partition? And if those edges have features, are the features stored in multiple partitions?

yfismine • Jul 31 '24 03:07

I tested the partitioned subgraphs produced by the provided partition_graph function. It is easy to find nodes that are inner_node but whose directly connected edges are not all inner_edge. If a node is an inner_node, all of its in-edges appear to be inner_edge, but the same is not necessarily true of its out-edges. I think your description may be inaccurate. This has probably not caused errors so far because neighbor samplers are usually defined over incoming edges. @Rhett-Ying
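A check of this kind can be sketched in plain Python. The helper inner_nodes_with_outer_edges is hypothetical, and the four lists stand in for a partition's edge endpoints and its inner_node and inner_edge flags (assumed here to come from the node and edge data of a loaded partition); the toy values are made up for illustration.

```python
def inner_nodes_with_outer_edges(src, dst, inner_node, inner_edge):
    """Find inner nodes that touch an edge with inner_edge=False.

    Returns two sets of local node IDs: inner nodes reached by a
    non-inner in-edge, and inner nodes that emit a non-inner out-edge.
    """
    via_in, via_out = set(), set()
    for u, v, is_inner in zip(src, dst, inner_edge):
        if is_inner:
            continue
        if inner_node[v]:   # non-inner edge pointing INTO an inner node
            via_in.add(v)
        if inner_node[u]:   # non-inner edge leaving an inner node
            via_out.add(u)
    return via_in, via_out


# Toy partition: nodes 0 and 1 are inner, node 2 is not.
src = [2, 0, 1]
dst = [0, 1, 2]
inner_node = [True, True, False]
inner_edge = [True, True, False]  # edge 1 -> 2 is not an inner edge

via_in, via_out = inner_nodes_with_outer_edges(src, dst, inner_node, inner_edge)
print(via_in, via_out)
```

On this toy data the first set is empty and the second contains node 1, which is the pattern described above: no inner node has a non-inner in-edge, but an inner node can have a non-inner out-edge.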

yfismine • Aug 04 '24 10:08

> > I don't think it's a bug. During partitioning, we make sure that all edges of an inner_node are assigned to the current partition. These edges are called inner edges.
>
> Does that mean both the in-edges and the out-edges of an inner_node are on the same partition? And if those edges have features, are the features stored in multiple partitions?

yes.

Rhett-Ying • Aug 08 '24 02:08

Let me clarify more.

  1. Inner nodes are the nodes that belong to the current partition. They have inner_node=True. Node features are partitioned and saved according to inner_node.
  2. All in-edges of inner nodes are marked as inner edges. They have inner_edge=True. Edge features are partitioned and saved according to inner_edge.
  3. Because we save all in-edges of inner nodes, the partition may include some nodes that don't belong to it. These nodes have inner_node=False. Their node features are NOT saved in the feature data of the current partition.
  4. To obtain the out-degree of inner nodes, all out-edges of inner nodes are also saved in the current partition. These edges have inner_edge=False. Their edge features are not saved in the current partition.
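The four rules can be illustrated with a toy model. This is only a sketch of the description above, not the DGL partitioning code; the function build_partition, the owner mapping, and the example graph are all hypothetical.

```python
def build_partition(edges, owner, part_id):
    """Apply the four rules to a toy graph.

    edges: list of (src, dst) pairs with global node IDs.
    owner: dict mapping each node to the partition that owns it.
    Returns (inner nodes, non-inner nodes, stored edges), where each
    stored edge is (src, dst, inner_edge flag).
    """
    inner = {n for n, p in owner.items() if p == part_id}  # rule 1
    stored = []
    for u, v in edges:
        if v in inner:
            stored.append((u, v, True))    # rule 2: in-edge of an inner node
        elif u in inner:
            stored.append((u, v, False))   # rule 4: out-edge kept for out-degree
    # Rule 3: endpoints of stored edges that another partition owns.
    non_inner = {n for u, v, _ in stored for n in (u, v)} - inner
    return inner, non_inner, stored


edges = [(0, 1), (2, 0), (1, 3), (3, 2)]
owner = {0: 0, 1: 0, 2: 1, 3: 1}

inner, non_inner, stored = build_partition(edges, owner, part_id=0)
print(inner, non_inner, stored)
```

For partition 0, edges 0 -> 1 and 2 -> 0 are stored with inner_edge=True, edge 1 -> 3 is stored with inner_edge=False, and edge 3 -> 2 is not stored at all. Nodes 2 and 3 appear in the partition with inner_node=False.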

Rhett-Ying • Aug 08 '24 02:08

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] • Sep 08 '24 01:09