cugraph
cugraph copied to clipboard
[BUG] is_edge_ids in uniform_sampling does not seem to do anything
Describe the bug
When we pass is_edge_ids=True we dont seem to do anything with it in the uniform_neighbor_sample function (see below). We should triage this and fix it accordingly.
https://github.com/rapidsai/cugraph/blob/7d8f0fd63ad58ce6deada5508bfc08ee9aa46d36/python/cugraph/cugraph/sampling/uniform_neighbor_sample.py#L23-L50
MRE: EDIT: Cleaned up MRE to be easier to understand.
import cudf
import cugraph
import numpy as np
df = cudf.DataFrame({'source':[1,1],
'destination':[2,2],
'eids':[0,1]
})
df = df.astype(np.int32)
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(df,edge_attr='eids')
sample_df = cugraph.uniform_neighbor_sample(
G,
cudf.Series([1]),
fanout_vals=[2],
is_edge_ids=True
)
print(sample_df)
indices sources destinations
0 0 1 2
Expected behavior:
indices sources destinations
0 0 1 2
1 1 1 2
The C++ code operates only on vertices but I can always automatically get the start vertices from the provided edges. Also shouldn't the expected result be:
indices sources destinations
0 1 1 2
While we're addressing this, we should also update the docstring.
import cudf import cugraph import numpy as np df = cudf.DataFrame({'source':[1,1], 'destination':[2,2], 'eids':[0,1] }) df = df.astype(np.int32) G = cugraph.Graph(directed=True) G.from_cudf_edgelist(df,edge_attr='eids') sample_df = cugraph.uniform_neighbor_sample( G, cudf.Series([1]), fanout_vals=[2], is_edge_ids=True ) print(sample_df)
Nope, i want both the edges. Index 0 and Index 1. Maybe below example is easier to get my point across ?
The index_ids here are 10 and 11 and I do not renumber to be clearer.
import cudf
import cugraph
import numpy as np
df = cudf.DataFrame({'source':[1,1],
'destination':[2,2],
'eids':[10,11]
})
df = df.astype(np.int32)
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(df,edge_attr='eids', renumber=False, legacy_renum_only=False)
sample_df = cugraph.uniform_neighbor_sample(
G,
cudf.Series([1]).astype(np.int32),
fanout_vals=[2],
is_edge_ids=True
)
print(sample_df)
sources destinations indices
0 1 2 10
0 1 2 11
Ok, so you want the current behavior, but with the is_edge_ids flag specifying whether the weights are edge ids rather than always assuming they are, correct?
My interpretation was that if is_edge_ids was true, it should treat the start ids as edge ids and automatically select the vertices from those edges.
Ok, so you want the current behavior, but with the
is_edge_idsflag specifying whether the weights are edge ids rather than always assuming they are, correct?
Yup. Thats my understanding and then treating sources, destinations,edge_ids as unique identifiers so that they are sampled.
My interpretation was that if
is_edge_idswas true, it should treat the start ids as edge ids and automatically select the vertices from those edges.
Okay. Yup, thats a perfectly viable understanding too . I guess we really need a docstring here to clarify what this was supposed to do.
ok, so there are 2 problems here:
- You need to use
cugraph.MultiGraphsincecugraph.Graphwill automatically drop duplicate edges - Apparently it ignores the duplicate edge even when using a
cugraph.MultiGraphwhich is either a bug in plc graph creation oruniform_neighbor_sampleitself.
(2) is linked to the count_and_remove_duplicates call currently in the C++ code for uniform_neighbor_sample.
And I would say (1) is expected behavior that we won't fix.
(2) is linked to the
count_and_remove_duplicatescall currently in the C++ code foruniform_neighbor_sample. And I would say (1) is expected behavior that we won't fix.
Thanks for helping with the triage @alexbarghi-nv , I think we are now all in agreement.
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.