cugraph
[FEA] Set index to `_EDGE_ID_` and `_VERTEX_` for `_vertex_prop_dataframe` and `_edge_prop_dataframe` to make sampling faster
Describe the solution you'd like and any additional context
We should set the index of `_vertex_prop_dataframe` to `_VERTEX_` and the index of `_edge_prop_dataframe` to `_EDGE_ID_`, so that fetching rows by ID during sampling is fast.
Motivating example, where we see a 3x speedup when fetching a batch of 50k node IDs:
from cugraph.experimental import PropertyGraph
import numpy as np
import cudf
pg = PropertyGraph()
n_features = 100
n_rows = 10_000_000
df = cudf.DataFrame({'node_id':np.arange(n_rows)})
for feat_id in range(n_features):
    df[f'feat_{feat_id}'] = np.ones(n_rows)
pg.add_vertex_data(df,vertex_col_name='node_id')
# Sample only IDs that exist in the graph (0 .. n_rows - 1)
node_ids_to_fetch = np.random.randint(n_rows, size=50_000)
Without Index:
%%timeit
node_ids_df = cudf.DataFrame({'_VERTEX_':node_ids_to_fetch, 'input_order':np.arange(0,len(node_ids_to_fetch))})
fetched_df = node_ids_df.merge(pg._vertex_prop_dataframe, how='left')
fetched_df = fetched_df.sort_values(by='input_order')
len(fetched_df)
57.9 ms ± 8.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
With Index (3x faster):
df_with_index = pg._vertex_prop_dataframe.set_index('_VERTEX_')
%%timeit
fetched_df = df_with_index.loc[node_ids_to_fetch]
18.5 ms
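To sanity-check that the two lookups return the same rows in the same order, here is a minimal sketch that runs without a GPU. It uses pandas as a stand-in for cudf, on the assumption that `merge`, `set_index`, and `.loc` behave the same way there; the toy `vertex_df` is hypothetical and only mimics the shape of `pg._vertex_prop_dataframe`.

```python
import numpy as np
import pandas as pd

# pandas stands in for cudf here (assumption: the calls below behave alike in cudf).
n_rows = 1_000
rng = np.random.default_rng(0)

# Hypothetical stand-in for pg._vertex_prop_dataframe: an ID column plus two features.
vertex_df = pd.DataFrame({
    "_VERTEX_": np.arange(n_rows),
    "feat_0": rng.random(n_rows),
    "feat_1": rng.random(n_rows),
})
# Shuffle the rows so that row position and vertex ID no longer coincide.
vertex_df = vertex_df.sample(frac=1.0, random_state=0).reset_index(drop=True)

# IDs to fetch; all of them exist in vertex_df, and duplicates are allowed.
node_ids_to_fetch = rng.integers(0, n_rows, size=50)

# Approach 1: left merge, then restore the request order.
node_ids_df = pd.DataFrame({
    "_VERTEX_": node_ids_to_fetch,
    "input_order": np.arange(len(node_ids_to_fetch)),
})
merged = (
    node_ids_df.merge(vertex_df, on="_VERTEX_", how="left")
    .sort_values(by="input_order")
    .drop(columns="input_order")
    .reset_index(drop=True)
)

# Approach 2: set the index once, then look rows up by label.
df_with_index = vertex_df.set_index("_VERTEX_")
looked_up = df_with_index.loc[node_ids_to_fetch].reset_index()

# Both approaches should yield identical frames, row for row.
pd.testing.assert_frame_equal(merged, looked_up)
```

One difference to keep in mind: in pandas, `.loc` with a list of labels raises `KeyError` if any label is missing, whereas the left merge returns null-filled rows for missing IDs. Whether cudf behaves the same way is worth verifying before switching.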
I'm seeing a 10x speedup in my tests by setting the index and using `.loc` as shown here. Could we increase the priority of this?
Wow, nice! Yup, I expect to start work on this today or Monday.
This issue has been labeled `inactive-30d` due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled `inactive-90d` if there is no activity in the next 60 days.