cugraph

[FEA] Set index to `_EDGE_ID_` and `_VERTEX_` for `_vertex_prop_dataframe` and `_edge_prop_dataframe` to make sampling faster

Open VibhuJawa opened this issue 2 years ago • 3 comments

Describe the solution you'd like and any additional context

We should set the index of `_vertex_prop_dataframe` to `_VERTEX_` and the index of `_edge_prop_dataframe` to `_EDGE_ID_`, so that fetching rows by ID during sampling is fast.
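As a rough illustration of the edge half of this proposal, here is a minimal sketch that uses pandas as a CPU stand-in for cudf. The toy `edge_df` frame, its `_SRC_`, `_DST_`, and `weight` columns, and the variable names are assumptions made for illustration; this is not PropertyGraph's actual internal code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pg._edge_prop_dataframe (pandas instead of cudf).
edge_df = pd.DataFrame({
    "_EDGE_ID_": np.arange(6),
    "_SRC_": [0, 0, 1, 2, 3, 4],
    "_DST_": [1, 2, 2, 3, 4, 0],
    "weight": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
})

# Set the index once, up front, so later lookups by edge ID need no merge.
edge_df_indexed = edge_df.set_index("_EDGE_ID_")

# Fetch a sampled batch of edges by ID; rows come back in the requested
# order, and repeated IDs yield repeated rows.
sampled_edge_ids = np.array([4, 1, 1, 5])
batch = edge_df_indexed.loc[sampled_edge_ids]
```

Setting the index once when the data is added would move the cost out of the per-batch sampling path, which is presumably where the speedup comes from.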

Motivating example, where we see a 3x speedup when fetching a batch of 50k node IDs:

from cugraph.experimental import PropertyGraph
import numpy as np
import cudf

pg = PropertyGraph()
n_features = 100
n_rows = 10_000_000

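# Build a vertex table with one ID column and n_features float columns
df = cudf.DataFrame({'node_id': np.arange(n_rows)})
for feat_id in range(n_features):
    df[f'feat_{feat_id}'] = np.ones(n_rows)
pg.add_vertex_data(df, vertex_col_name='node_id')

# Random batch of 50k node IDs to fetch.
# IDs are drawn from [0, 100_000_000), so many fall outside the n_rows vertex IDs above.
node_ids_to_fetch = np.random.randint(100_000_000, size=50_000)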

Without Index:

%%timeit
node_ids_df = cudf.DataFrame({
    '_VERTEX_': node_ids_to_fetch,
    'input_order': np.arange(len(node_ids_to_fetch)),
})
fetched_df = node_ids_df.merge(pg._vertex_prop_dataframe, how='left')
fetched_df = fetched_df.sort_values(by='input_order')
len(fetched_df)

57.9 ms ± 8.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

With Index (3x faster):

df_with_index = pg._vertex_prop_dataframe.set_index('_VERTEX_')

%%timeit
fetched_df = df_with_index.loc[node_ids_to_fetch]

18.5 ms
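To sanity-check that the two fetch strategies return the same rows, here is a small sketch that uses pandas as a CPU stand-in for cudf. The toy `vertex_df` frame and all variable names are assumptions for illustration. It also shows one way to handle IDs that are absent from the table: in pandas, `.loc` raises a `KeyError` for missing labels, whereas `reindex` returns null rows much like the left merge does. cudf's behavior for missing labels may differ and should be checked separately.

```python
import numpy as np
import pandas as pd

# Small hypothetical stand-in for pg._vertex_prop_dataframe.
n_rows = 1_000
vertex_df = pd.DataFrame({
    "_VERTEX_": np.arange(n_rows),
    "feat_0": np.arange(n_rows, dtype="float64") * 0.5,
})

rng = np.random.default_rng(0)
ids_present = rng.integers(0, n_rows, size=50)  # every ID exists in the table

# Merge-based fetch ("Without Index"): left merge, then restore input order.
ids_df = pd.DataFrame({
    "_VERTEX_": ids_present,
    "input_order": np.arange(len(ids_present)),
})
merged = ids_df.merge(vertex_df, how="left").sort_values(by="input_order")

# Index-based fetch ("With Index"): set the index once, then look up by label.
indexed = vertex_df.set_index("_VERTEX_")
via_loc = indexed.loc[ids_present]

# Both approaches yield the same feature values in the same order.
same = np.array_equal(merged["feat_0"].to_numpy(), via_loc["feat_0"].to_numpy())

# IDs missing from the table: reindex keeps them as NaN rows,
# mirroring what the left merge produces for unmatched keys.
ids_mixed = np.array([3, n_rows + 7, 5])
via_reindex = indexed.reindex(ids_mixed)
```

If the indexed lookup is adopted, callers that currently rely on the left merge silently returning nulls for unknown IDs would need an equivalent such as `reindex`, or an explicit validity check before calling `.loc`.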

Jul 11 '22 20:07 VibhuJawa

I'm seeing 10x speedup in my tests by setting the index and using .loc as shown here. Could we increase the priority of this?

Aug 05 '22 19:08 alexbarghi-nv

Wow, nice! Yup, I expect to start work on this today or Monday.

Aug 05 '22 20:08 eriknw

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

Sep 17 '22 19:09 github-actions[bot]
