cugraph icon indicating copy to clipboard operation
cugraph copied to clipboard

Use category dtype for type in PropertyGraph

Open eriknw opened this issue 2 years ago • 8 comments

Closes #2399

This is ~not yet~ now implemented for MG.

eriknw avatar Aug 04 '22 20:08 eriknw

Codecov Report

Base: 59.94% // Head: 59.45% // Decreases project coverage by -0.49% :warning:

Coverage data is based on head (a9b6681) compared to base (94bc5ba). Patch coverage: 43.58% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.10    #2510      +/-   ##
================================================
- Coverage         59.94%   59.45%   -0.50%     
================================================
  Files               111      111              
  Lines              6219     6279      +60     
================================================
+ Hits               3728     3733       +5     
- Misses             2491     2546      +55     
Impacted Files Coverage Δ
...ugraph/cugraph/dask/structure/mg_property_graph.py 13.54% <6.38%> (-0.40%) :arrow_down:
python/cugraph/cugraph/structure/property_graph.py 97.21% <100.00%> (+0.16%) :arrow_up:
...h/cugraph/gnn/pyg_extensions/data/cugraph_store.py 72.72% <0.00%> (-8.09%) :arrow_down:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar Aug 04 '22 22:08 codecov-commenter

We should hold off on MG until the two cudf issues above are fixed. I can push MG changes to a different branch while we wait.

This PR also opens up an object model possibility. Currently, we can get the type names of a property graph via pg.vertex_types and pg.edge_types. These are inferred from the data, so there must be at least one vertex or edge with each type. What if the user knows upfront what most or all of the types will be, and wants to set these values before the data is loaded? This information would let us be more efficient with categorical dtypes for the type column, because there would be less of a need to add categories to the categorical dtype as new data comes in. So, should we allow users to add types to the property graph? This allows types to exist that have no data yet. I'm inclined to say "not yet".

eriknw avatar Aug 08 '22 02:08 eriknw

So, should we allow users to add types to the property graph? This allows types to exist that have no data yet. I'm inclined to say "not yet

I think "not yet" is the correct assumption. I think the main focus is to make the PGs efficient (both memory and computation) once the graph is created and not optimize the graph creation process.

VibhuJawa avatar Aug 08 '22 18:08 VibhuJawa

rerun tests

rlratzel avatar Sep 19 '22 14:09 rlratzel

👍

alexbarghi-nv avatar Sep 21 '22 02:09 alexbarghi-nv

rerun tests

BradReesWork avatar Sep 22 '22 14:09 BradReesWork

rerun tests

BradReesWork avatar Sep 23 '22 16:09 BradReesWork

Using categoricals for the type changes the behavior of renumber_edges_by_type and renumber_vertices_by_type. Previously, we would sort using on the name of the type. Now, we sort on the categorical value of the type, which is not alphabetical.

This probably isn't a problem, since we give the start and end for each type, and otherwise the behavior is well-behaved and correct.

eriknw avatar Sep 23 '22 20:09 eriknw

rerun tests

BradReesWork avatar Sep 27 '22 15:09 BradReesWork

rerun tests

BradReesWork avatar Sep 30 '22 14:09 BradReesWork

rerun tests

BradReesWork avatar Sep 30 '22 23:09 BradReesWork

@gpucibot merge

BradReesWork avatar Oct 03 '22 13:10 BradReesWork