
[Graphbolt][Performance] Reduce the memory usage of `preprocess_ondisk_dataset`

Open czkkkkkk opened this issue 1 year ago • 2 comments

🚀 Feature

Motivation

Currently, `preprocess_ondisk_dataset` consumes far more memory during preprocessing than the graph topology itself requires. When loading a graph with 2B nodes and 8B edges, preprocessing cannot finish on a machine with 380 GB of memory. Rough profiling shows that peak memory usage is reached when converting the DGL graph to a fused sampling graph: https://github.com/dmlc/dgl/blob/4ee0a8bddbd93963b5f078c475381f4ab521d2e1/python/dgl/graphbolt/impl/ondisk_dataset.py#L212 Two factors could be contributing to the peak memory usage:

  1. The input DGL graph passed to the function consumes about 160 GB of memory.
  2. `from_dglgraph` creates a temporary homogeneous graph as well as its CSC format.
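
For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory the topology alone needs in COO and CSC form. It assumes 64-bit node and edge IDs and ignores features, allocator overhead, and whatever DGL stores internally. The helper names `coo_bytes` and `csc_bytes` are made up for illustration and are not part of DGL or GraphBolt.

```python
GB = 10 ** 9  # decimal gigabytes, matching the figures quoted in this issue


def coo_bytes(num_edges, id_bytes=8):
    """Estimated size of a COO edge list: one source and one destination ID per edge."""
    return 2 * num_edges * id_bytes


def csc_bytes(num_nodes, num_edges, id_bytes=8, with_edge_ids=False):
    """Estimated size of a CSC structure.

    CSC keeps an index-pointer array of length num_nodes + 1 and an
    indices array of length num_edges. An optional per-edge ID array
    adds another num_edges entries.
    """
    total = (num_nodes + 1) * id_bytes + num_edges * id_bytes
    if with_edge_ids:
        total += num_edges * id_bytes
    return total


# Graph size from this issue; the 8-byte ID width is an assumption.
num_nodes = 2_000_000_000
num_edges = 8_000_000_000

coo = coo_bytes(num_edges)
csc = csc_bytes(num_nodes, num_edges)

print(f"COO topology: {coo / GB:.1f} GB")
print(f"CSC topology: {csc / GB:.1f} GB")
# Hypothetical worst case: the input COO, a temporary COO copy, and a CSC
# all resident at the same time.
print(f"COO + COO copy + CSC: {(2 * coo + csc) / GB:.1f} GB")
```

Under these assumptions a single CSC copy is about 80 GB, while an input edge list, a temporary copy, and a CSC held at the same time would total about 336 GB. That is at least consistent with the job not fitting in 380 GB, but it is only an estimate and not a measurement of what `from_dglgraph` actually allocates.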

Alternatives

Pitch

Additional context

czkkkkkk, Feb 05 '24 08:02

@Skeleton003 could you look into it and try with the new implementation: https://github.com/dmlc/dgl/pull/6986

Rhett-Ying, Feb 05 '24 09:02

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot], Mar 07 '24 01:03