dgl icon indicating copy to clipboard operation
dgl copied to clipboard

[GraphBolt] Update `ItemSampler`

Open Skeleton003 opened this issue 1 year ago • 8 comments

Description

  1. Update ItemSampler to support correct stochastic sharding across distributed groups.
  2. Modify the logic of ItemSet.__getitem__() when index is an iterable of int.

Benchmark: https://docs.google.com/document/d/1Pzk2PJoFtTZSu17wTXVK4mqvfrMLAj2xK6fcGC1pwEg/edit?usp=sharing

Checklist

Please feel free to remove inapplicable items for your PR.

  • [ ] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • [ ] I've leverage the tools to beautify the python and c++ code.
  • [ ] The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
  • [ ] All changes have test coverage
  • [ ] Code is well-documented
  • [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • [ ] Related issue is referred in this PR
  • [ ] If the PR is for a new model/paper, I've updated the example index here.

Changes

Skeleton003 avatar May 15 '24 06:05 Skeleton003

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch]; For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot avatar May 15 '24 06:05 dgl-bot

The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization. Since ItemSampler and rest of the sampling pipeline runs concurrently, runtime is not enough information to determine that.

mfbalin avatar May 15 '24 06:05 mfbalin

Commit ID: 0134ddf809e73c871f5aacbe4cf9576544531f81

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar May 15 '24 06:05 dgl-bot

The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization. Since ItemSampler and rest of the sampling pipeline runs concurrently, runtime is not enough information to determine that.

Sure, could u please give me some guide on how to do that?

Skeleton003 avatar May 15 '24 07:05 Skeleton003

The fact that runtime performance is unchanged is good. However, to verify whether the old or the new implementation is more performant, we need to track the CPU utilization. Since ItemSampler and rest of the sampling pipeline runs concurrently, runtime is not enough information to determine that.

Sure, could u please give me some guide on how to do that?

The simplest way to monitor htop -d5 or nvtop to see what is the CPU utilization. If you wanted to be more precise, you could also insert timing code into ItemSampler to see how long each call takes. It may also make sense to write a regression benchmark so that we can monitor its runtime. We can have a sampling pipeline only containing the ItemSampler and benchmark it.

mfbalin avatar May 15 '24 07:05 mfbalin

Commit ID: 428d47a3b483c1e7f9b403124a56763afbac038b

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar May 16 '24 05:05 dgl-bot

Commit ID: fa3ccb337322150991a773951da7f74ed897e8a7

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar May 16 '24 08:05 dgl-bot

Commit ID: bacdfb1c6088d51474c3581e84ef998da9d6185b

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar May 17 '24 04:05 dgl-bot

@mfbalin Thank you for your valuable suggestions, but considering the scope of this PR, I'd like to defer them to another PR.

@Rhett-Ying If everything looks good to you, please approve so I can work on.

Skeleton003 avatar May 20 '24 05:05 Skeleton003

@Skeleton003 According to the doc, examples/sampling/graphbolt/link_prediction.py runs a bit slower. Please keep an eye on the regression results.

Rhett-Ying avatar May 21 '24 00:05 Rhett-Ying

Commit ID: 1d69f6b7b4119f7cefb5801ba4ea01c94c573a4f

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot avatar May 21 '24 04:05 dgl-bot

Commit ID: 4987175ffa9db59998db8797ced119ee91c8d946

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot avatar May 21 '24 04:05 dgl-bot

Commit ID: a4b3f08b42d47b5105adaa6190a7a14593bc9ca6

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar May 21 '24 05:05 dgl-bot