[Sampling] Sampling with edge masks

Open BarclayII opened this issue 3 years ago • 16 comments

Description

This PR extends the prob argument of sample_neighbors to accept a boolean tensor to represent which edges are eligible for sampling. It also adds a mask argument to dgl.dataloading.NeighborSampler for the same purpose.

g.edata['mask'] = torch.BoolTensor(...)
sampler = NeighborSampler([5, 10, 15], mask='mask')
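To illustrate what the mask is meant to do, here is a minimal pure-Python sketch of neighbor sampling restricted to eligible edges. It is not DGL's implementation: the function name `sample_masked_neighbors` and the dictionary layout of `in_edges` are hypothetical, and it assumes sampling without replacement, where a node with fewer eligible edges than the fanout simply keeps all of them.

```python
import random


def sample_masked_neighbors(in_edges, mask, fanout, rng=None):
    """Pick up to `fanout` inbound edge IDs per node, using only masked-in edges.

    in_edges: dict mapping node ID -> list of inbound edge IDs (hypothetical layout)
    mask:     list of bools indexed by edge ID; True means the edge is eligible
    """
    rng = rng or random.Random()
    sampled = {}
    for node, eids in in_edges.items():
        # Edges whose mask is False are never candidates.
        eligible = [e for e in eids if mask[e]]
        if len(eligible) <= fanout:
            # Fewer eligible edges than the fanout: keep them all.
            sampled[node] = eligible
        else:
            # Uniform sampling without replacement among eligible edges.
            sampled[node] = rng.sample(eligible, fanout)
    return sampled


in_edges = {0: [0, 1, 2, 3], 1: [4, 5]}
mask = [True, False, True, True, False, False]
result = sample_masked_neighbors(in_edges, mask, fanout=2, rng=random.Random(0))
```

In this toy graph, node 0 receives two edges drawn from edges 0, 2 and 3, while node 1 receives none because both of its inbound edges are masked out.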

I refrained from adding a mask argument to the sample_neighbors function itself because doing so could change the interface of the GraphStorage specification, and I'm not sure what the workflow for that is.

Also included is a fix for the following issue: when -1 is specified as the fanout and some edges have probability 0, those edges could still be selected in the output.
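The intended behavior after the fix can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in this PR: `select_all_eligible` is a hypothetical helper, and `prob` stands in for a per-edge probability (or boolean mask) indexed by edge ID.

```python
def select_all_eligible(eids, prob):
    """Behavior for fanout == -1: keep every neighbor edge with nonzero probability.

    eids: list of candidate edge IDs for one node
    prob: per-edge probabilities (or booleans) indexed by edge ID
    """
    # A fanout of -1 means "take all neighbors", but edges with
    # probability 0 (or mask False) must still be excluded.
    return [e for e in eids if prob[e] > 0]


prob = [0.5, 0.0, 1.0, 0.0]
kept = select_all_eligible([0, 1, 2, 3], prob)

bool_mask = [True, False, True, False]
kept_from_mask = select_all_eligible([0, 1, 2, 3], bool_mask)
```

Both calls keep edges 0 and 2 only, since a boolean mask behaves like a 0/1 probability here.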

Closes #4354. Also closes #4441. Replaces #4621.

Checklist

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage
  • [x] Code is well-documented
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • [x] Related issue is referred in this PR

BarclayII avatar Sep 24 '22 09:09 BarclayII

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch]; For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot avatar Sep 24 '22 09:09 dgl-bot

Commit ID: 9d2257c3ab6538b68a84ff3d9c174eb009418597

Build ID: 1

Status: ❌ CI test failed in Stage [Torch GPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Sep 24 '22 09:09 dgl-bot

Commit ID: fccb57fcc366960cba334ca4e642ff84af59218e

Build ID: 2

Status: ❌ CI test failed in Stage [C++ CPU].

Report path: link

Full logs path: link

dgl-bot avatar Sep 24 '22 11:09 dgl-bot

Commit ID: 6489cf151fd909c860a384444861db52a32f7daf

Build ID: 3

Status: ❌ CI test failed in Stage [Torch GPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Sep 24 '22 12:09 dgl-bot

Commit ID: 331fec111d0f3a6664c2a4cf4a0a1f93294a6d4c

Build ID: 4

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Sep 24 '22 13:09 dgl-bot

Commit ID: 4c19bea68e6fb6ae3fd33ee3ebb0272c156b486e

Build ID: 5

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Sep 27 '22 13:09 dgl-bot

I realized that the local partition graph does not contain any of that partition's node data or edge data. Instead, they are loaded into the KVStore: see https://github.com/dmlc/dgl/blob/master/python/dgl/distributed/dist_graph.py#L335-L336 and https://github.com/dmlc/dgl/blob/master/python/dgl/distributed/dist_graph.py#L375

Does it make sense to put the features inside the local partition DGLGraph (i.e. self.client_g) after they are loaded? @Rhett-Ying

BarclayII avatar Sep 27 '22 16:09 BarclayII

Commit ID: 75319e1fb5baebea253da6e8664e3a86f65dcaff

Build ID: 6

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot avatar Sep 29 '22 17:09 dgl-bot

Turns out that the workload is much larger than I initially anticipated.

The original sample_etype_neighbors interface assumed that the probabilities of all edge types are stored in the same single tensor, which is not the case for distributed training. I reserved a week as a buffer for possible issues and then spent a week changing the interface.

I have only tested on a single machine so far. I will do more rigorous testing in a multi-machine setting in the next cycle.

BarclayII avatar Sep 29 '22 17:09 BarclayII

Commit ID: ef821ff110f2a5c21d0bc02db3b668f2c94e4e8e

Build ID: 7

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot avatar Sep 29 '22 17:09 dgl-bot

Commit ID: b740050eb66e853530731968d7a38b00cee3ebfe

Build ID: 8

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Sep 29 '22 18:09 dgl-bot

Commit ID: 9109b98cb2849ff84d18b63df21a17eb23cb4c63

Build ID: 9

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Sep 30 '22 03:09 dgl-bot

Commit ID: 5cb966b04233454e21deb8870f8ee809c1cbfa3e

Build ID: 10

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Oct 08 '22 07:10 dgl-bot

Commit ID: eff8f6d7f1fb328d778e6fbd02c9e04d41066098

Build ID: 11

Status: ❌ CI test failed in Stage [Tensorflow CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Oct 08 '22 09:10 dgl-bot

Commit ID: 91e2857057f529d5981499c5dafaf4aea7d39f92

Build ID: 13

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot avatar Oct 10 '22 04:10 dgl-bot

Commit ID: 79e42dabb412ea2d9c1d4676feaa52a3287acdb8

Build ID: 14

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Oct 11 '22 07:10 dgl-bot

Commit ID: f965c25f02106dd53f62c4e01d5277340d520159

Build ID: 16

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot avatar Oct 18 '22 10:10 dgl-bot

Commit ID: 3b0a72eac4affe89ced4cdbf7ad457b11fc7702b

Build ID: 17

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot avatar Oct 18 '22 11:10 dgl-bot

Commit ID: 7d906389aedf47ed79dcfba7dc6218f4592377be

Build ID: 18

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot avatar Oct 18 '22 11:10 dgl-bot

Commit ID: 900296f3b99107660a7fbefdb84f4ea753b4a8a9

Build ID: 19

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

dgl-bot avatar Oct 19 '22 02:10 dgl-bot

Converted this to draft PR. Not for review.

jermainewang avatar Oct 27 '22 08:10 jermainewang

Closed, as the function has been merged via #4749 and #4748.

jermainewang avatar Oct 29 '22 07:10 jermainewang