[Sampling] Sampling with edge masks
Description
This PR extends the prob argument of sample_neighbors to accept a boolean tensor that marks which edges are eligible for sampling. It also adds a mask argument to dgl.dataloading.NeighborSampler for the same purpose.
```python
import torch
from dgl.dataloading import NeighborSampler

g.edata['mask'] = torch.BoolTensor(...)
sampler = NeighborSampler([5, 10, 15], mask='mask')
```
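To illustrate the intended semantics without depending on DGL, here is a minimal pure-Python sketch of sampling restricted by a boolean edge mask. The function name `sample_in_edges` and the dict-of-lists graph layout are hypothetical and chosen only for this illustration; this is not the implementation in this PR.

```python
import random

def sample_in_edges(in_edges, mask, fanout, seed=0):
    """Pick up to `fanout` eligible in-edges per destination node.

    in_edges: dict mapping dst node -> list of edge IDs (hypothetical layout)
    mask:     list of bools indexed by edge ID; False means "never sample"
    """
    rng = random.Random(seed)
    sampled = {}
    for dst, eids in in_edges.items():
        # Only edges whose mask entry is True are candidates.
        eligible = [e for e in eids if mask[e]]
        # Uniform sampling without replacement among the eligible edges.
        k = min(fanout, len(eligible))
        sampled[dst] = sorted(rng.sample(eligible, k))
    return sampled

in_edges = {0: [0, 1, 2, 3], 1: [4, 5]}
mask = [True, False, True, True, False, False]
result = sample_in_edges(in_edges, mask, fanout=2)
```

Node 1 has no eligible in-edges in this example, so it ends up with an empty list rather than falling back to masked-out edges.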
I refrained from adding a mask argument to the sample_neighbors function itself because that could change the interface of the GraphStorage specification, and I'm not sure what the workflow for such a change is.
Also includes a fix for a bug where, when -1 is specified as the fanout, edges with probability 0 could still be selected in the output.
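The behavior this fix aims for can be sketched as follows, assuming a fanout of -1 means "take every in-edge". The helper `pick_edges` is hypothetical, and the weighted branch uses Efraimidis-Spirakis keys purely for illustration; it is not necessarily what DGL does internally.

```python
import random

def pick_edges(eids, prob, fanout, seed=0):
    """Select in-edges for one node, never returning zero-probability edges."""
    rng = random.Random(seed)
    # Drop zero-probability edges first, so fanout == -1 cannot return them.
    candidates = [e for e in eids if prob[e] > 0]
    if fanout == -1 or fanout >= len(candidates):
        return candidates
    # Weighted sampling without replacement: rank by u ** (1 / w), keep top-k.
    ranked = sorted(candidates,
                    key=lambda e: rng.random() ** (1.0 / prob[e]),
                    reverse=True)
    return sorted(ranked[:fanout])

prob = [0.5, 0.0, 0.2, 0.0]
all_edges = pick_edges([0, 1, 2, 3], prob, fanout=-1)
one_edge = pick_edges([0, 1, 2, 3], prob, fanout=1)
```

With fanout=-1, only edges 0 and 2 are returned, because edges 1 and 3 have probability 0.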
Closes #4354. Also closes #4441. Replaces #4621.
Checklist
Please feel free to remove inapplicable items for your PR.
- [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [x] Changes are complete (i.e. I finished coding on this PR)
- [x] All changes have test coverage
- [x] Code is well-documented
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [x] Related issue is referred in this PR
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master
Commit ID: 9d2257c3ab6538b68a84ff3d9c174eb009418597
Build ID: 1
Status: ❌ CI test failed in Stage [Torch GPU Unit test].
Report path: link
Full logs path: link
Commit ID: fccb57fcc366960cba334ca4e642ff84af59218e
Build ID: 2
Status: ❌ CI test failed in Stage [C++ CPU].
Report path: link
Full logs path: link
Commit ID: 6489cf151fd909c860a384444861db52a32f7daf
Build ID: 3
Status: ❌ CI test failed in Stage [Torch GPU Unit test].
Report path: link
Full logs path: link
Commit ID: 331fec111d0f3a6664c2a4cf4a0a1f93294a6d4c
Build ID: 4
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: 4c19bea68e6fb6ae3fd33ee3ebb0272c156b486e
Build ID: 5
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
I realized that the local partition graph does not contain any of that partition's node data or edge data. Instead, they are loaded into the KVStore: see https://github.com/dmlc/dgl/blob/master/python/dgl/distributed/dist_graph.py#L335-L336 and https://github.com/dmlc/dgl/blob/master/python/dgl/distributed/dist_graph.py#L375
Does it make sense to put the features inside the local partition DGLGraph (i.e. self.client_g) after they are loaded?
@Rhett-Ying
Commit ID: 75319e1fb5baebea253da6e8664e3a86f65dcaff
Build ID: 6
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Turns out that the workload is much larger than I initially anticipated.
The original sample_etype_neighbors interface assumed that the probabilities of all edge types are stored in the same single tensor, which is not the case for distributed training. I reserved a week as a buffer for possible issues and then spent a week changing the interface.
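To make the difference concrete, here is a small sketch contrasting the two layouts: one flat probability array covering all edge types versus one array per edge type. All names here (`prob_single_tensor`, `prob_per_etype`, the offsets dict, the edge type names) are hypothetical and are not the actual sample_etype_neighbors interface; treating a missing edge type as probability 1.0 is also an assumption of this sketch.

```python
def prob_single_tensor(flat_prob, offsets, etype, local_eid):
    # Single-tensor layout: edge types are stored back to back,
    # so a lookup needs the starting offset of the edge type.
    return flat_prob[offsets[etype] + local_eid]

def prob_per_etype(per_etype_prob, etype, local_eid):
    # Per-edge-type layout: each edge type has its own list.
    # A missing entry is treated as "no restriction" (sketch assumption).
    probs = per_etype_prob.get(etype)
    return 1.0 if probs is None else probs[local_eid]

flat_prob = [0.1, 0.9, 0.5, 0.0, 1.0]
offsets = {"follows": 0, "likes": 2}
per_etype_prob = {"follows": [0.1, 0.9], "likes": [0.5, 0.0, 1.0]}

# Both layouts should give the same probability for every edge.
same = all(
    prob_single_tensor(flat_prob, offsets, et, i)
    == prob_per_etype(per_etype_prob, et, i)
    for et, probs in per_etype_prob.items()
    for i in range(len(probs))
)
```

The per-edge-type layout avoids having to concatenate the arrays, which matters when each edge type's data is stored separately.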
I have only tested on a single machine so far. I will do more rigorous testing in a multi-machine setting in the next cycle.
Commit ID: ef821ff110f2a5c21d0bc02db3b668f2c94e4e8e
Build ID: 7
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: b740050eb66e853530731968d7a38b00cee3ebfe
Build ID: 8
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 9109b98cb2849ff84d18b63df21a17eb23cb4c63
Build ID: 9
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: 5cb966b04233454e21deb8870f8ee809c1cbfa3e
Build ID: 10
Status: ✅ CI test succeeded
Report path: link
Full logs path: link
Commit ID: eff8f6d7f1fb328d778e6fbd02c9e04d41066098
Build ID: 11
Status: ❌ CI test failed in Stage [Tensorflow CPU Unit test].
Report path: link
Full logs path: link
Commit ID: 91e2857057f529d5981499c5dafaf4aea7d39f92
Build ID: 13
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 79e42dabb412ea2d9c1d4676feaa52a3287acdb8
Build ID: 14
Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].
Report path: link
Full logs path: link
Commit ID: f965c25f02106dd53f62c4e01d5277340d520159
Build ID: 16
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 3b0a72eac4affe89ced4cdbf7ad457b11fc7702b
Build ID: 17
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 7d906389aedf47ed79dcfba7dc6218f4592377be
Build ID: 18
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 900296f3b99107660a7fbefdb84f4ea753b4a8a9
Build ID: 19
Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].
Report path: link
Full logs path: link
Converted this to draft PR. Not for review.
Closed, as the functionality has been merged via #4749 and #4748.