[Feature] GPU cache for node and edge data
Description
A `gpu_cache` that can be used to cache vertex or edge feature storage is implemented. The core implementation comes from the HugeCTR repository; this PR is essentially a wrapper around the `gpu_cache` available there. The planned use case is in the DataLoader, to seamlessly wrap the feature storage and speed up access to node or edge features.
Fixes issue #3461.
Example usage:
python train_sampling_unsupervised.py --graph-device=gpu --data-device=uva --cache-size=1000000 --dataset=ogbn-products
This gets 206k samples/sec, while the version without the cache gets 130k samples/sec. When all of the features are on the GPU without the cache, it gets 350k samples/sec.
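For reviewers, here is a rough sketch of the idea behind the cache (pure Python and illustrative only; the actual PR uses HugeCTR's CUDA kernels, and none of the names below are part of the proposed API): recently requested feature rows are kept in GPU memory, and only misses go back to the underlying CPU/UVA storage.

```python
import torch

class NaiveGPUCacheSketch:
    """Illustrative FIFO cache in front of a feature tensor (not the real implementation)."""

    def __init__(self, features, cache_size, device='cuda'):
        self.features = features                                  # backing CPU/UVA feature tensor
        self.cache = torch.empty(cache_size, features.shape[1],
                                 dtype=features.dtype, device=device)
        self.key_to_slot = {}                                     # node id -> row in self.cache
        self.slot_to_key = [None] * cache_size
        self.next_slot = 0

    def __getitem__(self, node_ids):
        rows = []
        for nid in node_ids.tolist():
            slot = self.key_to_slot.get(nid)
            if slot is None:                                      # miss: copy the row onto the GPU
                slot = self.next_slot
                self.next_slot = (self.next_slot + 1) % self.cache.shape[0]
                evicted = self.slot_to_key[slot]
                if evicted is not None:
                    del self.key_to_slot[evicted]                 # FIFO eviction
                self.cache[slot] = self.features[nid].to(self.cache.device)
                self.key_to_slot[nid] = slot
                self.slot_to_key[slot] = nid
            rows.append(slot)
        return self.cache[torch.tensor(rows, device=self.cache.device)]
```

The real cache does the lookup, eviction, and gathering in batched CUDA kernels rather than a Python loop, which is where the speedup in the numbers above comes from.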
Checklist
Please feel free to remove inapplicable items for your PR.
- [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [x] Changes are complete (i.e. I finished coding on this PR)
- [x] All changes have test coverage
- [ ] Code is well-documented
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [x] Related issue is referred in this PR
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]; for example: `@dgl-bot run g4dn.4xlarge all dmlc/master` or `@dgl-bot run c5.9xlarge kernel,api dmlc/master`
Commit ID: e7adf3dc3312743ea4730bf8172b78c67e5e2dfa
Build ID: 1
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: f64b6ff8e8d59154e4d85832f216b28775a13add
Build ID: 2
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Commit ID: 17bc74889aa274396115e39c22967ab36c93f0c0
Build ID: 3
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Commit ID: c4344b5436fee1cb71797fb0207a4f618da40b09
Build ID: 4
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Commit ID: 9152230225e9bd4e4806a781c716a55a9057add0
Build ID: 5
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: cb028d60f564b6ea87512af0060be0df05481bb4
Build ID: 6
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Commit ID: 61f31a64cb29238bf2e69081d7ccb0451f2d1873
Build ID: 7
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Commit ID: 61931ba1ef44e9c0e8ca64d977f42dc24f1110b9
Build ID: 8
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Hi, thanks for the contribution. I think having a GPU cache is overall a good addition to DGL, but we need to think through the user experience first before making changes. Here are my major questions/suggestions:
- Is it possible to fold the GPU cache into one of the `FeatureStorage` classes? You could check out the existing feature storage classes here and the base class here. Perhaps you could create a subclass called `GPUCacheFeatureStorage` (see the sketch after this list).
- How to minimize package dependencies? The PR currently introduces a new third-party dependency on HugeCTR, which honestly I don't know much about. Is it possible to limit the dependency to the Python side? If HugeCTR provides Python APIs for creating and accessing the embedding cache, we could wrap it in a Python class.
- Let's not complicate the `train_unsupervised` script further, because its purpose is to educate novice users about unsupervised training. Your setting is more advanced and should be demonstrated with a standalone script.
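To make the first suggestion concrete, something along these lines is what I have in mind. The `fetch` signature follows the existing `FeatureStorage` base class as I understand it; the `query`/`replace` calls on the cache object and the import path are assumptions, not an existing API:

```python
from dgl.storages.base import FeatureStorage   # import path assumed from the existing storages

class GPUCacheFeatureStorage(FeatureStorage):
    """Hypothetical subclass: answer cache hits from GPU memory and fall back
    to a wrapped storage for misses."""

    def __init__(self, base_storage, cache):
        self.base_storage = base_storage        # any existing FeatureStorage
        self.cache = cache                      # GPU cache object (e.g. HugeCTR-backed)

    def fetch(self, indices, device, pin_memory=False):
        # Assumed cache API: query() returns found rows plus the positions/keys of misses.
        values, missing_index, missing_keys = self.cache.query(indices)
        if missing_keys.numel() > 0:
            missing_values = self.base_storage.fetch(missing_keys, device, pin_memory)
            values[missing_index] = missing_values
            self.cache.replace(missing_keys, missing_values)   # warm the cache for next time
        return values
```

That way the cache stays transparent to the rest of the dataloading pipeline, since everything downstream only ever calls `fetch`.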
@jermainewang HugeCTR doesn't currently expose this via PyTorch; however, I think this only uses a handful of C++ files from it, so alternatively we could include just the needed files in `third_party` instead of as a submodule.
I agree; ideally we should have a way to wrap `FeatureStore` (or `FeatureSource` from #4431), so that regardless of where you're pulling features from, you could cache them on the training GPU to reduce traffic.
@jermainewang I have added `GpuCacheFeatureStorage` and `CachedTensor` classes for easier use of the `GpuCache`. With the addition of `CachedTensor`, it is very simple to use the `GpuCache`, and the modifications to the existing example are now minimal. However, I can still take those changes out and create a standalone example for the `GpuCache` once its API is finalized.
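Roughly, the usage now looks like the sketch below (module paths and constructor arguments here are placeholders rather than the finalized API): the node feature tensor is wrapped once, and any subsequent indexing goes through the cache.

```python
# Illustrative only; class locations and constructor arguments are placeholders.
# from dgl.contrib.gpu_cache import GpuCache, CachedTensor   # actual module path may differ

features = g.ndata.pop('features')                   # original node feature tensor
cache = GpuCache(1000000, features.shape[1])          # GPU-resident cache for hot feature rows
g.ndata['features'] = CachedTensor(features, cache)   # indexing now checks the cache first
```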
@nv-dlasalle I need feedback about the `device` argument of `UnifiedTensor`. `GpuCache` doesn't take a `device` argument and places the cache on the default CUDA device; should the API change so that the device is supplied by the user?
Commit ID: 4fa1994c8ad388564a25acb0f4258cf9b10880fc
Build ID: 9
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Commit ID: cac30539f31a497fe0fac779719b5ed05415fbf6
Build ID: 10
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
> @jermainewang HugeCTR doesn't currently expose this via PyTorch; however, I think this only uses a handful of C++ files from it, so alternatively we could include just the needed files in `third_party` instead of as a submodule.
Can you list the files to be included? Alternatively, we could borrow them into the source tree directly if there are not many and the license is compatible. The risk is that future patches in the upstream cannot be easily integrated here, which means we need to have an owner who knows them.
We need to think of how to use a customized FeatureStorage with `dgl.DGLGraph`, in particular without creating a customized wrapper of the `DGLGraph` object; this is what we previously did, but in retrospect I think it's burdensome. Perhaps having `get_node_storage` and `get_edge_storage` as methods of the GraphStorage is not a good option.
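One possible direction, purely as a strawman (the `node_feature_storages` argument below does not exist anywhere; it is only meant to illustrate avoiding a custom `DGLGraph` wrapper): let the DataLoader accept storage overrides directly, so a cached storage can be swapped in without touching the graph object.

```python
# Strawman sketch; 'node_feature_storages' is a hypothetical argument, not part of DGL today.
# g, train_nids, sampler, base_storage, and cache come from the usual training setup.
cached_feat = GPUCacheFeatureStorage(base_storage, cache)     # from the sketch above
dataloader = dgl.dataloading.DataLoader(
    g, train_nids, sampler,
    node_feature_storages={'feat': cached_feat},              # override where 'feat' is fetched from
    batch_size=1024, device='cuda')
```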
> > @jermainewang HugeCTR doesn't currently expose this via PyTorch; however, I think this only uses a handful of C++ files from it, so alternatively we could include just the needed files in `third_party` instead of as a submodule.
>
> Can you list the files to be included? Alternatively, we could borrow them into the source tree directly if there are not many and the license is compatible. The risk is that future patches in the upstream cannot be easily integrated here, which means we need to have an owner who knows them.
There are only 4 source and header files needed for the GPU cache, basically the files under this directory: https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/gpu_cache
Commit ID: 44b70bfce9ec470cb229f55ad7064a987138716a
Build ID: 11
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Commit ID: 4632b639cd69072aedb365fccb21f0d828b906dc
Build ID: 12
Status: ❌ CI test failed in Stage [GPU Build].
Report path: link
Full logs path: link
Can I get a second round of reviews for the recent updates implementing FeatureStorage and a new example using it for GPUCache training?
> Can I get a second round of reviews for the recent updates implementing FeatureStorage and a new example using it for GPUCache training?
Sorry for the late reply. We are waiting for a review of the Dataloader changes tomorrow, which will decide how to move forward with this PR.
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
Commit ID: 6b900daf2295685580dc08d08cb6e9b2b4ebc336
Build ID: 15
Status: ❌ CI test failed in Stage [Authentication].
Report path: link
Full logs path: link
@dgl-bot
Commit ID: 6b900daf2295685580dc08d08cb6e9b2b4ebc336
Build ID: 16
Status: ❌ CI test failed in Stage [Lint Check].
Report path: link
Full logs path: link
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
@dgl-bot
@dgl-bot