pytorch_geometric
[Roadmap] Remote Backend Support and Integration 🚀
Motivation
PyG currently requires users to store graphs (and associated node and edge features) in Data and HeteroData objects, which are accepted by loaders to run forward/backward passes on an accelerator of choice. This abstraction, however, does not scale to large graphs (or large feature tensors), which can quickly oversubscribe CPU DRAM, even though GPU VRAM only needs to hold each sampled subgraph and its associated node and edge features. Indeed, one can imagine storing graph features (and the graph itself) in "remote backends", which expose a fixed set of operations that integrate cleanly with downstream PyG samplers and loaders.
The goal of this roadmap is to track the integration of native remote backend support into PyG. At a high level, this will be accomplished by introducing feature store, graph store, and sampler abstractions into PyG. For more freeform discussion, please visit the #scalability channel in the PyG Slack community.
Implementation
Abstractions: FeatureStore, GraphStore, Sampler
- [x] Let `Data` and `HeteroData` implement the `FeatureStore` abstraction (#4807)
- [x] Define a `GraphStore` abstraction that is intended to hold an `edge_index` in memory (#4816)
- [x] Let `Data` and `HeteroData` implement the `GraphStore` abstraction (#4816)
- [x] Modify `NeighborLoader` to call `FeatureStore` and `GraphStore` methods instead of their `Data`/`HeteroData` counterparts. Note that this will require moving the filtering of data into the feature store. The new interface will look like `data: Union[Union[Data, HeteroData], Tuple[FeatureStore, GraphStore]]` (see the usage sketch after this list) (#4817, #4883)
- [x] Implement `BaseSampler` and refactor existing samplers behind a common interface (#5312, #5365, #5402)
- [x] Introduce `NodeLoader` and `LinkLoader`, and refactor existing loaders behind the loader + sampler interface (#5404, #5418)
- [ ] Support (optional) methods to obtain a `TensorAttr` or `EdgeAttr` from a `FeatureStore`/`GraphStore` from their first dataclass attribute, and refactor existing computations that get all (tensors, edges) and subsequently filter to use these methods
- [ ] Support variable samplers in `LightningNodeData` and `LightningLinkData`
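As a usage sketch of the new loader interface above: the snippet below assumes a `(FeatureStore, GraphStore)` pair has already been obtained from some remote backend and that the backend contains a `'paper'` node type; the store variables, node type, and sampling parameters are placeholders, not part of this roadmap.

```python
from torch_geometric.loader import NeighborLoader

# Placeholders: a real deployment would obtain these two objects from its
# backend of choice (database client, key-value store, etc.).
feature_store, graph_store = ...

loader = NeighborLoader(
    data=(feature_store, graph_store),  # instead of a Data/HeteroData object
    num_neighbors=[10, 10],             # neighbors to sample per hop
    input_nodes='paper',                # seed node type; 'paper' is a placeholder
    batch_size=128,
)

for batch in loader:
    ...  # each mini-batch is a (Hetero)Data object holding the sampled subgraph
```

The loader only ever asks the stores for the sampled subgraph and its features, which is what keeps host memory bounded by the mini-batch rather than by the full graph.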
Implementations
- [ ] Implement a concrete `FeatureStore`, `GraphStore`, and `Sampler` with a popular backend to provide example usage (see the sketch after this list). Some thoughts here include a Ray `RandomAccessDataset` for a feature store and a Neo4j graph for a graph store.
- [ ] Implement a validation class that operates on `Tuple[FeatureStore, GraphStore]` to perform basic sanity checks (in a similar way that `Data` and `HeteroData` do today)
- [ ] Implement sampling from edges in the `HGTSampler`
- [ ] Implement (to the extent possible) the samplers in `torch_geometric/loader` (e.g. GraphSAINT, ShaDow) behind the sampler interface, enabling (a) easy extension to sampling from edges and (b) ease of extension to remote backends in the future
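To make the "concrete `FeatureStore`" item above more tangible, here is a minimal sketch backed by a plain in-memory dictionary standing in for a real remote backend. The class name and storage layout are made up for illustration, and the overridden method signatures follow the abstraction from #4807 as I understand it, so treat them as assumptions rather than a spec.

```python
from typing import List, Optional, Tuple

import torch

from torch_geometric.data import FeatureStore, TensorAttr


class DictFeatureStore(FeatureStore):
    r"""Sketch of a concrete feature store backed by an in-memory dict.

    A real backend (Ray, a key-value store, a database, ...) would replace the
    dictionary accesses below with remote reads/writes.
    """
    def __init__(self):
        super().__init__()
        # Maps (group_name, attr_name) -> tensor, e.g. ('paper', 'x') -> features.
        self.store = {}

    def _put_tensor(self, tensor: torch.Tensor, attr: TensorAttr) -> bool:
        self.store[(attr.group_name, attr.attr_name)] = tensor
        return True

    def _get_tensor(self, attr: TensorAttr) -> Optional[torch.Tensor]:
        tensor = self.store.get((attr.group_name, attr.attr_name))
        if tensor is not None and attr.index is not None:
            # Only materialize the requested rows (e.g. nodes of a sampled subgraph).
            return tensor[attr.index]
        return tensor

    def _remove_tensor(self, attr: TensorAttr) -> bool:
        return self.store.pop((attr.group_name, attr.attr_name), None) is not None

    def _get_tensor_size(self, attr: TensorAttr) -> Tuple[int, ...]:
        return tuple(self.store[(attr.group_name, attr.attr_name)].size())

    def get_all_tensor_attrs(self) -> List[TensorAttr]:
        return [
            TensorAttr(group_name=group, attr_name=name)
            for group, name in self.store.keys()
        ]
```

A Ray- or database-backed implementation would keep the same interface and simply replace the dictionary accesses with remote reads, ideally fetching only the rows selected by `attr.index`.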
Code Health
- [x] Implement a remote backend utility class to consolidate common methods across feature and graph stores (#5307)
- [ ] Consolidate conditionals for `Data`, `HeteroData`, and `Tuple[FeatureStore, GraphStore]` throughout the PyG codebase into a single conditional. This should be possible since both `Data` and `HeteroData` are `FeatureStore`s and `GraphStore`s (see the sketch below)
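One possible shape for that consolidation is a small normalization helper like the hypothetical `to_remote_tuple` below (the name and placement are illustrative only):

```python
from typing import Tuple, Union

from torch_geometric.data import Data, FeatureStore, GraphStore, HeteroData


def to_remote_tuple(
    data: Union[Data, HeteroData, Tuple[FeatureStore, GraphStore]],
) -> Tuple[FeatureStore, GraphStore]:
    # Hypothetical helper: collapse the three accepted input types into a
    # single (FeatureStore, GraphStore) pair that downstream code can rely on.
    if isinstance(data, tuple):
        return data
    # Data and HeteroData implement both abstractions (#4807, #4816), so the
    # same object can serve as both feature store and graph store.
    return data, data
```

Call sites could then branch once on the returned pair instead of re-checking the three input types everywhere.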
I think we should add this point to the roadmap
- Implement a concrete `FeatureStore` using some "popular" storage backend.

This will help us "test" the interface, and also demonstrate how people can build concrete `FeatureStore`s. WDYT?
Also, we could add:

- Since `FeatureStore` and `MaterializedGraph` are independent, it would be nice to have `validate(FeatureStore, MaterializedGraph)`, which checks things like: 1. `MaterializedGraph` only connects `node_type`s present in the `FeatureStore`; 2. `max(edge_index)` is bounded by the number of nodes in the `FeatureStore`.

Validate will mostly be an abstract class, with implementations overriding `__call__(FeatureStore, MaterializedGraph)` (see the sketch below).
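A rough sketch of that idea, written against the roadmap's `GraphStore` naming (the `MaterializedGraph` referred to in this comment); the class names, the assumption that each node type stores an `'x'` tensor, and the COO layout assumption are all illustrative, not part of the actual interface:

```python
from abc import ABC, abstractmethod

from torch_geometric.data import FeatureStore, GraphStore


class Validator(ABC):
    # Hypothetical base class: concrete validators override __call__.
    @abstractmethod
    def __call__(self, feature_store: FeatureStore,
                 graph_store: GraphStore) -> bool:
        raise NotImplementedError


class BasicValidator(Validator):
    def __call__(self, feature_store: FeatureStore,
                 graph_store: GraphStore) -> bool:
        # Node types known to the feature store (via TensorAttr.group_name).
        node_types = {
            attr.group_name for attr in feature_store.get_all_tensor_attrs()
        }

        for edge_attr in graph_store.get_all_edge_attrs():
            if edge_attr.edge_type is None:  # homogeneous graph: nothing to map
                continue

            # Check 1: edges only connect node types present in the feature store.
            src, _, dst = edge_attr.edge_type
            if src not in node_types or dst not in node_types:
                return False

            # Check 2: max(edge_index) is bounded by the node count of each
            # endpoint type. Assumes a COO layout and that every node type
            # stores an 'x' tensor whose first dimension is the node count.
            row, col = graph_store.get_edge_index(edge_attr)
            num_src = feature_store.get_tensor_size(group_name=src, attr_name='x')[0]
            num_dst = feature_store.get_tensor_size(group_name=dst, attr_name='x')[0]
            if int(row.max()) >= num_src or int(col.max()) >= num_dst:
                return False

        return True
```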
Yes, @wsad1, I think these are good points. One thing we could do to showcase this is to have a short example/tutorial on how to connect to a Neo4j graph database or similar.
Can we also add some clean-up tasks here? For example, relying more on the FeatureStore and MaterializedGraph interfaces than on BaseData.
@rusty1s @wsad1 thanks for those suggestions, agreed on both fronts. Will incorporate tomorrow :)
@mananshah99 just interested in what you're planning for
> Implement a concrete `FeatureStore` and `GraphStore` with a popular backend to provide example usage

What backend are you thinking of supporting?
(also I slightly updated the description to link to graphstore, hope you don't mind)
Hi folks, this roadmap has been updated a bit to describe latest changes and a few potential further directions (cc @Padarn, I hope this helps address some of your questions as well). Feel free to add on, or let me know if you have any questions/comments/concerns!
Hi team, I wonder if the current remote backend can support edge features. It would be great if we could access edge features, such as multi-class labels, in remote resources such as DBs.
cc @mananshah99
I love seeing Ray and Neo4j on these items! 😄
Are there any updates on these items? I don't see anything listed in the repo or in the docs.
I saw an example of using Kuzu for a remote `GraphStore` (via `feature_store, graph_store = db.get_torch_geometric_remote_backend(mp.cpu_count())`), but not anything for Neo4j (except https://neo4j.com/docs/graph-data-science-client/current/tutorials/import-sample-export-gnn/, which is more complicated than the Kuzu counterpart).
Thanks in advance!