Dataloader sampling blocks on a different graph
🚀 Feature
For link prediction dataloading it would be nice to have back the possibility of passing a sampling graph on which the sampling of blocks is performed (i.e. DGL 0.7's `g_sampling` inside `EdgeDataLoader`), separately from the graph on which the edge subgraph is constructed.
Motivation
In DGL 0.7 it was straightforward to implement a validation/test procedure via EdgeDataLoader, where one would use a val/test graph to get edge subgraphs batches, and a train graph to build the blocks structure. It would prevent information leakage issues via edges by construction (i.e. the val/test edges were guaranteed not to be present in blocks).
From DGL 0.8 onwards, if we want to implement such a test/val dataloader for link prediction tasks, we need to:
- Have a graph that contains the train, val and test edges, from which the edge subgraphs will be sampled.
- We can limit the edge subgraphs' edges to only val/test edges by passing the appropriate IDs to the dataloader class via the `indices` argument.
- But we can't easily sample blocks from a different graph that does not contain the val/test edges.
To sample blocks from a different graph, we need to either write a custom sampling class (where we would override the neighbor sampling function to sample from a second graph), or write a custom `exclude` function for `dgl.dataloading.as_edge_prediction_sampler` (where we would exclude all IDs belonging to the test/val graph, AS WELL AS all reverse IDs in the case of heterographs), which makes it unnecessarily complicated to solve this very common problem.
So what we want to achieve with this block sampling proposal is:
- Exclusion of edges that are in the test/val graph
- Sampling of edges only from the train graph
- Exclusion of reverse edges, given a reverse edge mapping
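The exclusion logic the proposal asks for can be sketched in plain Python. This is only an illustration of the semantics, not DGL code: in a real implementation the edge IDs would be tensors, and the helper names below (`build_exclude_set`, `filter_sampled_edges`) are hypothetical.

```python
def build_exclude_set(val_test_eids, reverse_eids=None):
    """Collect every edge ID that must never appear in sampled blocks:
    the val/test edges themselves plus, optionally, their reverse edges."""
    excluded = set(val_test_eids)
    if reverse_eids is not None:
        # reverse_eids maps each edge ID to the ID of its reverse edge
        excluded.update(reverse_eids[e] for e in val_test_eids)
    return excluded


def filter_sampled_edges(sampled_eids, excluded):
    """Drop excluded edges from a sampled neighborhood, keeping order."""
    return [e for e in sampled_eids if e not in excluded]


# Edges 2 and 3 are reverses of each other; excluding val edge 2 must
# also exclude its reverse, edge 3.
excluded = build_exclude_set([2], reverse_eids={0: 1, 1: 0, 2: 3, 3: 2})
print(filter_sampled_edges([0, 1, 2, 3, 4], excluded))  # [0, 1, 4]
```

Note that this filtering would have to run on every sampled neighborhood, which is exactly the per-step cost that `g_sampling` avoided.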
Alternatives
Right now, to ensure that there are no test/val edges inside the blocks, we can define one of the following:
- A custom `exclude` function for `dgl.dataloading.as_edge_prediction_sampler`.
- A custom subclass of `dgl.dataloading.NeighborSampler` that forces the sampling to be performed on a different graph.
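The second alternative boils down to a sampler that ignores the graph handed to it by the dataloader and samples from a fixed training graph instead. A real implementation would subclass `dgl.dataloading.NeighborSampler`; the toy base class and dict-based "graph" below are stand-ins to show only the pattern.

```python
class ToyNeighborSampler:
    """Stand-in for a DGL neighbor sampler. A 'graph' here is just a
    dict mapping each node to its list of neighbors."""

    def sample_blocks(self, graph, seed_nodes):
        # Toy "sampling": return the full neighborhood of each seed.
        return {n: graph.get(n, []) for n in seed_nodes}


class FixedGraphSampler(ToyNeighborSampler):
    """Always sample from a fixed training graph, ignoring the graph
    passed in by the dataloader (the 0.7 `g_sampling` behavior)."""

    def __init__(self, train_graph):
        self.train_graph = train_graph

    def sample_blocks(self, graph, seed_nodes):
        # `graph` (the full train+val+test graph) is deliberately ignored.
        return super().sample_blocks(self.train_graph, seed_nodes)


full_graph = {0: [1, 2, 3]}   # edge 0->3 is a val/test edge
train_graph = {0: [1, 2]}     # training edges only
sampler = FixedGraphSampler(train_graph)
print(sampler.sample_blocks(full_graph, [0]))  # {0: [1, 2]}
```

The val/test edge never enters the sampled block by construction, with no per-step exclusion needed; the price is keeping both graphs in memory.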
Neither of these is very intuitive, and both are error-prone compared to how it was done in DGL 0.7.
I think an alternative would be adding an argument `eids_always_exclude`, meaning "we need to exclude these edges no matter what":

```python
dgl.dataloading.as_edge_prediction_sampler(
    sampler,
    exclude='reverse',
    reverse_eids=reverse_eids,
    eids_always_exclude=torch.cat([val_edges, test_edges]),
)
```
One caveat of the alternative above: if one only wants to exclude certain edges and nothing else, then using `g_sampling` will be a bit faster, since the alternative still requires edge exclusion during every sampling step, whereas `g_sampling` does not.
@alexpod1000 Which one do you prefer?
cc @jermainewang .
EDIT: one con of `g_sampling` is that you will need two graphs, which could cost twice as much memory if the graph is huge.
Sorry for the late reply.
I also came to the same conclusion about `g_sampling`. I think the memory overhead of requiring two graphs would cause more issues than the `eids_always_exclude` option, especially for larger graphs, so it would be better to proceed with the exclude approach.
Since GraphBolt was released in 2024, can you take a look at https://docs.dgl.ai/stochastic_training/link_prediction.html to see whether it solves your issue? Feel free to reopen if the problem still persists.