
Dataloader sampling blocks on a different graph

Open alex0dd opened this issue 2 years ago • 2 comments

🚀 Feature

For link prediction data loading, it would be nice to get back the ability to pass a separate sampling graph on which block sampling is performed (i.e. the g_sampling argument of EdgeDataLoader in DGL 0.7), distinct from the graph on which the edge subgraph is constructed.

Motivation

In DGL 0.7 it was straightforward to implement a validation/test procedure via EdgeDataLoader: one would use the val/test graph to get batches of edge subgraphs, and the train graph to build the blocks. This prevented information leakage through edges by construction (i.e. the val/test edges were guaranteed not to be present in the blocks).

From DGL 0.8 onwards, implementing such a val/test dataloader for link prediction tasks looks like this:

  1. We need a graph that contains the train, val and test edges, from which the edge subgraphs are sampled.
  2. We can limit the edge subgraphs to val/test edges only by passing the appropriate IDs to the dataloader class via the indices argument.
  3. However, we can't easily sample blocks from a different graph that does not contain the val/test edges. To do so, we need to either write a custom sampling class (overriding the neighbor sampling function to sample from a second graph) or write a custom exclude function for dgl.dataloading.as_edge_prediction_sampler (excluding all IDs that belong to the val/test graph, as well as all reverse IDs in the case of heterographs). Either way, a very common problem becomes unnecessarily complicated to solve.

So what we want to achieve with this block sampling proposal is:

  1. Exclusion of edges that are in the val/test graph
  2. Sampling of edges only from the train graph
  3. Exclusion of reverse edges, given a reverse edge mapping
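
The three goals above can be illustrated on a toy edge list without DGL. This is a minimal plain-Python sketch, not DGL's API: `train_only_edges`, `edges`, `held_out` and `reverse_of` are hypothetical names, and edge IDs are assumed to be positions in the edge list.

```python
def train_only_edges(edges, held_out, reverse_of):
    """Return {edge_id: (src, dst)} for edges that blocks may be sampled from.

    edges      -- list of (src, dst) pairs; the list index is the edge ID
    held_out   -- iterable of val/test edge IDs
    reverse_of -- dict mapping an edge ID to the ID of its reverse edge
    """
    excluded = set(held_out)
    # Goal 3: the reverse of a held-out edge leaks the same link, so drop it too.
    excluded |= {reverse_of[e] for e in list(excluded) if e in reverse_of}
    # Goals 1 and 2: whatever remains is the train-only sampling graph.
    return {eid: uv for eid, uv in enumerate(edges) if eid not in excluded}


# Toy graph: edges 0/1, 2/3 and 4/5 are reverse pairs; edge 2 is a validation edge.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
reverse_of = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}
train = train_only_edges(edges, held_out=[2], reverse_of=reverse_of)
print(sorted(train))  # [0, 1, 4, 5]
```

Sampling neighbors only from `train` then guarantees, by construction, that neither the validation edge nor its reverse can appear in a block.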

Alternatives

Right now, to ensure that there are no test/val edges inside blocks we can define one of the following:

  1. A custom exclude function for dgl.dataloading.as_edge_prediction_sampler.
  2. A custom subclass of dgl.dataloading.NeighborSampler that will force the sampling to be performed on a different graph.

Neither of these seems very intuitive, and both are error-prone compared to how it was done in DGL 0.7.

alex0dd avatar Jun 01 '22 09:06 alex0dd

I think an alternative would be adding an argument eids_always_exclude saying "we need to exclude these edges no matter what".

dgl.dataloading.as_edge_prediction_sampler(sampler, exclude='reverse', reverse_eids=reverse_eids, eids_always_exclude=torch.cat([val_edges, test_edges]))
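
To make the proposed semantics concrete, here is a plain-Python sketch of the set of edges that would be hidden for one minibatch. `eids_always_exclude` is only the argument proposed in this comment, and `batch_exclude_ids` is a hypothetical helper, not a DGL function. It assumes that reverse-edge exclusion hides the seed edges of the minibatch together with their reverses.

```python
def batch_exclude_ids(seed_eids, reverse_eids, eids_always_exclude=()):
    """Edge IDs to hide when sampling blocks for one minibatch.

    seed_eids           -- edge IDs being predicted in this minibatch
    reverse_eids        -- sequence where reverse_eids[i] is the reverse of edge i
    eids_always_exclude -- edge IDs hidden in every minibatch (e.g. val/test)
    """
    hidden = set(seed_eids)
    hidden.update(reverse_eids[e] for e in seed_eids)  # reverses of the seeds
    hidden.update(eids_always_exclude)                 # hidden no matter what
    return sorted(hidden)


# Edges 2k and 2k+1 are reverses of each other; edges 6 and 7 are held out.
reverse_eids = [1, 0, 3, 2, 5, 4, 7, 6]
print(batch_exclude_ids([0, 4], reverse_eids, eids_always_exclude=[6, 7]))
# [0, 1, 4, 5, 6, 7]
```

In this sketch the caller is expected to put the reverse IDs of held-out edges into `eids_always_exclude` as well, if those should stay hidden too.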

One caveat of the alternative above: if one only wants to exclude certain edges and nothing else, g_sampling will be a bit faster, since the alternative still requires edge exclusion at every sampling step, whereas g_sampling does not. @alexpod1000 Which one do you prefer? cc @jermainewang.

EDIT: one con for g_sampling is that you will need two graphs, which could cost twice as much memory if the graph is huge.
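
The trade-off can be seen on a toy adjacency list: a separate sampling graph is filtered once but has to be stored next to the full graph, while exclusion re-filters the candidate neighbors on every sampling call. This is an illustrative plain-Python sketch only; `in_edges`, `neighbors_with_exclusion` and `build_sampling_graph` are hypothetical names, not DGL functions.

```python
def neighbors_with_exclusion(in_edges, node, excluded):
    """Filter on every call: no second graph, extra work per sampling step."""
    return [src for eid, src in in_edges.get(node, []) if eid not in excluded]


def build_sampling_graph(in_edges, excluded):
    """Filter once up front: a second adjacency structure held in memory."""
    return {
        node: [(eid, src) for eid, src in nbrs if eid not in excluded]
        for node, nbrs in in_edges.items()
    }


# in_edges[v] lists (edge_id, source) pairs for the edges pointing at v.
in_edges = {0: [(0, 1), (1, 2)], 1: [(2, 0), (3, 2)], 2: [(4, 0), (5, 1)]}
excluded = {1, 4}  # held-out edge 1 (2 -> 0) and its reverse, edge 4 (0 -> 2)

sampling_graph = build_sampling_graph(in_edges, excluded)
for v in in_edges:
    per_step = neighbors_with_exclusion(in_edges, v, excluded)
    precomputed = [src for _, src in sampling_graph[v]]
    assert per_step == precomputed  # both approaches expose the same neighbors
print([src for _, src in sampling_graph[0]])  # [1]
```

Both paths expose the same neighbors; they differ only in where the cost is paid, which is the memory versus per-step time trade-off discussed here.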

BarclayII avatar Jun 09 '22 01:06 BarclayII

Sorry for the late reply. I came to the same conclusion about g_sampling. I think the memory overhead of keeping two graphs would cause more problems than the eids_always_exclude option, especially for larger graphs, so it would be better to proceed with the exclude option.

alex0dd avatar Jun 14 '22 09:06 alex0dd

Since GraphBolt was released in 2024, can you take a look at https://docs.dgl.ai/stochastic_training/link_prediction.html to see whether it solves your issue? Feel free to reopen if the problem still persists.

frozenbugs avatar Apr 26 '24 06:04 frozenbugs