Segmentation fault when using GPUCachedFeature's sample code

Open AtoshDustosh opened this issue 7 months ago • 4 comments

(Link to the sample code)

https://www.dgl.ai/dgl_docs/_modules/dgl/graphbolt/impl/gpu_cached_feature.html#GPUCachedFeature

I made sure my installed dgl and pytorch versions are both 2.4.0+cu124. I run the program in a container.

import torch
from dgl import graphbolt as gb

torch_feat = torch.arange(10).reshape(2, -1).to("cuda")
cache_size = 5
fallback_feature = gb.TorchBasedFeature(torch_feat)
feature = gb.gpu_cached_feature(fallback_feature, cache_size)
# Segmentation fault occurs here.


feature.read()
feature.read(torch.tensor([0]).to("cuda"))
feature.update(
    torch.tensor([[1 for _ in range(5)]]).to("cuda"), torch.tensor([1]).to("cuda")
)
feature.read(torch.tensor([0, 1]).to("cuda"))
feature.size()

The program always crashes with a segmentation fault when calling "graphbolt.feature_store.wrap_with_cached_feature(...)".

AtoshDustosh avatar May 21 '25 10:05 AtoshDustosh

Try the following code:

import torch
from dgl import graphbolt as gb

torch_feat = torch.arange(10, pin_memory=True).reshape(2, -1)
cache_size = 5
fallback_feature = gb.TorchBasedFeature(torch_feat)
feature = gb.gpu_cached_feature(fallback_feature, cache_size)
# The segmentation fault happened because the fallback feature was already on the GPU
# while a GPU cache was being used to cache it. The fallback feature is expected to be on the CPU.


feature.read()
feature.read(torch.tensor([0]).to("cuda"))
feature.read(torch.tensor([0, 1]).to("cuda"))
feature.size()
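To make the placement requirement explicit, here is a minimal sketch of a guard that could be applied to the tensor before it is wrapped. It assumes the fix described in this reply (fallback data on the CPU, in pinned memory). `to_pinned_cpu` is a hypothetical helper name, not part of dgl or graphbolt; it relies only on the `torch.Tensor` members `is_cuda`, `cpu()`, `is_pinned()` and `pin_memory()`.

```python
def to_pinned_cpu(tensor):
    """Return `tensor` on the CPU in pinned memory (hypothetical helper).

    Accepts any object exposing the torch.Tensor members used below.
    """
    if tensor.is_cuda:
        # Per this thread, the fallback feature is expected to live on the CPU.
        tensor = tensor.cpu()
    if not tensor.is_pinned():
        # Pinned host memory mirrors `pin_memory=True` in the snippet above.
        tensor = tensor.pin_memory()
    return tensor
```

With the original reproduction, this would be used as `gb.TorchBasedFeature(to_pinned_cpu(torch_feat))` before calling `gb.gpu_cached_feature`. A tensor that is already on the CPU and pinned is returned unchanged.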

mfbalin avatar Jun 15 '25 23:06 mfbalin

You can see it running here: https://colab.research.google.com/drive/1fRuvM5GBqtogK8UDzlS-VpyoXeZ5zT8K?usp=sharing

mfbalin avatar Jun 15 '25 23:06 mfbalin

When gb.DataLoader is used with features, it calls read_async instead of read. read_async should have better support for more combinations of fallback feature placement, including two-level caching with the fallback feature on the SSD.

mfbalin avatar Jun 15 '25 23:06 mfbalin

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Jul 16 '25 01:07 github-actions[bot]