Segmentation fault when using GPUCachedFeature's sample code
(Link to the sample code)
https://www.dgl.ai/dgl_docs/_modules/dgl/graphbolt/impl/gpu_cached_feature.html#GPUCachedFeature
I made sure my installed DGL and PyTorch versions are both 2.4.0+cu124. The program is run in a container.
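For reference, the installed versions can be confirmed inside the container like so (a minimal check; both should print 2.4.0+cu124 here):
import torch
import dgl

print(torch.__version__)          # 2.4.0+cu124
print(dgl.__version__)            # 2.4.0+cu124
print(torch.cuda.is_available())  # True inside the container
Minimal reproduction: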
import torch
from dgl import graphbolt as gb
torch_feat = torch.arange(10).reshape(2, -1).to("cuda")
cache_size = 5
fallback_feature = gb.TorchBasedFeature(torch_feat)
# Segmentation fault occurs here, inside wrap_with_cached_feature().
feature = gb.gpu_cached_feature(fallback_feature, cache_size)
feature.read()
feature.read(torch.tensor([0]).to("cuda"))
feature.update(
    torch.tensor([[1 for _ in range(5)]]).to("cuda"), torch.tensor([1]).to("cuda")
)
feature.read(torch.tensor([0, 1]).to("cuda"))
feature.size()
The program always crashes with a segmentation fault when calling graphbolt.feature_store.wrap_with_cached_feature(...).
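This is how I confirmed the crash site, using the standard-library faulthandler module (a minimal sketch; it dumps the Python stack when the process receives a fatal signal such as SIGSEGV):
import faulthandler

# Dump the Python traceback on a fatal signal so the offending
# call (here, wrap_with_cached_feature) shows up in the output.
faulthandler.enable()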
Try the following code:
import torch
from dgl import graphbolt as gb
torch_feat = torch.arange(10, pin_memory=True).reshape(2, -1)
cache_size = 5
fallback_feature = gb.TorchBasedFeature(torch_feat)
feature = gb.gpu_cached_feature(fallback_feature, cache_size)
# The segmentation fault was happening because the fallback feature was already
# on the GPU while we use a GPU cache to cache it; we expect it to be on the CPU.
feature.read()
feature.read(torch.tensor([0]).to("cuda"))
feature.read(torch.tensor([0, 1]).to("cuda"))
feature.size()
You can see it running here: https://colab.research.google.com/drive/1fRuvM5GBqtogK8UDzlS-VpyoXeZ5zT8K?usp=sharing
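If your feature tensor is already on the GPU, move it back to pinned host memory before wrapping it. A minimal sketch based on the snippet above (assuming the tensor fits in host memory):
import torch
from dgl import graphbolt as gb

torch_feat = torch.arange(10).reshape(2, -1).to("cuda")  # accidentally on the GPU
# Move the tensor to pinned CPU memory so the GPU cache can fetch from it.
cpu_feat = torch_feat.cpu().pin_memory()
cache_size = 5
fallback_feature = gb.TorchBasedFeature(cpu_feat)
feature = gb.gpu_cached_feature(fallback_feature, cache_size)
feature.read(torch.tensor([0, 1]).to("cuda"))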
When gb.DataLoader is used with features, read_async is utilized instead of read. read_async should have better support for more combinations of fallback feature placement, including two-level caching with the fallback feature on the SSD.
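For completeness, a minimal sketch of the read_async pattern as documented on the Feature class (the generator is iterated once per stage, and the future yielded by the last stage is waited on for the result):
import torch
from dgl import graphbolt as gb

torch_feat = torch.arange(10, pin_memory=True).reshape(2, -1)
feature = gb.gpu_cached_feature(gb.TorchBasedFeature(torch_feat), 5)

ids = torch.tensor([0, 1], device="cuda")
# read_async returns a generator; iterating it runs the pipeline stages.
for stage, future in enumerate(feature.read_async(ids)):
    pass
# The number of stages depends on where the ids reside.
assert stage + 1 == feature.read_async_num_stages(ids.device)
result = future.wait()  # The read feature values.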