Pin memory problem on DGX-A100
❓ Questions and Help
I used a DGX-A100 (8*A100) to train GraphSAGE with UnifiedTensor, but something seems to be going wrong. I have thought about it a lot but cannot figure out the cause or how to fix it. What is the reason for this problem?
If the source code would help, I can send you mine. Thanks a lot.
I believe I've encountered this error. In my case, the issue was that DGL was attempting to pin a subgraph in which one of the relation types had an empty edge list.
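In case the same thing is happening here, a minimal sketch for spotting such a subgraph before pinning (assuming subg is the heterograph about to be pinned; the variable name is just for illustration):
# Hypothetical pre-pinning check: flag relation types with an empty edge
# list, since pinning a subgraph containing such a relation reportedly fails.
for etype in subg.canonical_etypes:
    if subg.num_edges(etype=etype) == 0:
        print(f"Relation {etype} has no edges; pinning this subgraph may fail.")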
Can you provide env information such as:
- DGL Version (e.g., 1.0):
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux):
- How you installed DGL (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Thanks for your reply.
- DGL Version (e.g., 1.0): dgl-cu111, 0.8.2.post1
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.10.1+cu111
- OS (e.g., Linux): Ubuntu 18.04
- How you installed DGL (conda, pip, source): conda
- Python version: 3.8
- CUDA/cuDNN version (if applicable): CUDA 11.2, cuDNN 8.0
- GPU models and configuration (e.g. V100): 8*A100
After some investigation, I think this issue is caused by the IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.
To verify it, you can run the following code to see if the error still happens.
import torch

# Create a small tensor, place it in shared memory, and page-lock (pin) it
# with cudaHostRegister -- the same underlying CUDA call DGL uses when
# pinning an array.
x = torch.arange(10).reshape(5, 2)
x.share_memory_()
cudart = torch.cuda.cudart()
r = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)
assert x.is_shared()
assert x.is_pinned()
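As an aside, if you want to check whether the IOMMU looks active on your machine, here is a rough sketch (not from the original suggestion; firmware settings vary, so treat the output only as a hint):
import os

# Heuristic check on Linux: if the kernel has registered IOMMU devices,
# /sys/class/iommu is non-empty. Boot parameters such as "amd_iommu=off"
# or "iommu=pt" in /proc/cmdline are also relevant hints.
iommu_dir = "/sys/class/iommu"
iommu_devices = os.listdir(iommu_dir) if os.path.isdir(iommu_dir) else []
print("IOMMU devices:", iommu_devices or "none")
with open("/proc/cmdline") as f:
    print("Kernel cmdline:", f.read().strip())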
From that link, it seems that I should disable IOMMU. I have tested this code and there is no error, so which API in this code disables the IOMMU? By the way, when I train GraphSAGE on a small graph, there is no error, but when I train on a large graph, the above error happens. It seems to be related to size. Is there a size limitation on pinned memory?
The code doesn't disable IOMMU. I just want to check if the problem is caused by DGL or not. This PyTorch code calls the same underlying CUDA API as DGL does.
By the way, when I train GraphSAGE on a small graph, there is no error.
If training with small graphs works well, it shouldn't be the IOMMU issue.
But when I train on a large graph, the above error happens. It seems to be related to size. Is there a size limitation on pinned memory?
The size of pinned memory cannot exceed the size of physical CPU RAM. How big is your data? Since you are using UnifiedTensor, can you try replacing dgl.contrib.UnifiedTensor(x, ...)
with cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)
and see if the error still happens?
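Not part of the original reply, but a rough way to compare a tensor's host-memory footprint with physical RAM (assuming Linux; train_nfeat here refers to the node feature tensor from the script discussed below):
import os
import torch

def tensor_gib(t: torch.Tensor) -> float:
    # Host-memory footprint of a tensor in GiB.
    return t.numel() * t.element_size() / 2**30

# Total physical RAM in GiB (Linux-only sysconf keys).
total_ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
print(f"features: {tensor_gib(train_nfeat):.1f} GiB, RAM: {total_ram_gib:.1f} GiB")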
I attempted to use https://github.com/yaox12/dgl/blob/uva_sampling/examples/pytorch/graphsage/train_sampling_multi_gpu.py to train on ogbn-papers100M, and I found that the error happens between line 263 and line 268. The full error is:
Process Process-5:
Traceback (most recent call last):
File "/root/anaconda3/envs/dgl/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/root/anaconda3/envs/dgl/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/multiprocessing/pytorch.py", line 33, in decorated_function
raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/multiprocessing/pytorch.py", line 21, in _queue_result
res = func(*args, **kwargs)
File "train_sampling_multi_gpu.py", line 74, in run
train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/contrib/unified_tensor.py", line 78, in __init__
self._array.pin_memory_()
File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 322, in pin_memory_
check_call(_LIB.DGLArrayPinData(self.handle))
File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/_ffi/base.py", line 65, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [04:20:34] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:183: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: OS call failed or operation not supported on this OS
Stack trace:
[bt] (0) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fb88691beaf]
[bt] (1) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::PinData(void*, unsigned long)+0xb4) [0x7fb886df1814]
[bt] (2) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::PinData(DLTensor*)+0x16f) [0x7fb886c6658f]
[bt] (3) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLArrayPinData+0x6) [0x7fb886c66606]
[bt] (4) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fba4776e9dd]
[bt] (5) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fba4776e067]
[bt] (6) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fba477871e9]
[bt] (7) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fba47787c95]
[bt] (8) python(_PyObject_MakeTpCall+0x3bf) [0x556b05cfd13f]
The same error appears for all processes.
Cannot reproduce... I ran with python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva
and it works well.
Can you change the following two lines
train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)
to
cudart = th.cuda.cudart()
cudart.cudaHostRegister(train_nfeat.data_ptr(),
train_nfeat.numel() * train_nfeat.element_size(), 0)
cudart.cudaHostRegister(train_labels.data_ptr(),
train_labels.numel() * train_labels.element_size(), 0)
It will be a bit slower, but it can help us figure out whether the error is caused by DGL or the OS.
Hello, after I replaced the code, there is no error. So it seems that something is wrong in DGL. By the way, when I run on V100, there is no error either, which confuses me.
Do you have ideas on this issue? @nv-dlasalle @davidmin7
Hello, could you give me a Docker image that is able to run python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva on an A100?