
pin memory problem in dgx-a100

Open zqj2333 opened this issue 2 years ago • 5 comments

❓ Questions and Help

I used a DGX-A100 (8×A100) to train GraphSAGE with UnifiedTensor, but it seems that something is wrong. I have thought about it a lot but cannot figure out the cause or how to fix it. What could be the reason for this problem?

zqj2333 avatar Aug 11 '22 14:08 zqj2333

If the source code would help, I can send it to you. Thanks a lot.

zqj2333 avatar Aug 11 '22 14:08 zqj2333

I believe I've encountered this error. In my case, the issue was that DGL was attempting to pin a subgraph in which one of the relation types has an empty edge list.
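
As a rough illustration (just a sketch; block stands for whatever sampled subgraph you are about to pin), you could check for that case like this:

# Rough check: does any relation type in this (hypothetical) sampled block
# have zero edges? That was the trigger for the pinning error in my case.
def has_empty_relation(block):
    return any(block.num_edges(etype) == 0 for etype in block.canonical_etypes)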

kkranen avatar Aug 11 '22 20:08 kkranen

Can you provide env information such as:

  • DGL Version (e.g., 1.0):
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

yaox12 avatar Aug 12 '22 06:08 yaox12

Thanks for your reply.

  • DGL Version (e.g., 1.0): dgl-cu111, 0.8.2.post1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.10.1+cu111
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed DGL (conda, pip, source): conda
  • Python version: 3.8
  • CUDA/cuDNN version (if applicable): CUDA 11.2, cuDNN 8.0
  • GPU models and configuration (e.g. V100): 8×A100
zqj2333 avatar Aug 13 '22 02:08 zqj2333

After some investigation, I think this issue is caused by IOMMU being enabled in the OS. See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux for more details.

To verify it, you can run the following code to see if the error still happens.

import torch

# Put the tensor in shared memory, then pin (page-lock) it in place through
# the CUDA runtime; this is the same underlying call DGL uses when pinning.
x = torch.arange(10).reshape(5, 2)
x.share_memory_()
cudart = torch.cuda.cudart()
r = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)
assert x.is_shared()
assert x.is_pinned()
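
If you also want to check whether IOMMU is enabled on your host, here is a rough sketch (assuming a typical Linux setup that exposes it under /sys/class/iommu and /proc/cmdline; adjust for your distribution):

import os

# Rough heuristic: a non-empty /sys/class/iommu usually means an IOMMU is
# active, and the kernel command line may show iommu/intel_iommu/amd_iommu flags.
iommu_dir = '/sys/class/iommu'
iommu_devs = os.listdir(iommu_dir) if os.path.isdir(iommu_dir) else []
print('IOMMU devices:', iommu_devs)
with open('/proc/cmdline') as f:
    print('Kernel cmdline:', f.read().strip())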

yaox12 avatar Aug 15 '22 02:08 yaox12

According to that link, it seems that I should disable IOMMU. I tested this code and there was no error, so which API in this code disables the IOMMU? By the way, when I train GraphSAGE on a small graph there is no error, but when I train on a large graph the above error happens. It seems to be related to size. Is there a size limit on pinned memory?

zqj2333 avatar Aug 17 '22 03:08 zqj2333

The code doesn't disable IOMMU. I just want to check if the problem is caused by DGL or not. This PyTorch code calls the same underlying CUDA API as DGL does.

By the way, when I train GraphSAGE on a small graph, there is no error.

If training with small graphs works well, it shouldn't be the IOMMU issue.

But when I train on a large graph, the above error happens. It seems to be related to size. Is there a size limit on pinned memory?

The amount of pinned memory cannot exceed the physical CPU RAM. How big is your data? Since you are using UnifiedTensor, can you try replacing dgl.contrib.UnifiedTensor(x, ...) with cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0) and see if the error still happens?
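
As a quick sanity check on the size (just a sketch for Linux, with a placeholder tensor standing in for your node features), you can compare the tensor size against the physical RAM reported by the OS:

import os
import torch

# Placeholder tensor standing in for the node features; replace with your data.
x = torch.empty(1_000_000, 128, dtype=torch.float32)

tensor_bytes = x.numel() * x.element_size()
ram_bytes = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES')
print(f'tensor: {tensor_bytes / 1e9:.2f} GB, physical RAM: {ram_bytes / 1e9:.2f} GB')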

yaox12 avatar Aug 17 '22 03:08 yaox12

I attempted to use https://github.com/yaox12/dgl/blob/uva_sampling/examples/pytorch/graphsage/train_sampling_multi_gpu.py to train on ogbn-papers100M, and I find that the error happens between line 263 and line 268. The full error is:

Process Process-5:
Traceback (most recent call last):
  File "/root/anaconda3/envs/dgl/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/root/anaconda3/envs/dgl/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/multiprocessing/pytorch.py", line 33, in decorated_function
    raise exception.__class__(trace)
dgl._ffi.base.DGLError: Traceback (most recent call last):
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/multiprocessing/pytorch.py", line 21, in _queue_result
    res = func(*args, **kwargs)
  File "train_sampling_multi_gpu.py", line 74, in run
    train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/contrib/unified_tensor.py", line 78, in __init__
    self._array.pin_memory_()
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 322, in pin_memory_
    check_call(_LIB.DGLArrayPinData(self.handle))
  File "/root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/_ffi/base.py", line 65, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [04:20:34] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:183: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: OS call failed or operation not supported on this OS
Stack trace:
  [bt] (0) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fb88691beaf]
  [bt] (1) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::PinData(void*, unsigned long)+0xb4) [0x7fb886df1814]
  [bt] (2) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::PinData(DLTensor*)+0x16f) [0x7fb886c6658f]
  [bt] (3) /root/anaconda3/envs/dgl/lib/python3.8/site-packages/dgl/libdgl.so(DGLArrayPinData+0x6) [0x7fb886c66606]
  [bt] (4) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fba4776e9dd]
  [bt] (5) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/../../libffi.so.7(+0x6067) [0x7fba4776e067]
  [bt] (6) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x319) [0x7fba477871e9]
  [bt] (7) /root/anaconda3/envs/dgl/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0x13c95) [0x7fba47787c95]
  [bt] (8) python(_PyObject_MakeTpCall+0x3bf) [0x556b05cfd13f]

The same error appears for all processes.

zqj2333 avatar Aug 17 '22 04:08 zqj2333

I cannot reproduce it... I ran python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva and it works well. Can you change the following two lines

        train_nfeat = dgl.contrib.UnifiedTensor(train_nfeat, device=device)
        train_labels = dgl.contrib.UnifiedTensor(train_labels, device=device)

to

        # Pin the tensors in place via the CUDA runtime instead of wrapping
        # them in dgl.contrib.UnifiedTensor.
        cudart = th.cuda.cudart()
        cudart.cudaHostRegister(train_nfeat.data_ptr(),
            train_nfeat.numel() * train_nfeat.element_size(), 0)
        cudart.cudaHostRegister(train_labels.data_ptr(),
            train_labels.numel() * train_labels.element_size(), 0)

It will be a bit slower, but it can help us figure out whether the error is caused by DGL or the OS.

yaox12 avatar Aug 17 '22 09:08 yaox12

Hello, after I replaced the code, there is no error, so it seems that something is wrong in DGL. By the way, when I run it on V100, there is no error, which makes me very confused.

zqj2333 avatar Aug 17 '22 15:08 zqj2333

Do you have any ideas on this issue? @nv-dlasalle @davidmin7

yaox12 avatar Aug 18 '22 01:08 yaox12

Hello~ Could you share a Docker image that can run python train_sampling_multi_gpu.py --gpu 0,1 --dataset ogbn-papers100M --data-device uva on A100?

zqj2333 avatar Aug 19 '22 13:08 zqj2333