
[Example] SEAL for OGBL

Open rudongyu opened this issue 3 years ago • 9 comments

Description

Checklist

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage
  • [x] Code is well-documented
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • [ ] Related issue is referred in this PR
  • [x] If the PR is for a new model/paper, I've updated the example index here.

Changes

  • [x] Add an example for running seal on OGBL

TODO for dgl core

  • [ ] DRNLTransform
  • [ ] Sampling
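
For reference, the DRNL (double-radius node labeling) transform listed above comes from the SEAL paper: each node in an enclosing subgraph is labeled from its shortest-path distances to the two target nodes. A minimal pure-Python sketch (toy adjacency dict; the function names are hypothetical, not the DGL API):

```python
from collections import deque

def bfs_dist(adj, src):
    """Shortest-path hop distances from src over an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def drnl_label(du, dv):
    """DRNL label for a node at hop distances (du, dv) from the target nodes."""
    d = du + dv
    return 1 + min(du, dv) + (d // 2) * (d // 2 + d % 2 - 1)

# Toy enclosing subgraph around the target link (0, 1).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
du, dv = bfs_dist(adj, 0), bfs_dist(adj, 1)
labels = {n: drnl_label(du[n], dv[n]) for n in adj}
```

Note this sketch omits SEAL's detail of computing each distance with the other target node temporarily removed; it only illustrates the labeling formula itself.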

rudongyu avatar Jul 25 '22 08:07 rudongyu

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch]; For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot avatar Jul 25 '22 08:07 dgl-bot

Commit ID: fdacec63849ce2faaa55d35c36e9643b9e855b68

Build ID: 1

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Jul 25 '22 08:07 dgl-bot

Done a first pass

mufeili avatar Jul 26 '22 08:07 mufeili

Commit ID: aee966b68ae6664f54dea945860a713c0d76d703

Build ID: 2

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 01 '22 07:08 dgl-bot

Have you tried re-running the scripts for a few epochs to verify successful running?

mufeili avatar Aug 01 '22 07:08 mufeili

Commit ID: a43f4df65c68d656c7f205d74ff993392d7de647

Build ID: 3

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 01 '22 07:08 dgl-bot

Have you tried re-running the scripts for a few epochs to verify successful running?

Yes, tested for a few epochs.

rudongyu avatar Aug 01 '22 08:08 rudongyu

See if you want to review this. @jermainewang @BarclayII

mufeili avatar Aug 01 '22 08:08 mufeili

Commit ID: 94929052e1ee8fd96dc686cec5a93ed0cef4e095

Build ID: 4

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 01 '22 08:08 dgl-bot

Commit ID: e104a6c63adb7f48f3c8299c2762ff2e810ad440

Build ID: 5

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 17 '22 05:08 dgl-bot

@rudongyu If this is ready for review, you can request a review from Quan and me. Also perhaps we will want to move the sampler to the core codebase.

mufeili avatar Aug 19 '22 06:08 mufeili

@rudongyu If this is ready for review, you can request a review from Quan and me. Also perhaps we will want to move the sampler to the core codebase.

Let me first change the sampler to the style of negative-edge-augmented graphs + eids.

I will move it to the core codebase after a sanity check.

rudongyu avatar Aug 19 '22 08:08 rudongyu

Commit ID: a52aa4e5153f9b4a0d37353a146093113713b9f8

Build ID: 6

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 22 '22 02:08 dgl-bot

Commit ID: 1c56f9a24e9c18d2714da1aaa596f659e3bc7b5a

Build ID: 7

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 22 '22 02:08 dgl-bot

Data loading with the dgl dataloader & sampler is nearly 4x slower in my tests (with almost the same per-batch computation). @BarclayII @mufeili Could either of you help check? I have provided a code snippet as test.py.
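
For context, a minimal timing harness of the kind test.py might contain (the names here are stand-ins, not the actual snippet; the idea is timing how long each pipeline takes to produce the same number of batches):

```python
import time

def time_loader(loader, n_batches=50):
    """Wall-clock seconds to pull n_batches batches from an iterable loader."""
    it = iter(loader)
    start = time.perf_counter()
    for _ in range(n_batches):
        next(it)
    return time.perf_counter() - start

# Stand-in for one of the two pipelines under comparison.
def naive_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

data = list(range(10_000))
t = time_loader(naive_batches(data, 32))
```

The same harness run against the dgl dataloader and against a hand-rolled loop would make the reported ~4x gap measurable.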

rudongyu avatar Aug 22 '22 02:08 rudongyu

Commit ID: 9ab4c72b8b60d98ce932458cd9bae9353d78d12a

Build ID: 8

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 22 '22 03:08 dgl-bot

Commit ID: b87beec000e4274dc1212c6ee18447da819f8c6a

Build ID: 9

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 22 '22 14:08 dgl-bot

Commit ID: 4d8c6a4984df82a417dbaff9f87be38c05956538

Build ID: 10

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 23 '22 07:08 dgl-bot

I ran main.py on the ogbl-collab dataset but got an exception at line 387, graph = graph.to_simple(copy_edata=True, aggregator='sum'). The error message is dgl._ffi.base.DGLError: [09:32:16] /opt/dgl/src/array/kernel.cc:399: Check failed: (feat->dtype).code == kDLFloat (

Ereboas avatar Aug 23 '22 09:08 Ereboas

Commit ID: 62c590039b864fc99fb443b710c2b7d3305ee723

Build ID: 11

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 24 '22 01:08 dgl-bot

I ran main.py on the ogbl-collab dataset but got an exception at line 387, graph = graph.to_simple(copy_edata=True, aggregator='sum'). The error message is dgl._ffi.base.DGLError: [09:32:16] /opt/dgl/src/array/kernel.cc:399: Check failed: (feat->dtype).code == kDLFloat (

This was a bug: integer-type aggregation is not yet supported in the to_simple transform. Fixed.
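
For illustration, what to_simple(copy_edata=True, aggregator='sum') does to duplicate edges can be sketched in pure Python (a toy stand-in, not the DGL implementation). Before the fix, a workaround was to cast integer edge features to float first, so the kDLFloat check passes:

```python
def to_simple_sum(edges, weights):
    """Collapse duplicate (u, v) edges, summing their weights."""
    agg = {}
    for (u, v), w in zip(edges, weights):
        agg[(u, v)] = agg.get((u, v), 0) + w
    simple_edges = list(agg)
    simple_weights = [agg[e] for e in simple_edges]
    return simple_edges, simple_weights

edges = [(0, 1), (0, 1), (1, 2)]
weights = [1, 2, 5]  # integer weights were what triggered the reported DGLError
# Cast-to-float workaround before summing, mirroring g.edata['w'].float() in DGL.
es, ws = to_simple_sum(edges, [float(w) for w in weights])
```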

rudongyu avatar Aug 24 '22 04:08 rudongyu

The PR is ready for another round of review. @jermainewang. However, the dataloading and the sampler implementations are a bit complicated in order to fit the current dataloader interface. I think it's better to refactor and move them into dgl core once #4444 is supported.

rudongyu avatar Aug 24 '22 04:08 rudongyu

Commit ID: 439482eb894958acdd4e2e1375fac184f6a44aea

Build ID: 12

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 24 '22 06:08 dgl-bot

Commit ID: ef37ca973af6113def011581d83e5b8c9533d005

Build ID: 13

Status: ❌ CI test failed in Stage [Torch GPU].

Report path: link

Full logs path: link

dgl-bot avatar Aug 30 '22 09:08 dgl-bot

Commit ID: dc82fa853a7574dfb70c82351c1e8cf48199a557

Build ID: 14

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 31 '22 02:08 dgl-bot

Commit ID: ff677461fdbf2d9559200295532e0847bdd4ba6d

Build ID: 15

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

dgl-bot avatar Aug 31 '22 02:08 dgl-bot

I simply modified the GPU device to run on, but got an error:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from process_events at /opt/conda/conda-bld/pytorch_1656352657443/work/c10/cuda/CUDACachingAllocator.cpp:1470 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd6f3fea477 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x25742 (0x7fd7215e3742 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x20e61 (0x7fd7215dee61 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x41308 (0x7fd7215ff308 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x41542 (0x7fd7215ff542 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #5: at::detail::empty_generic(c10::ArrayRef<long>, c10::Allocator*, c10::DispatchKeySet, c10::ScalarType, c10::optional<c10::MemoryFormat>) + 0x7bf (0x7fd7228b7daf in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::detail::empty_cuda(c10::ArrayRef<long>, c10::ScalarType, c10::optional<c10::Device>, c10::optional<c10::MemoryFormat>) + 0x115 (0x7fd72ddfdd75 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #7: at::detail::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x31 (0x7fd72ddfdfd1 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: at::native::empty_cuda(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1f (0x7fd72dedc82f in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: <unknown function> + 0x2b57e28 (0x7fd6f6b96e28 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #10: <unknown function> + 0x2b57e9b (0x7fd6f6b96e9b in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so)
frame #11: at::_ops::empty_memory_format::redispatch(c10::DispatchKeySet, c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0xe3 (0x7fd723385c83 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x200341f (0x7fd72361541f in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #13: at::_ops::empty_memory_format::call(c10::ArrayRef<long>, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x1b7 (0x7fd7233c3657 in /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::empty(c10::ArrayRef<long>, c10::TensorOptions, c10::optional<c10::MemoryFormat>) + 0xb1 (0x7fd6dda1c20f in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.12.0.so)
frame #15: torch::empty(c10::ArrayRef<long>, c10::TensorOptions, c10::optional<c10::MemoryFormat>) + 0x95 (0x7fd6dda1ddcc in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.12.0.so)
frame #16: TAempty + 0x119 (0x7fd6dda19a38 in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/tensoradapter/pytorch/libtensoradapter_pytorch_1.12.0.so)
frame #17: dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext) + 0xb6 (0x7fd6a3168f46 in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #18: dgl::aten::NewIdArray(long, DLContext, unsigned char) + 0x6d (0x7fd6a2ded90d in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #19: dgl::runtime::NDArray dgl::aten::impl::Range<(DLDeviceType)2, long>(long, long, DLContext) + 0x9a (0x7fd6a331058a in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #20: dgl::aten::Range(long, long, unsigned char, DLContext) + 0x1fd (0x7fd6a2dedc9d in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #21: dgl::UnitGraph::COO::Edges(unsigned long, std::string const&) const + 0x9b (0x7fd6a32c058b in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #22: dgl::UnitGraph::Edges(unsigned long, std::string const&) const + 0xa1 (0x7fd6a32bb251 in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #23: dgl::HeteroGraph::Edges(unsigned long, std::string const&) const + 0x2a (0x7fd6a31bac7a in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #24: <unknown function> + 0x73af6c (0x7fd6a31c3f6c in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #25: DGLFuncCall + 0x48 (0x7fd6a3147548 in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/libdgl.so)
frame #26: <unknown function> + 0x162ac (0x7fd6dd2152ac in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so)
frame #27: <unknown function> + 0x167db (0x7fd6dd2157db in /home/ubuntu/.local/lib/python3.9/site-packages/dgl/_ffi/_cy3/core.cpython-39-x86_64-linux-gnu.so)
frame #28: _PyObject_MakeTpCall + 0x347 (0x56361a329fa7 in /opt/conda/envs/pytorch/bin/python)
frame #29: <unknown function> + 0x69091 (0x56361a25e091 in /opt/conda/envs/pytorch/bin/python)
frame #30: <unknown function> + 0x12743 (0x7fd8265cc743 in /home/ubuntu/.vscode-server/extensions/ms-python.python-2022.12.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_frame_eval/pydevd_frame_evaluator.cpython-39-x86_64-linux-gnu.so)
frame #31: <unknown function> + 0x12aa17 (0x56361a31fa17 in /opt/conda/envs/pytorch/bin/python)
frame #32: <unknown function> + 0x14c328 (0x56361a341328 in /opt/conda/envs/pytorch/bin/python)
frame #33: <unknown function> + 0x1e5ed4 (0x56361a3daed4 in /opt/conda/envs/pytorch/bin/python)
frame #34: <unknown function> + 0x695e1 (0x56361a25e5e1 in /opt/conda/envs/pytorch/bin/python)
frame #35: <unknown function> + 0x12743 (0x7fd8265cc743 in /home/ubuntu/.vscode-server/extensions/ms-python.python-2022.12.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_frame_eval/pydevd_frame_evaluator.cpython-39-x86_64-linux-gnu.so)
frame #36: <unknown function> + 0x12aa17 (0x56361a31fa17 in /opt/conda/envs/pytorch/bin/python)
frame #37: <unknown function> + 0x14c328 (0x56361a341328 in /opt/conda/envs/pytorch/bin/python)
frame #38: PyObject_Call + 0xb4 (0x56361a341a84 in /opt/conda/envs/pytorch/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x39a6 (0x56361a324396 in /opt/conda/envs/pytorch/bin/python)
frame #40: <unknown function> + 0x12743 (0x7fd8265cc743 in /home/ubuntu/.vscode-server/extensions/ms-python.python-2022.12.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_frame_eval/pydevd_frame_evaluator.cpython-39-x86_64-linux-gnu.so)
frame #41: <unknown function> + 0x12aa17 (0x56361a31fa17 in /opt/conda/envs/pytorch/bin/python)
frame #42: _PyFunction_Vectorcall + 0xb9 (0x56361a331ff9 in /opt/conda/envs/pytorch/bin/python)
frame #43: _PyObject_FastCallDictTstate + 0x1a5 (0x56361a329745 in /opt/conda/envs/pytorch/bin/python)
frame #44: _PyObject_Call_Prepend + 0x69 (0x56361a33e5e9 in /opt/conda/envs/pytorch/bin/python)
frame #45: <unknown function> + 0x21ea85 (0x56361a413a85 in /opt/conda/envs/pytorch/bin/python)
frame #46: _PyObject_MakeTpCall + 0x347 (0x56361a329fa7 in /opt/conda/envs/pytorch/bin/python)
frame #47: <unknown function> + 0x683df (0x56361a25d3df in /opt/conda/envs/pytorch/bin/python)
frame #48: <unknown function> + 0x12743 (0x7fd8265cc743 in /home/ubuntu/.vscode-server/extensions/ms-python.python-2022.12.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_frame_eval/pydevd_frame_evaluator.cpython-39-x86_64-linux-gnu.so)
frame #49: <unknown function> + 0x12aa17 (0x56361a31fa17 in /opt/conda/envs/pytorch/bin/python)
frame #50: _PyFunction_Vectorcall + 0xb9 (0x56361a331ff9 in /opt/conda/envs/pytorch/bin/python)
frame #51: <unknown function> + 0x1e5ed4 (0x56361a3daed4 in /opt/conda/envs/pytorch/bin/python)
frame #52: <unknown function> + 0x695e1 (0x56361a25e5e1 in /opt/conda/envs/pytorch/bin/python)
frame #53: <unknown function> + 0x12743 (0x7fd8265cc743 in /home/ubuntu/.vscode-server/extensions/ms-python.python-2022.12.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_frame_eval/pydevd_frame_evaluator.cpython-39-x86_64-linux-gnu.so)
frame #54: <unknown function> + 0x12aa17 (0x56361a31fa17 in /opt/conda/envs/pytorch/bin/python)
frame #55: _PyEval_EvalCodeWithName + 0x47 (0x56361a31f6d7 in /opt/conda/envs/pytorch/bin/python)
frame #56: PyEval_EvalCodeEx + 0x39 (0x56361a31f689 in /opt/conda/envs/pytorch/bin/python)
frame #57: PyEval_EvalCode + 0x1b (0x56361a3dae3b in /opt/conda/envs/pytorch/bin/python)
frame #58: <unknown function> + 0x1ea8fd (0x56361a3df8fd in /opt/conda/envs/pytorch/bin/python)
frame #59: <unknown function> + 0x13d991 (0x56361a332991 in /opt/conda/envs/pytorch/bin/python)
frame #60: <unknown function> + 0x1e5ed4 (0x56361a3daed4 in /opt/conda/envs/pytorch/bin/python)
frame #61: <unknown function> + 0x69091 (0x56361a25e091 in /opt/conda/envs/pytorch/bin/python)
frame #62: <unknown function> + 0x12743 (0x7fd8265cc743 in /home/ubuntu/.vscode-server/extensions/ms-python.python-2022.12.1/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_frame_eval/pydevd_frame_evaluator.cpython-39-x86_64-linux-gnu.so)
frame #63: <unknown function> + 0x12aa17 (0x56361a31fa17 in /opt/conda/envs/pytorch/bin/python)

The problem happened at Lines 466-471:

    num_nodes = []  # list of subgraph sizes, initialized earlier in main.py
    for subgs, _ in train_loader:
        subgs = dgl.unbatch(subgs)
        if len(num_nodes) > 1000:
            break
        for subg in subgs:
            num_nodes.append(subg.num_nodes())

I tried decreasing batch_size/num_workers, but it didn't help. Then I monitored GPU utilization and found that even though I set CUDA:1 as the device, CUDA:0 showed abnormal, non-negligible memory usage.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   47C    P0    60W / 300W |    728MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   42C    P0    57W / 300W |    898MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   44C    P0    46W / 300W |      3MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
......

My environment is:

dgl-cu116: 0.9.0
python: 3.9
pytorch: 1.12.0

Does someone know how to solve this issue?
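
(For readers hitting the same symptom: one common mitigation, offered here as an assumption rather than a confirmed fix for this report, is to restrict which physical GPUs the process can see before torch/dgl initialize CUDA, so that cuda:0 inside the process maps to the intended device and nothing can stray onto physical GPU 0.)

```python
import os

# Must run before any CUDA library is imported/initialized.
# Physical GPU 1 then appears inside the process as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

device = "cuda:0"  # refers to physical GPU 1 under the mapping above
```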

Ereboas avatar Sep 02 '22 14:09 Ereboas

I simply modified the GPU device to run on, but got an error: [...]

Does someone know how to solve this issue?

How did you trigger this? @rudongyu for awareness

mufeili avatar Sep 04 '22 04:09 mufeili