
Multi-GPU memory problems during HashSet initialization

Open · MingChaoSun opened this issue 3 years ago

Describe the issue

I have found some GPU memory problems during HashSet initialization.

I am using Open3D 0.15.2 installed through pip.

When I have multiple GPUs (tested on 8) and initialize a hashset through open3d.core.HashSet, it allocates some GPU memory (about 389 MiB) on each card. You can easily reproduce this problem with script 1 below (o3d_mem_test.py).

Processes displayed by nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    104246      C   python3                          391MiB  |
|    1   N/A  N/A    104246      C   python3                          389MiB  |
|    2   N/A  N/A    104246      C   python3                          389MiB  |
|    3   N/A  N/A    104246      C   python3                          389MiB  |
|    4   N/A  N/A    104246      C   python3                          389MiB  |
|    5   N/A  N/A    104246      C   python3                          389MiB  |
|    6   N/A  N/A    104246      C   python3                          389MiB  |
|    7   N/A  N/A    104246      C   python3                          389MiB  |
+-----------------------------------------------------------------------------+
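In the single-process case, a common way to keep a CUDA library off the other cards is to hide them before the CUDA runtime initializes. This is a hedged workaround sketch, not an Open3D API: pin_single_gpu is a hypothetical helper, and it only helps if it runs before importing open3d (or any other CUDA library) in the process.

```python
import os

def pin_single_gpu(gpu_id):
    """Hide every GPU except `gpu_id` from this process.

    Must run before any CUDA library (Open3D, PyTorch, ...) initializes
    the driver; once CUDA is initialized the variable has no effect.
    The chosen physical GPU then appears in-process as 'CUDA:0'.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return "CUDA:0"  # logical device string for o3c.Device(...)

device_str = pin_single_gpu(3)  # physical GPU 3 becomes logical CUDA:0
# import open3d.core as o3c    # import *after* pinning
# hashset = o3c.HashSet(init_capacity=1000,
#                       key_dtype=o3c.int64,
#                       key_element_shape=o3c.SizeVector((1,)),
#                       device=o3c.Device(device_str))
```

Whether this avoids the per-device allocation here depends on when Open3D creates its CUDA contexts, so it needs verifying against the script above.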

When I use the hashset in a multi-process program (a typical usage is with torch.distributed; in this case, restricting CUDA_VISIBLE_DEVICES is not an option), each process seems to take up some memory on every GPU. For example, when I start 8 processes on 8 GPUs, each process takes about 390 MiB on each of the 8 GPUs, so about 3120 MiB of GPU memory ends up allocated on every GPU. You can easily reproduce this problem with script 2 below (o3d_dist_mem_test.py).

Processes displayed by nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    114962      C   /home/tops/bin/python3           391MiB  |
|    0   N/A  N/A    114963      C   /home/tops/bin/python3           389MiB  |
|    0   N/A  N/A    114964      C   /home/tops/bin/python3           389MiB  |
|    0   N/A  N/A    114965      C   /home/tops/bin/python3           389MiB  |
|    0   N/A  N/A    114966      C   /home/tops/bin/python3           389MiB  |
|    0   N/A  N/A    114967      C   /home/tops/bin/python3           389MiB  |
|    0   N/A  N/A    114968      C   /home/tops/bin/python3           389MiB  |
|    0   N/A  N/A    114969      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114962      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114963      C   /home/tops/bin/python3           391MiB  |
|    1   N/A  N/A    114964      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114965      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114966      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114967      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114968      C   /home/tops/bin/python3           389MiB  |
|    1   N/A  N/A    114969      C   /home/tops/bin/python3           389MiB  |
|    2   N/A  N/A    114962      C   /home/tops/bin/python3           389MiB  |
|    2   N/A  N/A    114963      C   /home/tops/bin/python3           389MiB  |
|    2   N/A  N/A    114964      C   /home/tops/bin/python3           391MiB  |
  ...                                                                  ...
|    6   N/A  N/A    114968      C   /home/tops/bin/python3           391MiB  |
|    6   N/A  N/A    114969      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114962      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114963      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114964      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114965      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114966      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114967      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114968      C   /home/tops/bin/python3           389MiB  |
|    7   N/A  N/A    114969      C   /home/tops/bin/python3           391MiB  |
+-----------------------------------------------------------------------------+
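For the torch.distributed case the report notes that a global CUDA_VISIBLE_DEVICES is not an option, but pinning it per worker process can still work: each rank hides every GPU except its own before any CUDA import, and NCCL only needs each rank's own device. A hedged sketch, with pin_rank_gpu as a hypothetical helper; whether this plays well with a given launcher setup needs verifying.

```python
import os

def pin_rank_gpu(local_rank, gpus):
    """Map a distributed local rank to one physical GPU and hide the rest.

    Must run in the worker before importing torch or open3d; inside the
    process the selected card is then logical device 0.
    """
    gpu_list = [g.strip() for g in gpus.split(",") if g.strip()]
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_list[local_rank]
    return "CUDA:0"

# Example: rank 2 of an 8-process job pins physical GPU '2'.
device_str = pin_rank_gpu(2, "0,1,2,3,4,5,6,7")

# Without pinning, every process touches every GPU: 8 processes at
# ~390 MiB each put roughly 8 * 390 = 3120 MiB on every card, which
# matches the nvidia-smi output above.
per_gpu_mib = 8 * 390
```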

Thanks for the great Open3D, which has helped us a lot. I wonder if there is any way to disable this GPU memory allocation? Thanks.

Steps to reproduce the bug

# reproduce script 1: o3d_mem_test.py #
import open3d.core as o3c
import time
import subprocess

# python3 o3d_mem_test.py
# each GPU on the machine uses about 389 MiB
o3c_device = o3c.Device('CUDA:0')
hashset = o3c.HashSet(init_capacity=1000,
                      key_dtype=o3c.int64,
                      key_element_shape=o3c.SizeVector((1,)),
                      device=o3c_device)

print('hashset init done, use nvidia-smi check gpu mem:')
subprocess.Popen('nvidia-smi', shell=True)
time.sleep(2)

# reproduce script 2: o3d_dist_mem_test.py #
import os
import argparse
import time
import subprocess
import torch
import open3d.core as o3c


# 2 processes on 2 GPUs; each GPU ends up with about 782 MiB allocated
# python3 -m torch.distributed.launch --nproc_per_node=2 --master_port=$RANDOM o3d_dist_mem_test.py --gpus '0,1'
# 4 processes on 4 GPUs; each GPU ends up with about 1562 MiB allocated
# python3 -m torch.distributed.launch --nproc_per_node=4 --master_port=$RANDOM o3d_dist_mem_test.py --gpus '0,1,2,3'
# 8 processes on 8 GPUs; each GPU ends up with about 3120 MiB allocated
# python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=$RANDOM o3d_dist_mem_test.py --gpus '0,1,2,3,4,5,6,7'
def main():
    parser = argparse.ArgumentParser(description='torch.distributed.launch + Open3D HashSet gpu mem Test')
    parser.add_argument("--local_rank", type=int, default=-1, help='auto-filled distributed rank (GPU id)')
    parser.add_argument("--gpus", type=str, default='0,1', required=True, help='GPUs to use.')
    args = parser.parse_args()

    # GPU config
    os.environ['CUDA_VISIBLE_DEVICES'] = args.gpus
    assert torch.cuda.is_available(), 'torch.cuda.is_available() False'
    # distribute init
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    assert torch.distributed.get_rank() == args.local_rank, 'args.local_rank != torch.distributed.get_rank()'
    device = torch.device("cuda", args.local_rank)
    torch.cuda.set_device(device)

    print('device:{:s}, init hashset start.'.format(str(device)))
    o3c_device = o3c.Device(str(device))
    hashset = o3c.HashSet(init_capacity=1000,
                          key_dtype=o3c.int64,
                          key_element_shape=o3c.SizeVector((1,)),
                          device=o3c_device)
    print('device:{:s}, hashset init done, use nvidia-smi check gpu mem:'.format(str(device)))
    subprocess.Popen('nvidia-smi', shell=True)
    time.sleep(20)


if __name__ == '__main__':
    main()

Error message

None

Expected behavior

When specifying a device like o3c.Device('CUDA:gpu_id'), only the GPU I specify should be used.
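The expectation above can be checked mechanically from the nvidia-smi process table: for a given PID, collect the GPU indices it appears on and confirm the set has exactly one element. A rough sketch; gpus_used_by is a hypothetical helper and the table layout can differ between driver versions.

```python
import re

def gpus_used_by(pid, smi_text):
    """Return the set of GPU indices on which `pid` appears in the
    Processes table of nvidia-smi output (rough parser for the default
    layout; the exact format varies between driver versions)."""
    gpus = set()
    for line in smi_text.splitlines():
        m = re.match(r"\|\s+(\d+)\s+\S+\s+\S+\s+(\d+)\s+", line)
        if m and int(m.group(2)) == pid:
            gpus.add(int(m.group(1)))
    return gpus

# Two sample rows in the shape reported above:
sample = ("|    0   N/A  N/A    104246      C   python3    391MiB |\n"
          "|    1   N/A  N/A    104246      C   python3    389MiB |\n")

result = gpus_used_by(104246, sample)
# With the bug, result is {0, 1}; the expected behavior would be a
# single-element set matching the device passed to o3c.Device.
```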

Open3D, Python and System information

- Operating system: Ubuntu 20.04
- Python version: Python 3.8
- Open3D version: 0.15.2
- System architecture: x86
- Is this a remote workstation?: yes
- How did you install Open3D?: pip

Additional information

No response

MingChaoSun · Jun 07 '22 05:06

Hi, I also have problems with multi GPU. For example, ./OnlineSLAMRGBD --device CUDA:0 works fine, but when using CUDA:1 it crashes right after it starts processing:

(Open3D_15) ola@dig6:~/Proj2/Open3D_15/Open3D/build/bin/examples$ ./OnlineSLAMRGBD --device CUDA:0
[Open3D INFO] Using device CUDA:0.
[Open3D INFO] Using Primesense default intrinsics.
FEngine (64 bits) created at 0x7f8275314010 (threading is enabled)
FEngine resolved backend: OpenGL
[Open3D INFO] Writing reconstruction to scene.ply...
[Open3D INFO] Writing trajectory to trajectory.log...

(Open3D_15) ola@dig6:~/Proj2/Open3D_15/Open3D/build/bin/examples$ ./OnlineSLAMRGBD --device CUDA:1
[Open3D INFO] Using device CUDA:1.
[Open3D INFO] Using Primesense default intrinsics.
FEngine (64 bits) created at 0x7f584c2d5010 (threading is enabled)
FEngine resolved backend: OpenGL
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  tabulate: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Aborted (core dumped)

System: CUDA 11.4, Ubuntu 20.04, Open3D 0.15.2, C++, built from source.

Kind regards Ola

olagt · Sep 06 '22 14:09

I've met a similar multi-GPU problem. I can't force Open3D to use only one GPU. Have you solved this?

Buffyqsf · Dec 29 '23 06:12