EmbodiedScan
[Bug] Why does mvdet use so much RAM during evaluation? It shoots above 500 GB and crashes the program
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA H100 80GB HBM3
CUDA_HOME: /fs/applications/cuda/12.1.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
PyTorch: 2.2.1+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.1+cu121
OpenCV: 4.9.0
MMEngine: 0.10.3
MMDetection: 3.3.0
MMDetection3D: 1.4.0+
spconv2.0: False
Reproduces the problem - code sample
Reproduces the problem - command or script
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py
Reproduces the problem - error message
The job scheduler indicates TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Additional information
Evaluation shouldn't use that much memory; 500+ GB is excessive!
Can you please describe the problem in more detail? Do you mean CUDA OOM?
Not CUDA memory. RAM usage exceeds 500 GB; it is host memory, not GPU memory.
That's interesting; the entire EmbodiedScan dataset only takes up about 300 GB. Are you sure no other programs are taking up RAM?
@mxh1999 The model itself has to use some RAM too. I'm quite sure your machine has more than 500 GB of memory, so you didn't notice the usage. Could you check what the peak memory usage is? It shoots up to more than 500 GB during the evaluation phase.
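One quick way to confirm the spike, assuming a Linux machine, is to log the process's peak resident set size around the evaluation phase. This is only a sketch; with DDP each rank reports its own peak, so the host total is the sum over ranks:

```python
# Minimal sketch: report the peak resident memory of the current process.
# On Linux, ru_maxrss is reported in kilobytes.
import resource

def log_peak_rss(tag=''):
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
    print(f'[{tag}] peak RSS of this process: {peak_gb:.1f} GB')

log_peak_rss('after evaluation')
```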
I have also encountered such cases recently, though seldom, and we may take a closer look at this issue. In the meantime, we welcome more clues/information about this problem to help us locate it more quickly.
I met this RAM issue as well when training the 3D detector with DDP on 8 GPUs: it costs more than 500 GB of RAM, which hits my RAM limit, and the DDP processes fail. When training with 4 GPUs and a 4x4 batch size, it costs nearly 400 GB of RAM.
I am not sure what the situation is when launching with Slurm.
I am looking forward to any progress on this issue.
The issue seems to be that the mmengine dataset saves a copy of data_list in RAM for every GPU rank during dataset initialization. A quick patch was to put this data info list in shared memory; however, data time may be affected by this. Looking forward to a better fix.
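A rough way to check how much of this is the serialized annotation list, assuming mmengine's BaseDataset built with serialize_data=True (which stores the pickled data_list in `dataset.data_bytes` as a uint8 array); the helper below is made up for illustration:

```python
# Hypothetical helper: estimate RAM held by the pickled data_list, per rank
# and across N independent DDP workers (i.e. without a shared-memory patch).
def report_serialized_size(dataset, num_ranks=8):
    # `dataset.data_bytes` exists when mmengine's BaseDataset was built with
    # serialize_data=True; it is the pickled data_list as a uint8 numpy array.
    size_gb = dataset.data_bytes.nbytes / 1024 ** 3
    print(f'serialized data_list: {size_gb:.1f} GB per rank, '
          f'~{size_gb * num_ranks:.1f} GB across {num_ranks} ranks')
```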
I have met the same issue too. When I train the model on a server with 700 GB of memory, everything is fine. When I move to a server with 200 GB of memory, it always triggers a kernel OOM kill before finishing epoch 1.
@henryzhengr Hi, could you share your solution code? That would help me a lot. Thank you.
Solution
For a quick solution, just replace the following lines with the code below: https://github.com/OpenRobotLab/EmbodiedScan/blob/67110231a8759009ca822ff3f2b3ed577674903b/embodiedscan/datasets/embodiedscan_dataset.py#L59-L64
Make sure SharedArray is installed in your environment (e.g. `pip install SharedArray`).
Code (the `super().__init__(...)` call goes inside `EmbodiedScanDataset.__init__`, and `share_serialize_data` is added as a method of the same class):

    # Module-level imports assumed by this patch:
    #   import os
    #   import SharedArray
    #   import mmengine.dist                (for mmengine.dist.get_rank / get_world_size)
    #   import torch.distributed as dist    (or: from mmengine import dist)

        # Inside __init__: build the dataset without per-rank serialization,
        # then move the serialized data list into shared memory.
        super().__init__(ann_file=ann_file,
                         metainfo=metainfo,
                         data_root=data_root,
                         pipeline=pipeline,
                         test_mode=test_mode,
                         serialize_data=False,
                         **kwargs)
        self.share_serialize_data()

    def share_serialize_data(self):
        cur_rank, num_gpus = mmengine.dist.get_rank(), mmengine.dist.get_world_size()
        if cur_rank == 0:
            print('Rank 0 initialized the data')
            if os.path.exists('/dev/shm/embodiedscan_data_bytes'):
                # A previous run already placed the serialized list in shared memory.
                self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
                self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
            else:
                # Serialize once on rank 0 and copy the result into shared memory.
                self.data_bytes, self.data_address = self._serialize_data()
                print('Loading training data to shared memory (file limit not set)')
                data_bytes_shm_arr = SharedArray.create(
                    'shm://embodiedscan_data_bytes',
                    self.data_bytes.shape, dtype=self.data_bytes.dtype)
                data_bytes_shm_arr[...] = self.data_bytes[...]
                data_bytes_shm_arr.flags.writeable = False
                data_address_shm_arr = SharedArray.create(
                    'shm://embodiedscan_data_address',
                    self.data_address.shape, dtype=self.data_address.dtype)
                data_address_shm_arr[...] = self.data_address[...]
                data_address_shm_arr.flags.writeable = False
                print('Training data list has been saved to shared memory')
            dist.barrier()
        else:
            # Other ranks wait for rank 0, then attach instead of re-serializing.
            dist.barrier()
            print(f'Reading training data from shm. rank {cur_rank}')
            self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
            self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
            print(f'Done reading training data. rank {cur_rank}')
        self.serialize_data = True
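As a quick sanity check (assuming the shared-array names used above), one can verify after rank 0 finishes that the arrays are visible under /dev/shm and attachable:

```python
# Sanity check: the shared arrays created by the patch should show up in
# /dev/shm and be attachable from any process on the same node.
import os
import SharedArray

print([f for f in os.listdir('/dev/shm') if 'embodiedscan' in f])
data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
print(f'serialized data_list in shm: {data_bytes.nbytes / 1024 ** 3:.1f} GB')
```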
Advantage
In the original code, the more GPUs you use, the more RAM the program consumes, because every rank holds its own copy of the serialized data list. This patch therefore saves RAM for distributed training on multiple GPUs, but not for single-GPU training.
Disadvantages and points to note
- Cleanup: Make sure to unlink the shared arrays upon program termination to avoid leaking /dev/shm memory (see the cleanup sketch after this list).
- Data-loading bottlenecks: In some cases I've encountered slowdowns in data loading; I'm unsure of the cause yet, but restarting the program resolves it for me.
- Timeout issue: Currently only rank 0 processes the data and transfers it to shared memory, which sometimes raises exceptions when the other ranks wait longer than the distributed timeout. A more efficient approach might be to split the work across all GPUs so that each rank handles and transfers a portion of the data independently. (Or, lazily, just increase the timeout :-) )
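For the cleanup point above, a minimal sketch (assuming the shared-array names used in the patch) that can be run once per node, e.g. on rank 0 after training or registered with atexit:

```python
# Minimal cleanup sketch: unlink the shared arrays so /dev/shm is not leaked
# across runs. Safe to call even if the arrays were never created.
import SharedArray

def cleanup_shared_data():
    for name in ('embodiedscan_data_bytes', 'embodiedscan_data_address'):
        try:
            SharedArray.delete(f'shm://{name}')
        except OSError:
            pass  # already removed or never created on this node

cleanup_shared_data()
```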
Thanks @henryzhengr for providing a workaround. I will close this issue for now and welcome further discussion if there are new observations.