EmbodiedScan
[Bug] Why does mvdet use so much RAM during evaluation? It shoots above 500 GB and crashes the program
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
sys.platform: linux
Python: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA H100 80GB HBM3
CUDA_HOME: /fs/applications/cuda/12.1.1
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)
PyTorch: 2.2.1+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.9.2
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.17.1+cu121
OpenCV: 4.9.0
MMEngine: 0.10.3
MMDetection: 3.3.0
MMDetection3D: 1.4.0+
spconv2.0: False
Reproduces the problem - code sample
Reproduces the problem - command or script
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py
Reproduces the problem - error message
The job scheduler indicates TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Additional information
Evaluation shouldn't use that much memory; 500+ GB is excessive!
Can you please describe the problem in more detail? Do you mean CUDA OOM?
Not CUDA memory. RAM usage exceeds 500 GB; it is host memory, not GPU memory.
That's interesting; the entire EmbodiedScan dataset only takes up about 300 GB. Are you sure no other programs are taking up RAM?
@mxh1999 The model itself has to use some RAM too. I'm quite sure your machine has more than 500 GB of memory, so you didn't notice the usage. Could you check what the peak memory usage is? It shoots up to more than 500 GB during the evaluation phase.
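One quick way to confirm the spike, assuming a Linux machine, is to log the process's peak resident set size around the evaluation phase. This is only a sketch; with DDP each rank reports its own peak, so the host total is the sum over ranks:

```python
# Minimal sketch: report the peak resident memory of the current process.
# On Linux, ru_maxrss is reported in kilobytes.
import resource

def log_peak_rss(tag=''):
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 ** 2
    print(f'[{tag}] peak RSS of this process: {peak_gb:.1f} GB')

log_peak_rss('after evaluation')
```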
I have also encountered such cases recently, though seldom, and we may take a closer look at this issue. In the meantime, we welcome more clues/information about this problem to help us locate it more quickly.
I met this RAM issue as well when training the 3D detector with DDP on 8 GPUs: it costs more than 500 GB of RAM, which hits my RAM limit, and the DDP processes fail. When training with 4 GPUs and a 4x4 batch size, it costs nearly 400 GB of RAM.
I am not sure what the situation is when launching with Slurm.
I am looking forward to any progress on this issue.
The issue seems to be that the mmengine dataset saves a copy of data_list in RAM for every GPU rank during dataset initialization. A quick patch was to put this data info list in shared memory; however, data time may be affected by this. Looking forward to a better fix.
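A rough way to check how much of this is the serialized annotation list, assuming mmengine's BaseDataset built with serialize_data=True (which stores the pickled data_list in `dataset.data_bytes` as a uint8 array); the helper below is made up for illustration:

```python
# Hypothetical helper: estimate RAM held by the pickled data_list, per rank
# and across N independent DDP workers (i.e. without a shared-memory patch).
def report_serialized_size(dataset, num_ranks=8):
    # `dataset.data_bytes` exists when mmengine's BaseDataset was built with
    # serialize_data=True; it is the pickled data_list as a uint8 numpy array.
    size_gb = dataset.data_bytes.nbytes / 1024 ** 3
    print(f'serialized data_list: {size_gb:.1f} GB per rank, '
          f'~{size_gb * num_ranks:.1f} GB across {num_ranks} ranks')
```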
I have met the same issue too. When I train the model on a server with 700 GB of memory, everything is fine. When I move to a server with 200 GB of memory, it always triggers a kernel OOM kill before finishing epoch 1.
@henryzhengr Hi, could you share your solution code? That would help me a lot. Thank you.
Solution
For a quick solution, just replace the following lines with the code below: https://github.com/OpenRobotLab/EmbodiedScan/blob/67110231a8759009ca822ff3f2b3ed577674903b/embodiedscan/datasets/embodiedscan_dataset.py#L59-L64
Make sure SharedArray is installed in your environment (e.g. `pip install SharedArray`).
Code (the `super().__init__(...)` call goes inside `EmbodiedScanDataset.__init__`, and `share_serialize_data` is added as a method of the same class):

    # Module-level imports assumed by this patch:
    #   import os
    #   import SharedArray
    #   import mmengine.dist                (for mmengine.dist.get_rank / get_world_size)
    #   import torch.distributed as dist    (or: from mmengine import dist)

        # Inside __init__: build the dataset without per-rank serialization,
        # then move the serialized data list into shared memory.
        super().__init__(ann_file=ann_file,
                         metainfo=metainfo,
                         data_root=data_root,
                         pipeline=pipeline,
                         test_mode=test_mode,
                         serialize_data=False,
                         **kwargs)
        self.share_serialize_data()

    def share_serialize_data(self):
        cur_rank, num_gpus = mmengine.dist.get_rank(), mmengine.dist.get_world_size()
        if cur_rank == 0:
            print('Rank 0 initialized the data')
            if os.path.exists('/dev/shm/embodiedscan_data_bytes'):
                # A previous run already placed the serialized list in shared memory.
                self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
                self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
            else:
                # Serialize once on rank 0 and copy the result into shared memory.
                self.data_bytes, self.data_address = self._serialize_data()
                print('Loading training data to shared memory (file limit not set)')
                data_bytes_shm_arr = SharedArray.create(
                    'shm://embodiedscan_data_bytes',
                    self.data_bytes.shape, dtype=self.data_bytes.dtype)
                data_bytes_shm_arr[...] = self.data_bytes[...]
                data_bytes_shm_arr.flags.writeable = False
                data_address_shm_arr = SharedArray.create(
                    'shm://embodiedscan_data_address',
                    self.data_address.shape, dtype=self.data_address.dtype)
                data_address_shm_arr[...] = self.data_address[...]
                data_address_shm_arr.flags.writeable = False
                print('Training data list has been saved to shared memory')
            dist.barrier()
        else:
            # Other ranks wait for rank 0, then attach instead of re-serializing.
            dist.barrier()
            print(f'Reading training data from shm. rank {cur_rank}')
            self.data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
            self.data_address = SharedArray.attach('shm://embodiedscan_data_address')
            print(f'Done reading training data. rank {cur_rank}')
        self.serialize_data = True
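As a quick sanity check (assuming the shared-array names used above), one can verify after rank 0 finishes that the arrays are visible under /dev/shm and attachable:

```python
# Sanity check: the shared arrays created by the patch should show up in
# /dev/shm and be attachable from any process on the same node.
import os
import SharedArray

print([f for f in os.listdir('/dev/shm') if 'embodiedscan' in f])
data_bytes = SharedArray.attach('shm://embodiedscan_data_bytes')
print(f'serialized data_list in shm: {data_bytes.nbytes / 1024 ** 3:.1f} GB')
```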
Advantage
In the original code, the more GPUs you use, the more RAM the program consumes, because every rank holds its own copy of the serialized data list. This patch therefore saves RAM for distributed training on multiple GPUs, but not for single-GPU training.
Disadvantages and points to note
- Cleanup: Make sure to unlink the shared arrays upon program termination to avoid leaking /dev/shm memory (see the cleanup sketch after this list).
- Data-loading bottlenecks: In some cases I've encountered slowdowns in data loading; I'm unsure of the cause yet, but restarting the program resolves it for me.
- Timeout issue: Currently only rank 0 processes the data and transfers it to shared memory, which sometimes raises exceptions when the other ranks wait longer than the distributed timeout. A more efficient approach might be to split the work across all GPUs so that each rank handles and transfers a portion of the data independently. (Or, lazily, just increase the timeout :-) )
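For the cleanup point above, a minimal sketch (assuming the shared-array names used in the patch) that can be run once per node, e.g. on rank 0 after training or registered with atexit:

```python
# Minimal cleanup sketch: unlink the shared arrays so /dev/shm is not leaked
# across runs. Safe to call even if the arrays were never created.
import SharedArray

def cleanup_shared_data():
    for name in ('embodiedscan_data_bytes', 'embodiedscan_data_address'):
        try:
            SharedArray.delete(f'shm://{name}')
        except OSError:
            pass  # already removed or never created on this node

cleanup_shared_data()
```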
Thanks @henryzhengr for providing a workaround. I will close this issue for now and welcome further discussion if there are new observations.