mindocr

Issue on running distributed training

Open ThomasLimWZ opened this issue 1 year ago • 3 comments

Hi, I am unable to run distributed training on GPU with this command: mpirun --allow-run-as-root -n 2 python tools/train.py --config configs/det/dbnet/db_r50_icdar15.yaml. I know the problem lies with OpenMPI, but my PC runs Windows and, as far as I understand, OpenMPI is no longer supported on Windows. Do you have any advice on how to solve this?

ThomasLimWZ, Apr 02 '24 12:04

@ThomasLimWZ Hello, thanks for your feedback. As far as I know, MindSpore's support for Windows is incomplete. Please consider switching to Linux.

As for running distributed training tasks, you can try the dynamic cluster startup method. MindSpore provides three distributed parallel startup methods (refer to Distributed Parallel Startup Methods), two of which support GPU.
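For illustration, here is a rough Python sketch of what a dynamic cluster launch involves: one scheduler process plus N worker processes, each configured through environment variables. The variable names (MS_WORKER_NUM, MS_SCHED_HOST, MS_SCHED_PORT, MS_ROLE, MS_NODE_ID) are written from memory of the MindSpore dynamic cluster documentation, and the helper name, host, and port are hypothetical, so please check everything against the docs for your MindSpore version before relying on it.

```python
def dynamic_cluster_envs(worker_num, sched_host="127.0.0.1", sched_port=8118):
    """Build one environment dict per process: a scheduler plus `worker_num` workers.

    The variable names are an assumption taken from the MindSpore dynamic
    cluster documentation; the host and port defaults are placeholders.
    """
    common = {
        "MS_WORKER_NUM": str(worker_num),  # total number of worker processes
        "MS_SCHED_HOST": sched_host,       # address of the scheduler
        "MS_SCHED_PORT": str(sched_port),  # port the scheduler listens on
    }
    envs = [dict(common, MS_ROLE="MS_SCHED")]  # the single scheduler process
    for node_id in range(worker_num):
        envs.append(dict(common, MS_ROLE="MS_WORKER", MS_NODE_ID=str(node_id)))
    return envs


# Training command taken from the original question, without the mpirun prefix.
TRAIN_CMD = ["python", "tools/train.py",
             "--config", "configs/det/dbnet/db_r50_icdar15.yaml"]

for env in dynamic_cluster_envs(worker_num=2):
    # To actually launch, each process would be started with something like
    # subprocess.Popen(TRAIN_CMD, env={**os.environ, **env}).
    print(env["MS_ROLE"], env.get("MS_NODE_ID", "-"))
```

With worker_num=2 this describes three processes in total (one scheduler and two workers), and none of them needs mpirun.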

panshaowu, Apr 03 '24 01:04

Hi, I tried running this repository under Windows Subsystem for Linux, which resolved the OpenMPI issue. However, I am still facing a problem with both standalone and distributed training. Both fail with this error message: [mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] CalMemBlockAllocSize] Memory not enough: current free memory size[0] is smaller than required size[262144000].

May I know the minimum hardware requirements for mindocr? FYI, I have 24GB of RAM and my GPU is only an Nvidia 3050Ti.
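To put the numbers in that error message into perspective, a small Python sketch can pull the two byte counts out of the log line and convert them to MiB. The helper name parse_memory_error is hypothetical, and the regular expression is fitted only to the message quoted above, not to MindSpore log formats in general.

```python
import re

# The log line quoted above, copied verbatim.
LOG_LINE = (
    "[mindspore/ccsrc/backend/common/mem_reuse/mem_dynamic_allocator.cc:303] "
    "CalMemBlockAllocSize] Memory not enough: current free memory size[0] "
    "is smaller than required size[262144000]."
)


def parse_memory_error(line):
    """Return (free_bytes, required_bytes), or None if the line does not match."""
    match = re.search(r"free memory size\[(\d+)\].*?required size\[(\d+)\]", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


free_bytes, required_bytes = parse_memory_error(LOG_LINE)
mib = 2 ** 20  # bytes per MiB
print(f"free: {free_bytes / mib:.0f} MiB, required: {required_bytes / mib:.0f} MiB")
# prints "free: 0 MiB, required: 250 MiB"
```

So the failing request was for 250 MiB while the allocator reported no free memory at all, which suggests the device memory was already used up by the time this block was requested, rather than this single allocation being unusually large.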

ThomasLimWZ, Apr 04 '24 14:04

@ThomasLimWZ As far as I know, MindSpore currently has no API to report the RAM or graphics memory a training job requires. However, I am afraid that the 4GB of graphics memory on the 3050Ti GPU may be insufficient for training DBNet ResNet-50 with the default configuration. You can try reducing the values of train.loader.batch_size and train.loader.num_workers in configs/det/dbnet/db_r50_icdar15.yaml. You can also try switching to DBNet ResNet-18.
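As a minimal sketch of that change, the Python snippet below lowers the two loader settings on a nested dict that mirrors the train.loader.batch_size and train.loader.num_workers keys. The helper name shrink_loader and all numeric values are placeholders for illustration and are not the real defaults from db_r50_icdar15.yaml; in practice you would simply edit the same two keys in the YAML file.

```python
import copy


def shrink_loader(config, batch_size, num_workers):
    """Return a copy of `config` with smaller train.loader settings."""
    cfg = copy.deepcopy(config)  # leave the caller's config untouched
    loader = cfg["train"]["loader"]
    loader["batch_size"] = batch_size
    loader["num_workers"] = num_workers
    return cfg


# Placeholder values, not the actual defaults from the YAML file.
example_config = {"train": {"loader": {"batch_size": 16, "num_workers": 8}}}

smaller = shrink_loader(example_config, batch_size=2, num_workers=1)
print(smaller["train"]["loader"])
```

If training starts successfully with a very small batch size, the value can be raised step by step until the memory error reappears.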

panshaowu, Apr 09 '24 10:04