PaddleOCR: out of memory during evaluation when fine-tuning ch_PP-OCRv4_det_server_train

Open ly03240921 opened this issue 6 months ago • 7 comments

🔎 Search before asking

[X] I have searched the PaddleOCR Docs and found no similar bug report.
[X] I have searched the PaddleOCR Issues and found no similar bug report.
[X] I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem Description)

[2024/08/27 19:14:23] ppocr INFO: epoch: [5/500], global_step: 10, lr: 0.001000, loss: 2.168079, loss_shrink_maps: 1.022120, loss_threshold_maps: 0.760488, loss_binary_maps: 0.204714, loss_cbn: 0.204714, avg_reader_cost: 0.03694 s, avg_batch_cost: 0.04500 s, avg_samples: 0.12, ips: 2.66682 samples/s, eta: 0:41:51, max_mem_reserved: 13909 MB, max_mem_allocated: 11894 MB
eval model:: 0%| | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 257, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "/app/ocr/PaddleOCR-release-2.8/tools/train.py", line 209, in main
    program.train(
  File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 452, in train
    cur_metric = eval(
  File "/app/ocr/PaddleOCR-release-2.8/tools/program.py", line 622, in eval
    preds = model(images)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/architectures/base_model.py", line 99, in forward
    x = self.head(x, targets=data)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 145, in forward
    cbn_maps = self.cbn_layer(self.up_conv(f), shrink_maps, None)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/heads/det_db_head.py", line 127, in forward
    out = self.last_1(self.last_3(outf))
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/app/ocr/PaddleOCR-release-2.8/ppocr/modeling/backbones/det_mobilenet_v3.py", line 186, in forward
    x = self.conv(x)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/layers.py", line 1429, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/layer/conv.py", line 715, in forward
    out = F.conv._conv_nd(
  File "/home/anaconda3/envs/pd-ocr/lib/python3.10/site-packages/paddle/nn/functional/conv.py", line 128, in _conv_nd
    pre_bias = _C_ops.conv2d(
MemoryError:

C++ Traceback (most recent call last):

0 paddle::pybind::eager_api_conv2d(_object*, _object*, _object*)
1 conv2d_ad_func(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::string, std::vector<int, std::allocator<int> >, int, std::string)
2 paddle::experimental::conv2d(paddle::Tensor const&, paddle::Tensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&)
3 void phi::ConvCudnnKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::string const&, std::vector<int, std::allocator<int> > const&, int, std::string const&, phi::DenseTensor*)
4 float* phi::DeviceContext::Alloc(phi::TensorBase*, unsigned long, bool) const
5 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
15 phi::enforce::GetCurrentTraceBackString[abi:cxx11]

Error Message Summary:

ResourceExhaustedError:

Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, 13.315369GB memory has been allocated and available memory is only 2.386902GB.

Please check whether there is any other process using GPU 1.

If yes, please stop them, or start PaddlePaddle on another GPU.
If no, please decrease the batch size of your model. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:86)
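
The figures in the error summary can be checked with a little arithmetic. The sketch below is only an illustration: the regular expression is an assumption fitted to the single message quoted in this report, not a documented Paddle message format, and `parse_oom_message` is a hypothetical helper name. It extracts the three sizes and reports how far the failed request exceeds the memory still available.

```python
import re

# Error text copied verbatim from the report above.
ERROR_TEXT = (
    "Out of memory error on GPU 1. Cannot allocate 3.158203GB memory on GPU 1, "
    "13.315369GB memory has been allocated and available memory is only 2.386902GB."
)


def parse_oom_message(text):
    """Extract the GPU index and the requested/allocated/available sizes in GB.

    Hypothetical helper: the pattern is fitted to the one message in this
    report and may not match messages from other Paddle versions.
    """
    pattern = (
        r"Cannot allocate (?P<requested>[\d.]+)GB memory on GPU (?P<gpu>\d+), "
        r"(?P<allocated>[\d.]+)GB memory has been allocated and "
        r"available memory is only (?P<available>[\d.]+)GB"
    )
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("text does not look like the OOM message in this report")
    return {
        "gpu": int(match["gpu"]),
        "requested_gb": float(match["requested"]),
        "allocated_gb": float(match["allocated"]),
        "available_gb": float(match["available"]),
    }


info = parse_oom_message(ERROR_TEXT)
# How much more memory the single failed allocation needed than was free.
shortfall_gb = info["requested_gb"] - info["available_gb"]
# Allocated plus available, i.e. the memory accounted for in the message.
accounted_gb = info["allocated_gb"] + info["available_gb"]
print(f"GPU {info['gpu']}: request exceeds free memory by {shortfall_gb:.3f} GB")
print(f"allocated + available = {accounted_gb:.3f} GB")
```

On the numbers in this report, the single `conv2d` allocation during evaluation asks for roughly 0.77 GB more than is free, with about 15.7 GB accounted for in total.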

🏃‍♂️ Environment

PaddlePaddle-gpu: 2.6
PaddleOCR: 2.8
RAM: 16G

🌰 Minimal Reproducible Example

python tools/train.py -c configs/det/ch_PP-OCRv4/ch_PP-OCRv4_det_teacher.yml

Aug 27 '24 12:08 ly03240921
