MinerU: page-level parallelism and modular processing

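Judging from MinerU's underlying code, each PDF page appears to be an independent processing unit: pages are handled one after another in a simple for-loop, and there is no step that merges blocks across pages.

Is there any plan to add a parallel processing mechanism, so that after the document is split into pages, different page objects are processed concurrently according to available resources and the results are finally stitched back together by page_index?

In theory this is feasible, but I looked at how the models are loaded and called, and whether you use coroutines, multithreading, or multiprocessing, calling the paddle model runs into problems.

As things stand, even with only Layout + OCR enabled, processing the pages sequentially is still far too slow. (I actually tried to refactor the project architecture, but everything feels coupled together. Many parameters and module objects should simply be imported where they are used, rather than initialized externally and passed down layer by layer. They reach all the way into the lowest-level code and almost every module uses them, which makes it nearly impossible to split the pipeline into a parallel structure.)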
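
The page-level idea described above can be sketched roughly as follows. This is only a minimal illustration and not MinerU's code: `process_page` is a hypothetical stand-in for the per-page Layout + OCR step, and a thread pool is used purely for demonstration. Sharing the real paddle model across workers is exactly the part reported as problematic above, and this sketch does not solve it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_page(page_index):
    # Hypothetical placeholder: a real version would run layout + OCR on one page.
    return {"page_index": page_index, "blocks": [f"block-from-page-{page_index}"]}


def process_document(num_pages, max_workers=4):
    """Process pages concurrently, then restore the original page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_page, i) for i in range(num_pages)]
        # Results arrive in completion order, which may differ from page order.
        results = [future.result() for future in as_completed(futures)]
    # Stitch the pages back together by page_index.
    return sorted(results, key=lambda r: r["page_index"])


if __name__ == "__main__":
    for page in process_document(5):
        print(page["page_index"], page["blocks"])
```

`ProcessPoolExecutor` exposes the same `submit` interface, so the executor could be swapped if each worker process loads its own copy of the models, at the cost of extra memory.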

Sep 08 '24 13:09 QIN2DIM

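I then ran into some issues related to GPU utilization.

I ran my modified MinerU-server on several machines: an A100 server, and a PC fitted with a 4080S and a 1050Ti. Unexpectedly, the 4080S machine was several times faster than the A100 dev machine.

At first I thought the A100 run was falling back to the CPU, but after checking carefully I confirmed that CUDA works inside the container. The A100 really is using the GPU.

I'll open a PR with the modified server shortly so you can take a look.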

Sep 09 '24 09:09 QIN2DIM

Is it possible that the A100 was being used by other people during the test, so MinerU was not running with exclusive access to it?

Sep 09 '24 09:09 myhloli

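@myhloli No, I'm the only user of this physical machine. I watched utilization with nvitop: the server loads all models into memory at initialization, and GPU usage never reached 16% of a single card. I haven't tested concurrent requests yet.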

Sep 09 '24 09:09 QIN2DIM

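@myhloli There's a somewhat mysterious question I've been meaning to ask: roughly how fast does MinerU's current online demo process PDF documents?

On my local PC with a 4080 Super I measured an average of 2.99 page/s, i.e. 2.99 pages processed per second. But when the same code is built into a container and run on a Linux dev machine, the speed is far below that, whether on a single V100 or a single A100. To control variables, I also ran the magic-pdf CLI directly on the same file, and the conclusion was the same.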

Win32 package	Version
torch	2.3.1+cu118
torchvision	0.18.1+cu118
magic-pdf	0.7.1
paddlepaddle-gpu	2.6.1
paddleocr	2.7.3

Sep 10 '24 09:09 QIN2DIM

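Found the problem: the device environment variable or config does not reach ModifiedPaddleOCR(). But even after passing the argument explicitly it still doesn't work: the use_gpu attribute in Namespace() is False.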

λ c1df4be30d31 /app python
Python 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from paddleocr import PaddleOCR
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script

>>> 
>>> ocr = PaddleOCR(use_gpu=True)
[2024/09/10 10:06:50] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, gpu_id=0, image_dir=None, page_num=0, det_algorithm='DB', det_model_dir='/root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/root/.paddleocr/whl/rec/ch/ch_PP-OCRv4_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_length=25, rec_char_dict_path='/usr/local/lib/python3.10/dist-packages/paddleocr/ppocr/utils/ppocr_keys_v1.txt', use_space_char=True, vis_font_path='./doc/fonts/simfang.ttf', drop_score=0.5, e2e_algorithm='PGNet', e2e_model_dir=None, e2e_limit_side_len=768, e2e_limit_type='max', e2e_pgnet_score_thresh=0.5, e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_pgnet_valid_set='totaltext', e2e_pgnet_mode='fast', use_angle_cls=False, cls_model_dir='/root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_image_shape='3, 48, 192', label_list=['0', '180'], cls_batch_num=6, cls_thresh=0.9, enable_mkldnn=False, cpu_threads=10, use_pdserving=False, warmup=False, sr_model_dir=None, sr_image_shape='3, 32, 128', sr_batch_num=1, draw_img_save_dir='./inference_results', save_crop_res=False, crop_res_save_dir='./output', use_mp=False, total_process_num=1, process_id=0, benchmark=False, save_log_path='./log_output/', show_log=True, use_onnx=False, output='./output', table_max_len=488, table_algorithm='TableAttn', 
table_model_dir=None, merge_no_span_structure=True, table_char_dict_path=None, layout_model_dir=None, layout_dict_path=None, layout_score_threshold=0.5, layout_nms_threshold=0.5, kie_algorithm='LayoutXLM', ser_model_dir=None, re_model_dir=None, use_visual_backbone=True, ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ocr_order_method=None, mode='structure', image_orientation=False, layout=True, table=True, ocr=True, recovery=False, use_pdf2docx_api=False, invert=False, binarize=False, alphacolor=(255, 255, 255), lang='ch', det=True, rec=True, type='ocr', ocr_version='PP-OCRv4', structure_version='PP-StructureV2')
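
One thing worth checking at this point is whether the installed paddle build was compiled with CUDA at all, since a CPU-only build would be one possible reason for a GPU request not taking effect. That is an assumption here, not something the log above proves. The sketch below only reports what paddle itself says; the function name `paddle_gpu_report` is made up for illustration.

```python
def paddle_gpu_report():
    """Return what the installed paddle build reports about GPU support."""
    try:
        import paddle
    except ImportError:
        # paddle is not installed in this interpreter at all.
        return {"paddle_installed": False}
    return {
        "paddle_installed": True,
        "version": paddle.__version__,
        # False means the wheel is a CPU-only build.
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        # Typically "cpu" or something like "gpu:0".
        "current_device": paddle.device.get_device(),
    }


if __name__ == "__main__":
    print(paddle_gpu_report())
```

Running this inside the same container as the server shows which paddle build the interpreter actually imports.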

Sep 10 '24 10:09 QIN2DIM

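Normal speed is around 1 s/page. It depends on formula complexity and whether OCR is needed, so the per-page time varies. With few formulas and no OCR, two to three pages per second is about right.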

Sep 10 '24 10:09 myhloli

You can uninstall both paddlepaddle and paddlepaddle-gpu from the environment, then install paddlepaddle-gpu version 3.0.0b1.

Sep 10 '24 10:09 myhloli

Time taken to parse smallocr on a single L4:

2024-09-10 12:22:02.167 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.11

0: 1888x1312 (no detections), 98.9ms
Speed: 11.0ms preprocess, 98.9ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:02.279 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
2024-09-10 12:22:03.457 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 1.17
2024-09-10 12:22:04.663 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.21

0: 1888x1312 4 embeddings, 62.8ms
Speed: 9.9ms preprocess, 62.8ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:05.376 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 4, mfr time: 0.6
2024-09-10 12:22:05.843 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.46
2024-09-10 12:22:06.907 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.06

0: 1888x1312 (no detections), 63.1ms
Speed: 9.9ms preprocess, 63.1ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:06.981 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
2024-09-10 12:22:07.411 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.42
2024-09-10 12:22:08.507 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.1

0: 1888x1312 (no detections), 63.0ms
Speed: 9.9ms preprocess, 63.0ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:08.582 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
2024-09-10 12:22:08.997 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.4
2024-09-10 12:22:10.035 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.04

0: 1888x1312 (no detections), 63.0ms
Speed: 11.2ms preprocess, 63.0ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:10.111 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
2024-09-10 12:22:10.560 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.44
2024-09-10 12:22:11.705 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.14

0: 1888x1312 (no detections), 63.0ms
Speed: 10.3ms preprocess, 63.0ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:11.780 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
2024-09-10 12:22:12.169 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.38
2024-09-10 12:22:13.290 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.12

0: 1888x1312 3 embeddings, 63.1ms
Speed: 9.9ms preprocess, 63.1ms inference, 1.0ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:13.737 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 3, mfr time: 0.34
2024-09-10 12:22:14.153 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.41
2024-09-10 12:22:15.212 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 1.06

0: 1888x1312 (no detections), 63.1ms
Speed: 9.8ms preprocess, 63.1ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
2024-09-10 12:22:15.287 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
2024-09-10 12:22:15.719 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.42
2024-09-10 12:22:15.719 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:136 - doc analyze cost: 14.665070295333862

Time taken on a single A100:

[2024-09-10 18:23:36] 
[2024-09-10 18:23:36] 0: 1888x1312 (no detections), 19.9ms
[2024-09-10 18:23:36] Speed: 12.2ms preprocess, 19.9ms inference, 0.4ms postprocess per image at shape (1, 3, 1888, 1312)
[2024-09-10 18:23:36] 2024-09-10 18:23:36.416 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
[2024-09-10 18:23:36] 2024-09-10 18:23:36.774 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.34
[2024-09-10 18:23:37] 2024-09-10 18:23:37.723 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.35
[2024-09-10 18:23:37] 2024-09-10 18:23:37.320 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 0.55
[2024-09-10 18:23:37] 
[2024-09-10 18:23:37] 0: 1888x1312 (no detections), 19.9ms
[2024-09-10 18:23:37] Speed: 12.6ms preprocess, 19.9ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
[2024-09-10 18:23:37] 2024-09-10 18:23:37.354 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
[2024-09-10 18:23:38] 2024-09-10 18:23:38.316 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 0.59
[2024-09-10 18:23:38] 
[2024-09-10 18:23:38] 0: 1888x1312 (no detections), 19.9ms
[2024-09-10 18:23:38] Speed: 13.1ms preprocess, 19.9ms inference, 0.4ms postprocess per image at shape (1, 3, 1888, 1312)
[2024-09-10 18:23:38] 2024-09-10 18:23:38.350 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
[2024-09-10 18:23:38] 2024-09-10 18:23:38.710 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.34
[2024-09-10 18:23:39] 2024-09-10 18:23:39.293 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 0.58
[2024-09-10 18:23:39] 
[2024-09-10 18:23:39] 0: 1888x1312 3 embeddings, 19.9ms
[2024-09-10 18:23:39] Speed: 12.2ms preprocess, 19.9ms inference, 0.9ms postprocess per image at shape (1, 3, 1888, 1312)
[2024-09-10 18:23:39] 2024-09-10 18:23:39.677 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 3, mfr time: 0.3
[2024-09-10 18:23:40] 2024-09-10 18:23:40.036 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.34
[2024-09-10 18:23:40] 2024-09-10 18:23:40.594 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:245 - layout detection cost: 0.56
[2024-09-10 18:23:40] 
[2024-09-10 18:23:40] 0: 1888x1312 (no detections), 19.9ms
[2024-09-10 18:23:40] Speed: 12.4ms preprocess, 19.9ms inference, 0.5ms postprocess per image at shape (1, 3, 1888, 1312)
[2024-09-10 18:23:40] 2024-09-10 18:23:40.628 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:275 - formula nums: 0, mfr time: 0.0
[2024-09-10 18:23:41] 2024-09-10 18:23:41.001 | INFO     | magic_pdf.model.pdf_extract_kit:__call__:358 - ocr cost: 0.36
[2024-09-10 18:23:41] 2024-09-10 18:23:41.001 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:136 - doc analyze cost: 9.02635908126831
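
To compare runs like the two above stage by stage, the per-stage costs can be summed from the log text. The sketch below is a minimal example that assumes the log format shown in this thread (`layout detection cost: X`, `mfr time: X`, `ocr cost: X`); the function name `sum_stage_costs` is hypothetical.

```python
import re
from collections import defaultdict

# Matches the three per-page stage timings that appear in the logs above.
STAGE_PATTERN = re.compile(
    r"(layout detection cost|mfr time|ocr cost): (\d+(?:\.\d+)?)"
)


def sum_stage_costs(log_text):
    """Sum the reported seconds for each stage over all pages in the log."""
    totals = defaultdict(float)
    for stage, seconds in STAGE_PATTERN.findall(log_text):
        totals[stage] += float(seconds)
    return dict(totals)


if __name__ == "__main__":
    sample = (
        "... __call__:245 - layout detection cost: 1.11\n"
        "... __call__:275 - formula nums: 0, mfr time: 0.0\n"
        "... __call__:358 - ocr cost: 1.17\n"
    )
    print(sum_stage_costs(sample))
```

Comparing these totals with the final `doc analyze cost` line gives a rough idea of how much time falls outside the three logged stages.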

Sep 10 '24 10:09 myhloli

@myhloli

This is the 23-page PDF used in the tests above:

AI供应链-英伟达Blackwell系列的最新动态.pdf

Run configuration that gave an average of 2.99 page/s on the 4080S:

2024-09-10 17:26:33.117 | INFO     | methods:__init__:170 - self.apply_ocr=True self.apply_layout=True self.apply_table=False self.apply_formula=False

Win32 package	Version
torch	2.3.1+cu118
torchvision	0.18.1+cu118
magic-pdf	0.7.1
paddlepaddle-gpu	2.6.1
paddleocr	2.7.3
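
For reference, 23 pages at 2.99 page/s corresponds to roughly 23 / 2.99 ≈ 7.7 s for the whole file. A figure like this can be reproduced with a small timing helper such as the sketch below. The helper `pages_per_second` is hypothetical, and `run` stands for whatever callable processes the document (for example a wrapper around the server's handler); neither is part of magic-pdf.

```python
import time


def pages_per_second(run, num_pages):
    """Time one call to run() and return the average throughput in pages per second."""
    start = time.perf_counter()
    run()
    elapsed = time.perf_counter() - start
    return num_pages / elapsed


if __name__ == "__main__":
    # Placeholder workload; replace with the real document-processing call.
    rate = pages_per_second(lambda: time.sleep(0.1), num_pages=23)
    print(f"avg {rate:.2f} page/s")
```

Model loading should happen before `run` is timed, otherwise the first measurement mostly reflects initialization rather than per-page speed.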

I'll redo the server setup shortly.

Sep 10 '24 12:09 QIN2DIM

@myhloli It's rather mysterious. The image I build is based on the paddle image, which already ships with paddlepaddle and paddlepaddle-gpu, so I had overlooked this issue all along.

FROM registry.baidubce.com/paddlepaddle/paddle:3.0.0b1-gpu-cuda11.8-cudnn8.6-trt8.5

https://www.paddlepaddle.org.cn/

Then I found a clean environment and ran only

pip install --dry-run magic-pdf[full]==0.7.1 --extra-index-url https://wheels.myhloli.com

and found that the dependency tree installs paddlepaddle by default rather than paddlepaddle-gpu.
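
A quick way to see which paddle distributions ended up in an environment is to ask the package metadata. The sketch below is a minimal example using only the standard library; the function name `installed_paddle_distributions` is hypothetical. Both distributions are understood to provide the same importable `paddle` package, so seeing both listed suggests that one may have overwritten files of the other.

```python
from importlib import metadata


def installed_paddle_distributions(names=("paddlepaddle", "paddlepaddle-gpu")):
    """Return {distribution name: version} for each listed distribution that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed; skip it.
            continue
    return found


if __name__ == "__main__":
    print(installed_paddle_distributions())
```

Running it once in the base image and once after the pip install step of the build would show whether the install changed the set of paddle distributions.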

Sep 11 '24 01:09 QIN2DIM

@myhloli It's rather mysterious. The image I build is based on the paddle image, which already ships with paddlepaddle and paddlepaddle-gpu, so I had overlooked this issue all along.

docker pull registry.baidubce.com/paddlepaddle/paddle:3.0.0b1-gpu-cuda11.8-cudnn8.6-trt8.5

https://www.paddlepaddle.org.cn/

Sep 11 '24 01:09 QIN2DIM

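The paddle framework is sensitive to installation order. If you follow the tutorial and install the CPU version that comes with the dependencies first, then install the GPU version, GPU acceleration works normally. If you install the GPU version first and then install mineru, it gets overwritten by the CPU paddle.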

Sep 11 '24 04:09 myhloli

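So this is the interesting part. The base image contains both the CPU and GPU versions, and if I open the base image directly, the GPU is usable. But the build runs pip install mineru, and I didn't expect that to overwrite it, haha.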

Sep 11 '24 11:09 QIN2DIM

Version 1.3.0 has removed the paddle framework and uses file-level and page-level batch processing.

Apr 04 '25 04:04 myhloli
