Completely abnormal CPU scheduling behavior

Open Jason-JP-Yang opened this issue 7 months ago • 4 comments

Description of the bug

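At runtime the program runs only on the efficiency cores (E-cores) rather than the performance cores (P-cores) [image], so processing is extremely slow: after more than ten minutes it still had not finished. The CPU is an Intel i7-14650HX.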
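One thing that could be ruled out first is whether the process has been restricted to a subset of logical CPUs. The following is only a minimal sketch and is not part of demo.py: `visible_cpus` is a hypothetical helper, and `psutil` is an assumed third-party dependency (the code falls back to the standard library when it is unavailable).

```python
import os


def visible_cpus():
    """Return the sorted logical CPU indices this process is allowed to run on."""
    try:
        # Assumption: psutil is installed; it is not mentioned in the original report.
        import psutil
        return sorted(psutil.Process().cpu_affinity())
    except (ImportError, AttributeError):
        # psutil is missing, or the platform does not expose cpu_affinity().
        pass
    if hasattr(os, "sched_getaffinity"):
        # Available on Linux, not on Windows.
        return sorted(os.sched_getaffinity(0))
    # Last resort: assume every logical CPU is usable.
    return list(range(os.cpu_count() or 1))


if __name__ == "__main__":
    cpus = visible_cpus()
    print(f"{len(cpus)} of {os.cpu_count()} logical CPUs allowed: {cpus}")
```

A full affinity set would not prove that the performance cores are actually used, since the operating system scheduler can still place threads on the efficiency cores; it would only show that the process is not pinned away from them.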

How to reproduce the bug

Using the demo.py provided in the repository, in a Conda virtual environment with Magic-PDF[Full], run the following from the command line:

(MinerU) PS E:\Jason Yang's Materials\070 Technology - Programming\07007 Python - MinerU> python ./demo/demo.py

demo.py was run as-is, without any modifications:

import os
import json

from loguru import logger

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

import magic_pdf.model as model_config 
model_config.__use_inside_model__ = True

try:
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    demo_name = "demo1"
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    model_path = os.path.join(current_script_dir, f"{demo_name}.json")
    pdf_bytes = open(pdf_path, "rb").read()
    # model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
    model_json = []  # pass an empty list as model_json to parse with the built-in model
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    local_image_dir = os.path.join(current_script_dir, 'images')
    image_dir = str(os.path.basename(local_image_dir))
    image_writer = DiskReaderWriter(local_image_dir)
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    """If no valid model data is passed in, use the built-in model for parsing."""
    if len(model_json) == 0:
        if model_config.__use_inside_model__:
            pipe.pipe_analyze()
        else:
            logger.error("need model list input")
            exit(1)
    pipe.pipe_parse()
    md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
    with open(f"{demo_name}.md", "w", encoding="utf-8") as f:
        f.write(md_content)
except Exception as e:
    logger.exception(e)

magic-pdf.json is as follows:

{
    "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
    },
    "temp-output-dir": "/outputs",
    "models-dir": "./PDF-Extract-Kit/models",
    "device-mode": "cpu"
}
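To confirm which device mode a configuration like the one above selects, a small check such as the following could be used. This is only a sketch: `device_mode` is a hypothetical helper, not a magic-pdf API, and the sample text simply repeats the relevant keys from the configuration shown above.

```python
import json

# Sample text repeating the relevant keys from the configuration above.
SAMPLE_CONFIG = """
{
    "temp-output-dir": "/outputs",
    "models-dir": "./PDF-Extract-Kit/models",
    "device-mode": "cpu"
}
"""


def device_mode(config_text):
    """Return the "device-mode" value from the config text, or None if it is absent."""
    config = json.loads(config_text)
    return config.get("device-mode")


if __name__ == "__main__":
    print(device_mode(SAMPLE_CONFIG))
```

In practice the text would be read from the actual magic-pdf.json file; its location is not stated in this report, so it is left to the caller.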

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

Jason-JP-Yang, Jul 30 '24 20:07