MinerU
Completely abnormal CPU scheduling behavior
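Description of the bug
At runtime the program runs only on the small cores (E-cores) and never on the big cores (P-cores).
As a result, processing is extremely slow: after more than ten minutes the job had still not finished.
The CPU is an Intel i7-14650HX.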
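One way to test whether core placement is really the cause is to restrict the process to the P-core threads before running the pipeline. The following is only a minimal sketch, not part of MinerU: `affinity_mask` and `pin_current_process` are hypothetical helper names, and the assumption that logical processors 0-15 are the P-core threads on this CPU must be verified in Task Manager or a similar tool before relying on it.

```python
import os
import sys


def affinity_mask(logical_cpus):
    """Return a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError(f"invalid logical CPU index: {cpu}")
        mask |= 1 << cpu
    return mask


def pin_current_process(logical_cpus):
    """Restrict the current process to the given logical CPUs.

    Returns True if the affinity was applied, False if this platform
    offers no supported way to set it.
    """
    cpus = sorted(set(logical_cpus))
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()
        return bool(kernel32.SetProcessAffinityMask(handle, affinity_mask(cpus)))
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
        return True
    return False


# ASSUMPTION: logical processors 0-15 are the P-core threads; check this
# on the actual machine before use.
ASSUMED_P_CORE_THREADS = range(16)

# Call this at the top of demo.py, before the pipeline starts:
# pin_current_process(ASSUMED_P_CORE_THREADS)
```

If the run is just as slow with the process pinned this way, the bottleneck is probably somewhere other than core selection.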
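How to reproduce the bug
Using the demo.py provided in the repository, inside a Conda virtual environment with Magic-PDF[Full], run the following from the command line:
(MinerU) PS E:\Jason Yang's Materials\070 Technology - Programming\07007 Python - MinerU> python ./demo/demo.py
demo.py was executed as-is, without any modifications: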
import os
import json

from loguru import logger

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
import magic_pdf.model as model_config

model_config.__use_inside_model__ = True

try:
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    demo_name = "demo1"
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    model_path = os.path.join(current_script_dir, f"{demo_name}.json")
    pdf_bytes = open(pdf_path, "rb").read()
    # model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
    model_json = []  # pass an empty list as model_json to parse with the built-in model
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    local_image_dir = os.path.join(current_script_dir, 'images')
    image_dir = str(os.path.basename(local_image_dir))
    image_writer = DiskReaderWriter(local_image_dir)
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    """If no valid model data is passed in, parse with the built-in model."""
    if len(model_json) == 0:
        if model_config.__use_inside_model__:
            pipe.pipe_analyze()
        else:
            logger.error("need model list input")
            exit(1)
    pipe.pipe_parse()
    md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
    with open(f"{demo_name}.md", "w", encoding="utf-8") as f:
        f.write(md_content)
except Exception as e:
    logger.exception(e)
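To narrow down where the ten-plus minutes go, each pipeline stage in the script above could be wrapped in a timer. This is a sketch only: `timed` and `stage_seconds` are hypothetical names introduced here, and the `time.sleep` call is a stand-in for a real stage such as `pipe.pipe_analyze()`.

```python
import time
from contextlib import contextmanager

# Wall-clock seconds per stage, keyed by stage name.
stage_seconds = {}


@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] = time.perf_counter() - start


# In demo.py this would wrap the real calls, for example:
#     with timed("pipe_analyze"):
#         pipe.pipe_analyze()
with timed("pipe_analyze"):
    time.sleep(0.01)  # placeholder workload for illustration

for stage, seconds in stage_seconds.items():
    print(f"{stage}: {seconds:.3f} s")
```

Posting the per-stage timings along with the report would show whether the slowdown is concentrated in one stage or spread across all of them.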
The contents of magic-pdf.json are as follows:
{
    "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
    },
    "temp-output-dir": "/outputs",
    "models-dir": "./PDF-Extract-Kit/models",
    "device-mode": "cpu"
}
Operating system
Windows
Python version
3.10
Software version (magic-pdf --version)
0.6.x
Device mode
cpu