MinerU
The model repeatedly initializes when processing multiple PDFs in a single process, and it does not implement a singleton pattern.
Description of the bug
When multiple PDFs are processed in a single process, the model is re-initialized for each PDF. Model loading does not follow a singleton pattern, so an already-loaded model is not reused across PDFs.
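To illustrate the behavior being asked for, here is a minimal sketch of a process-wide model cache. It is not MinerU's implementation: `ModelCache`, `get_model`, and `fake_load_model` are hypothetical names, and the loader is a stand-in for whatever expensive model construction the library performs.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class ModelCache:
    """Hypothetical process-wide cache: one model instance per key."""

    _instance = None
    _lock = threading.RLock()

    def __new__(cls):
        # Always return the same cache object within a process.
        with cls._lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst._models: Dict[Hashable, Any] = {}
                cls._instance = inst
        return cls._instance

    def get_model(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        # Build the model only the first time a given key is requested.
        with self._lock:
            if key not in self._models:
                self._models[key] = factory()
            return self._models[key]


def fake_load_model() -> object:
    # Stand-in for an expensive model load; counts how often it runs.
    fake_load_model.calls += 1
    return object()


fake_load_model.calls = 0

# Simulate processing three PDFs in one process.
models = [
    ModelCache().get_model(("layout", "cuda"), fake_load_model)
    for _ in range(3)
]
print(fake_load_model.calls)  # 1
```

With a cache like this, the loader runs once per process and configuration key, no matter how many PDFs are processed. The key (here a hypothetical `("layout", "cuda")` tuple) would need to cover every setting that changes which model is built.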
How to reproduce the bug
import json
import os
from datetime import datetime

import fitz  # PyMuPDF
from magic_pdf.pipe.UNIPipe import UNIPipe

# logger, magic_pdf_version, get_s3_cli_from_pool, get_pdf_bytes, __set_extra_info,
# all_input_jsonl_lines, temp_json_save_path and total_pdfs are defined elsewhere
# in the script and are not shown here.

for ii, pdf_info in enumerate(all_input_jsonl_lines):  # the slice of input assigned to this GPU
    track_id = pdf_info['track_id']
    # Temporary local JSON file for one book
    temp_json_save_file = os.path.join(temp_json_save_path, f"{track_id}.json")
    # Skip PDFs whose result already exists locally
    if os.path.exists(temp_json_save_file):
        logger.info(f"{temp_json_save_file} already exists, skip.")
        continue
    s3_pdf_path = pdf_info['path']
    s3_pdf_client = get_s3_cli_from_pool(s3_pdf_path)
    # Read the PDF file into memory
    pdf_bytes = get_pdf_bytes(s3_pdf_path, s3_pdf_client)
    # A new UNIPipe is created for every PDF
    magicpdf = UNIPipe(pdf_bytes, {"_pdf_type": "", "model_list": []}, image_writer=None)
    # Get the page count with fitz
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page_count = doc.page_count
    doc.close()
    extract_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    try:
        magicpdf.pipe_classify()
        magicpdf.pipe_analyze()
        doc_layout_result = magicpdf.model_list
        pdf_info["doc_layout_result"] = doc_layout_result
    except Exception as e:
        logger.exception(e)
        err_info = str(e)
        __set_extra_info(pdf_info, "__error", err_info)
    __set_extra_info(pdf_info, "__inference_datetime", extract_time)
    __set_extra_info(pdf_info, "__mineru_inference_version", magic_pdf_version.__version__)
    # outputs.append(pdf_info)
    logger.info(f"processed {ii}/{total_pdfs} pdfs")
    # Save this PDF's result to a local file. Once every GPU has finished its
    # whole JSON shard, the results are uploaded to Ceph in one go.
    with open(temp_json_save_file, 'w') as ff:
        ff.write(json.dumps(pdf_info, ensure_ascii=False))
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.7.x
Device mode
cuda