PaddleOCR
ppstructure returns Chinese text recognition results as Unicode escapes.
🔎 Search before asking
- [X] I have searched the PaddleOCR Docs and found no similar bug report.
- [X] I have searched the PaddleOCR Issues and found no similar bug report.
- [X] I have searched the PaddleOCR Discussions and found no similar bug report.
🐛 Bug (Problem description)
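With ppocr 2.8.1, the Chinese text recognized by ppstructure comes out as Unicode escapes, both in the saved res file and in the printed result.
I searched and found similar reports from a while ago, but none of them were resolved: https://github.com/PaddlePaddle/PaddleOCR/issues/10790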
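For context, here is a minimal sketch of how output like this can arise. It assumes, without confirmation from this report, that the results are serialized with Python's `json.dumps` at its default `ensure_ascii=True`, which writes every non-ASCII character as a `\uXXXX` escape. The `region` dict is a made-up stand-in for one ppstructure result entry, not real output.

```python
import json

# Hypothetical stand-in for one layout region returned by ppstructure.
region = {"type": "text", "res": [{"text": "合同", "confidence": 0.99}]}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(region)

# With ensure_ascii=False the Chinese characters are kept as-is.
readable = json.dumps(region, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same data, so no information is lost.
assert json.loads(escaped) == json.loads(readable) == region
```

If this is what is happening, the escaped form is still valid JSON and only the on-disk representation differs.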
🏃‍♂️ Environment (Runtime environment)
Windows 11, RTX 3080
🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)
import os
import cv2
from paddleocr import PPStructure, save_structure_res
from paddle.utils import try_import
import numpy as np
from PIL import Image
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# Chinese test image
table_engine = PPStructure(recovery=True)
# English test image (use this line instead of the one above)
# table_engine = PPStructure(recovery=True, lang='en')

# Raw strings keep the Windows backslashes from being read as escapes
save_folder = r'E:\Workspace\GitHub\PaddleOCR-2.8.1\output'
img_path = r'E:\合同.pdf'

# Render every PDF page to a BGR image
fitz = try_import("fitz")
imgs = []
with fitz.open(img_path) as pdf:
    for pg in range(0, pdf.page_count):
        page = pdf[pg]
        mat = fitz.Matrix(2, 2)
        pm = page.get_pixmap(matrix=mat, alpha=False)
        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)

# Run layout analysis on each page, then save and print the results
for index, img in enumerate(imgs):
    result = table_engine(img)
    save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0], index)
    for line in result:
        line.pop('img')
        print(line)
    h, w, _ = img.shape
    res = sorted_layout_boxes(result, w)
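As a possible workaround for a res file that already contains `\uXXXX` escapes, the text can be recovered by parsing each line and writing it back out. This is only a sketch: it assumes each line of the file is one JSON object, which this report does not confirm, and `decode_res_lines` is a hypothetical helper, not part of PaddleOCR.

```python
import json


def decode_res_lines(lines):
    """Re-serialize JSON lines so non-ASCII text is human-readable."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # json.loads turns \uXXXX escapes back into real characters;
        # ensure_ascii=False keeps them that way when dumping again.
        out.append(json.dumps(json.loads(line), ensure_ascii=False))
    return out


# Made-up sample line in the assumed file format.
sample = ['{"type": "text", "res": [{"text": "\\u5408\\u540c"}]}\n']
decoded = decode_res_lines(sample)
print(decoded[0])
```

To apply it to a real file, read the lines with `open(path, encoding="utf-8")` and write the returned list back out using the same encoding.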