MinerU table recognition error

Description of the bug

An error is raised after enabling table recognition:

Traceback (most recent call last):

  File "D:\wzh\MinerU-master\demo\magic_pdf_parse_main.py", line 136, in <module>
    pdf_parse_main(pdf_path)
    │              └ 'D:/wzh/1.pdf'
    └ <function pdf_parse_main at 0x000001A5C6424310>

  File "D:\wzh\MinerU-master\demo\magic_pdf_parse_main.py", line 121, in pdf_parse_main
    content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none")
                   │    │                  └ 'images'
                   │    └ <function UNIPipe.pipe_mk_uni_format at 0x000001A5FD019EA0>
                   └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001A5FCFFD390>

  File "D:\wzh\MinerU-master\magic_pdf\pipe\UNIPipe.py", line 42, in pipe_mk_uni_format
    result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
                                        │                └ 'none'
                                        └ 'images'

  File "D:\wzh\MinerU-master\magic_pdf\pipe\AbsPipe.py", line 51, in pipe_mk_uni_format
    content_list = AbsPipe.mk_uni_format(self.get_compress_pdf_mid_data(), img_parent_path, drop_mode)
                   │       │             │    │                            │                └ 'none'
                   │       │             │    │                            └ 'images'
                   │       │             │    └ <function AbsPipe.get_compress_pdf_mid_data at 0x000001A5E2D21510>
                   │       │             └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001A5FCFFD390>
                   │       └ <staticmethod(<function AbsPipe.mk_uni_format at 0x000001A5E2D21900>)>
                   └ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>

  File "D:\wzh\MinerU-master\magic_pdf\pipe\AbsPipe.py", line 94, in mk_uni_format
    content_list = union_make(pdf_info_list, MakeMode.STANDARD_FORMAT, drop_mode, img_buket_path)
                   │          │              │        │                │          └ 'images'
                   │          │              │        │                └ 'none'
                   │          │              │        └ 'standard_format'
                   │          │              └ <class 'magic_pdf.libs.MakeContentConfig.MakeMode'>
                   │          └ [{'preproc_blocks': [{'type': 'title', 'bbox': [170, 131, 373, 155], 'lines': [{'bbox': [171.60202026367188, 134.119171142578...
                   └ <function union_make at 0x000001A5E088D000>

  File "D:\wzh\MinerU-master\magic_pdf\dict2md\ocr_mkcontent.py", line 371, in union_make
    para_content = para_to_standard_format_v2(para_block, img_buket_path, page_idx)
                   │                          │           │               └ 0
                   │                          │           └ 'images'
                   │                          └ {'type': 'table', 'bbox': [55, 500, 487, 551], 'blocks': [{'bbox': [55, 515, 487, 551], 'type': 'table_body', 'lines': [{'bbo...
                   └ <function para_to_standard_format_v2 at 0x000001A5E088CDC0>

  File "D:\wzh\MinerU-master\magic_pdf\dict2md\ocr_mkcontent.py", line 258, in para_to_standard_format_v2
    para_content['table_body'] = f"\n\n$\n {block['lines'][0]['spans'][0]['content']}\n$\n\n"
    └ {'type': 'table', 'page_idx': 0}

KeyError: 'content'

1.pdf
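The last frame of the traceback indexes `block['lines'][0]['spans'][0]['content']` directly, so any table span without a `content` key raises the `KeyError`. Below is a minimal sketch of a guarded lookup, under stated assumptions: the helper names `first_span_content` and `format_table_body` are hypothetical and are not part of magic-pdf, the `image_path` key in the usage note is made up for illustration, and this is not necessarily how the project fixed the bug. Only the f-string layout is copied from the traceback.

```python
def first_span_content(block):
    """Return the 'content' of the first span of the first line, or None.

    Hypothetical helper: it assumes only the nesting visible in the
    traceback (block -> 'lines' -> 'spans' -> 'content').
    """
    lines = block.get("lines") or []
    if not lines:
        return None
    spans = lines[0].get("spans") or []
    if not spans:
        return None
    # .get() returns None instead of raising KeyError when 'content' is absent.
    return spans[0].get("content")


def format_table_body(block):
    """Build the table_body string from the traceback, or None if no content exists."""
    content = first_span_content(block)
    if content is None:
        return None
    # Same f-string layout as line 258 of ocr_mkcontent.py in the traceback.
    return f"\n\n$\n {content}\n$\n\n"
```

With this guard, a block such as `{"lines": [{"spans": [{"image_path": "x.jpg"}]}]}` yields `None` instead of raising, so the caller can decide whether to skip the table body or fall back to something else.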

How to reproduce the bug

Ran magic_pdf_parse_main.py directly.

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

Aug 05 '24 02:08 2257396011

Does the version released today support table content extraction?

Aug 05 '24 02:08 JiangRunzhi

> Does the version released today support table content extraction?

Yes.

Aug 05 '24 02:08 2257396011

@papayalove

Aug 05 '24 02:08 myhloli

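> Does the version released today support table content extraction?

> Yes.

Thanks! One question: I installed version 0.6.2b1, so why are the tables in the output markdown still rendered as images? I set "is_table_recog_enable": true in magic-pdf.json, but it had no effect. Any help would be appreciated.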

Aug 05 '24 06:08 JiangRunzhi

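> Does the version released today support table content extraction?

> Yes.

> Thanks! One question: I installed version 0.6.2b1, so why are the tables in the output markdown still rendered as images? I set "is_table_recog_enable": true in magic-pdf.json, but it had no effect. Any help would be appreciated.

The maintainer just said it still cannot be used for now; we can only wait for the 0.7.x release.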
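For anyone checking the same two things discussed here (the `is_table_recog_enable` flag and the installed release), a rough sketch follows. Everything in it is an assumption rather than magic-pdf API: the helper names are hypothetical, the thread does not show where the flag sits inside magic-pdf.json, so the code searches at any nesting depth, the `table-config` wrapper in the usage note is made up, and the `(0, 7)` minimum comes only from the comment above, not from release notes.

```python
import json
import re


def find_flag(config, key="is_table_recog_enable"):
    """Collect every value stored under `key`, at any nesting depth."""
    found = []
    if isinstance(config, dict):
        for name, value in config.items():
            if name == key:
                found.append(value)
            else:
                found.extend(find_flag(value, key))
    elif isinstance(config, list):
        for item in config:
            found.extend(find_flag(item, key))
    return found


def release_tuple(version):
    """Extract the leading numeric release, e.g. '0.6.2b1' -> (0, 6, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def table_recognition_expected(config_text, version, minimum=(0, 7)):
    """True only if the flag is set to true AND the release is at least `minimum`.

    `minimum` is an assumption taken from this thread, not a documented value.
    """
    flags = find_flag(json.loads(config_text))
    return any(flag is True for flag in flags) and release_tuple(version) >= minimum
```

Usage: `table_recognition_expected('{"table-config": {"is_table_recog_enable": true}}', "0.6.2b1")` returns `False`, which matches the behaviour reported above, where the flag is set but the installed release predates 0.7.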

Aug 05 '24 07:08 2257396011

The bug has been fixed.

Aug 05 '24 09:08 papayalove