
Table recognition is slow and, more importantly, the results are so poor that it is basically unusable

Open dafen12 opened this issue 1 year ago • 9 comments

Description of the bug

Recognizing companies-list.pdf took 5 minutes and produced the result below; the LaTeX parsing quality is poor:

$ begin{tabular}{cccccccc}\multicolumn{2}{c}{\textit{xexcuracy}} & \multicolumn{6}{c}{\textit{Fuge. $z$}} \[2mm]
\multicolumn{2}{c}{\textbf{GROUP NAME}} & \multicolumn{1}{c}{\textbf{GROUP}} & \multicolumn{1}{c}{\textbf{CO NO}} & \multicolumn{1}{c}{\textbf{SIMT}} & \multicolumn{1}{c}{\textbf{STATIS}} & \multicolumn{1}{c}{\textbf{ST}} & \multicolumn{1}{c}{\textbf{COMPANY NAME}} \[2mm]
& CVS,GRP & 1 & 1827 & X & 1 & PA & AEINA,HLTH,ASSUR RANC \[2mm]
& 9938 & X & 1 & CT & AEINA,HLTH,INC CT CORP \[2mm]
& 9088 & X & 1 & FL & AEINA,HLTH,INC FL CORP \[2mm]
& 95994 & X & 1 & GA & AEINA,HLTH,INC GA CORP \[2mm]
& 9817 & X & 1 & ME & AFINA,HLTH,INC ME CORP \[2mm]
& 9527 & X & 1 & NJ & AFINA,HLTH,INC NI CORP \[2mm]
& 9524 & X & 1 & NY & AEINA,HLTH,INC NY CORP \[2mm]
& 9510 & X & 1 & PA & AEINA,HLTH,INC PA CORP \[2mm]
& 95860 & X & 1 & TX & AEINA,HLTH,INC TX CORP \[2mm]
& 7082 & X & 1 & PA & AEINA,HLTH,INS CO \[2mm]
& 8480 & X & 1 & NY & AEINA,HLTH,INS CO OF NY \[2mm]
& 9524 & X & 1 & IA & AEINA,HLTH,OF IA INC \[2mm]
& 9575 & X & 1 & MI & AEINA,HLTH,OF MI INC \[2mm]
& 1808 & X & 1 & OH & AEINA,HLTH,OF OH INC \[2mm]
& 9547 & X & 1 & UT & AFINA,HLTH,OF UTANINC \[2mm]
& 6084 & L & 1 & CT & AEINA,LIFE INS CO \[2mm]
& 17852 & X & 1 & MN & ALINA,HLTH,& AFINA,HLTH,PLANINC \[2mm]
& 1694 & X & 1 & MN & ALINA,HLTH,& AEINA,INS CO \[2mm]
& 1221 & L & 1 & TN & AMERICAN,CONTINTAL INS CO \[2mm]
& 1088 & X & 1 & AZ & BANNER,HLTH,& ASTNA,HLTH,INS CO \[2mm]
& 168 & X & 1 & AZ & BANNER,HLTH,& AEINA,HLTH,PLANINC \[2mm]
& 68500 & L & 1 & TN & CONTINTAL,LIFE INS CO BENTwood \[2mm]
& 81973 & X & 1 & MO & CoverNITY,HLTH,& LIFE INS CO \[2mm]
& 74160 & X & 1 & IL & CoverNITY,HLTH,CARE OF IL INC \[2mm]
& & & & & & \ & & & & & & & \ & & & & & & & \ … $

(The empty row `\ & & & & & & &` repeats for the rest of the output; no end{tabular} is ever emitted.)

How to reproduce the bug

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cuda

dafen12 avatar Aug 06 '24 03:08 dafen12

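This recognition has not finished yet. Increase max_time; recognition only counts as complete once \end{tabular} appears at the end. We recommend running table recognition on machines with CUDA. Have you set device to cuda? Please share a screenshot of the log showing the recognition time. Normally a single table finishes within 100 seconds on CUDA.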
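The completeness rule described here can be checked mechanically. The following is only a minimal sketch, assuming the raw model output is available as a Python string; the function names are hypothetical and are not part of magic-pdf. It also counts trailing rows that contain nothing but `&` separators, which is the pattern visible in the output posted in this issue.

```python
import re

# Matches a LaTeX row break such as \\ or \\[2mm] or \\ [0.2cm]
ROW_BREAK = re.compile(r"\\\\(?:\s*\[[^\]]*\])?")


def is_tabular_complete(latex: str) -> bool:
    """True if every \\begin{tabular} in the text has a matching \\end{tabular}."""
    opens = latex.count(r"\begin{tabular}")
    closes = latex.count(r"\end{tabular}")
    return opens > 0 and opens == closes


def trailing_empty_rows(latex: str) -> int:
    """Count consecutive rows at the end that hold only '&' separators."""
    rows = [r.strip() for r in ROW_BREAK.split(latex) if r.strip()]
    count = 0
    for row in reversed(rows):
        if row.replace("&", "").strip():
            break  # first row with real content ends the run
        count += 1
    return count
```

An output that fails the first check, or that ends in a long run of empty rows, was probably cut off by the time limit or degenerated into repetition, and is not worth passing to a converter.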

papayalove avatar Aug 06 '24 03:08 papayalove

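It is running on a GPU, a 400 W A100, so performance should be adequate, yet recognizing a single table still takes this long. I checked nvidia-smi and GPU utilization is at 96%. Could you help test this document?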

dafen12 avatar Aug 06 '24 03:08 dafen12

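> This recognition has not finished yet. Increase max_time; recognition only counts as complete once \end{tabular} appears at the end. We recommend running table recognition on machines with CUDA. Have you set device to cuda? Please share a screenshot of the log showing the recognition time. Normally a single table finishes within 100 seconds on CUDA.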

dafen12 avatar Aug 06 '24 03:08 dafen12

image

dafen12 avatar Aug 06 '24 05:08 dafen12

I converted your PDF to PNG and ran table recognition on it. The LaTeX code is as follows:

\begin{tabular}{p{6cm}ccccl}\\\hline
\multicolumn{5}{c}{\bf Data Year 2023} \\ [-0.25cm]
\multicolumn{5}{c}{\bf December 21, 2023} \\ [0.2cm]\hline
\multicolumn{5}{c}{\bf Group name} \\ [0.2cm]\hline \\ [-0.25cm]
\multirow{2}{5cm}{\bf GroUP NamE} & \multirow{1}{*}{\bf Group} & \multirow{1}{*}{\bf CO NO} & \multirow{1}{*}{\bf STM} & \multicolumn{1}{c}{\bf STATIS} & \multirow{1}{*}{\bf ST} & \multicolumn{1}{c}{\bf COMPANY NamE} \\ [0.2cm]\\\hline \\ [-0.25cm]
CVS GRP & 1 & 1827 & X & 1 & PA & AEINA HLTH ASRER PA INC \\ [0.2cm]\\
& 95938 & X & 1 & CT & AEINA HLTH INC CT Corp \\ [0.2cm]
& 99088 & X & 1 & FL & AEINA HLTH INC FL Corp \\ [0.2cm]
& 95994 & X & 1 & GA & AEINA HLTH INC GA Corp \\ [0.2cm]
& 95517 & X & 1 & ME & AEINA HLTH INC ME Corp \\ [0.2cm]
& 95287 & X & 1 & NJ & AEINA HLTH INC NY Corp \\ [0.2cm]
& 95234 & X & 1 & NY & AFINA HLTH INC NY \\ [0.2cm]
& 95109 & X & 1 & PA & ATINA HLTH Inc PA Corp \\ [0.2cm]
& 98390 & X & 1 & TX & AEINA HLTH Inc TX Corp \\ [0.2cm]
& 72032 & X & 1 & PA & AEINA HLTH NSC CO \\ [0.2cm]
& 84450 & X & 1 & NY & AFINA HLTH NSC CO OF NY \\ [0.2cm]
& 9524 & X & 1 & IA & AEINA HLTH OF IA INC \\ [0.2cm]
& 95756 & X & 1 & MI & AEINA HLTH OF MI INC \\ [0.2cm]
& 11805 & X & 1 & OH & AEINA HLTH OF OH INC \\ [0.2cm]
& 98407 & X & 1 & UT & AEINA HLTH OF UTH INC \\ [0.2cm]
& 6064 & L & 1 & CT & AEINA LIFE INS CO \\ [0.2cm]
& 17332 & X & 1 & MN & ALINA HLTH \& AEINA HLTH PLAN INC \\ [0.2cm]
& 16194 & X & 1 & MN & ALINA HLTH \& AEINA HLTH PLAN INC \\ [0.2cm]
& 16824 & L & 1 & TN & ALINA HLTH \& AEINA ILS CO \\ [0.2cm]
& 12321 & L & 1 & TN & AMERican COMP 1,1NS CO \\ [0.2cm]
& 16088 & X & 1 & AZ & BANNER HLTH \& AEINA HLTH NSC CO \\ [0.2cm]
& 16099 & X & 1 & AZ & BANNER HLTH \& AERNA HLTH PLAN INC \\ [0.2cm]
& 68800 & L & 1 & TN & COMPA NTALA LIFE INS CO BRENTWOOD \\ [0.2cm]
& 81978 & X & 1 & MO & COMPNTRY HLTH \& LIFE INS CO \\ [0.2cm]
& 74160 & X & 1 & IL & CoverNTRY HLTH CARE OF IL INC \\ [0.2cm]\\\hline \\ [-0.25cm]\end{tabular}

You can try it with this demo.

sky-fly97 avatar Aug 06 '24 06:08 sky-fly97

> I converted your PDF to PNG and ran table recognition on it. The LaTeX code is as follows; you can try it with this demo: \begin{tabular}{p{6cm}ccccl} … \end{tabular}

image

Converting this LaTeX to Markdown or HTML still gives very odd results.

dafen12 avatar Aug 06 '24 06:08 dafen12

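> This recognition has not finished yet. Increase max_time; recognition only counts as complete once \end{tabular} appears at the end. We recommend running table recognition on machines with CUDA. Have you set device to cuda? Please share a screenshot of the log showing the recognition time. Normally a single table finishes within 100 seconds on CUDA.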

Does extracting a single table really cost this much? A document typically contains 3-5 tables, so this parsing time is rather alarming.

xsank avatar Aug 06 '24 06:08 xsank

> I converted your PDF to PNG and ran table recognition on it. The LaTeX code is as follows; you can try it with this demo: \begin{tabular}{p{6cm}ccccl} … \end{tabular}
>
> image
>
> Converting this LaTeX to Markdown or HTML still gives very odd results.

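The model currently outputs LaTeX only. How well it converts to Markdown or HTML depends entirely on the pypandoc library, which may be unable to convert long, complex tables.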
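For tables that the converter cannot handle, one possible fallback is to parse the simple cases by hand. The sketch below is not part of MinerU or pypandoc, and it is not how either of them converts tables; it is a standard-library illustration with hypothetical function names. It assumes a single complete `tabular` environment, splits rows on `\\` (with an optional length such as `[0.2cm]`), splits cells on unescaped `&`, drops `\hline`, and unwraps `\multicolumn`, `\multirow`, `\textbf`, `\textit` and `\bf`.

```python
import re

ROW_BREAK = re.compile(r"\\\\(?:\s*\[[^\]]*\])?")  # \\ or \\ [0.2cm]
CELL_SEP = re.compile(r"(?<!\\)&")                  # '&' but not '\&'
TABULAR = re.compile(
    r"\\begin\{tabular\}\{(?:[^{}]|\{[^{}]*\})*\}(.*)\\end\{tabular\}", re.S
)


def clean_cell(cell: str) -> str:
    """Strip the formatting wrappers seen in this issue from one cell."""
    cell = re.sub(r"\\text(?:bf|it)\{([^{}]*)\}", r"\1", cell)
    cell = re.sub(
        r"\\(?:multicolumn|multirow)\{[^{}]*\}\{[^{}]*\}\{([^{}]*)\}", r"\1", cell
    )
    cell = re.sub(r"\\bf\b\s*", "", cell)
    return cell.replace(r"\&", "&").strip()


def tabular_rows(latex: str) -> list[list[str]]:
    """Return the non-empty rows of a complete tabular as lists of cell text."""
    match = TABULAR.search(latex)
    if match is None:
        raise ValueError("no complete tabular environment found")
    body = match.group(1).replace(r"\hline", "")
    rows = []
    for raw in ROW_BREAK.split(body):
        cells = [clean_cell(c) for c in CELL_SEP.split(raw)]
        if any(cells):  # skip rows where every cell is empty
            rows.append(cells)
    return rows


def to_markdown(rows: list[list[str]]) -> str:
    """Render rows as a pipe table, treating the first row as the header."""
    width = max(len(r) for r in rows)
    padded = [r + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(r) + " |" for r in padded]
    lines.insert(1, "|" + " --- |" * width)
    return "\n".join(lines)
```

Spanning cells lose their span in this sketch, so merged headers such as the `\multicolumn{5}` title rows will not line up with the data columns; those are the cases where a full converter is still needed.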

sky-fly97 avatar Aug 06 '24 07:08 sky-fly97

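> This recognition has not finished yet. Increase max_time; recognition only counts as complete once \end{tabular} appears at the end. We recommend running table recognition on machines with CUDA. Have you set device to cuda? Please share a screenshot of the log showing the recognition time. Normally a single table finishes within 100 seconds on CUDA.
>
> Does extracting a single table really cost this much? A document typically contains 3-5 tables, so this parsing time is rather alarming.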

Sorry for the misunderstanding in the previous description. Based on the results in this issue, it should be considered a failure case. However, when we tested the image cropped from the PDF with our code, most of the resulting LaTeX code was correct, and it took 40 s on an A100 GPU. As for inference speed, we plan to release an accelerated version supported by TensorRT.

PrinceVictor avatar Aug 06 '24 07:08 PrinceVictor