Table recognition is slow and, more importantly, the results are so poor that it is basically unusable
Description of the bug
companies-list.pdf: the result after 5 minutes of recognition is below; the LaTeX output is poor: $ begin{tabular}{cccccccc}\multicolumn{2}{c}{\textit{xexcuracy}} & \multicolumn{6}{c}{\textit{Fuge. $z$}} \[2mm]\multicolumn{2}{c}{\textbf{GROUP NAME}} & \multicolumn{1}{c}{\textbf{GROUP}} & \multicolumn{1}{c}{\textbf{CO NO}} & \multicolumn{1}{c}{\textbf{SIMT}} & \multicolumn{1}{c}{\textbf{STATIS}} & \multicolumn{1}{c}{\textbf{ST}} & \multicolumn{1}{c}{\textbf{COMPANY NAME}} \[2mm] & CVS,GRP & 1 & 1827 & X & 1 & PA & AEINA,HLTH,ASSUR RANC \[2mm] & 9938 & X & 1 & CT & AEINA,HLTH,INC CT CORP \[2mm] & 9088 & X & 1 & FL & AEINA,HLTH,INC FL CORP \[2mm] & 95994 & X & 1 & GA & AEINA,HLTH,INC GA CORP \[2mm] & 9817 & X & 1 & ME & AFINA,HLTH,INC ME CORP \[2mm] & 9527 & X & 1 & NJ & AFINA,HLTH,INC NI CORP \[2mm] & 9524 & X & 1 & NY & AEINA,HLTH,INC NY CORP \[2mm] & 9510 & X & 1 & PA & AEINA,HLTH,INC PA CORP \[2mm] & 95860 & X & 1 & TX & AEINA,HLTH,INC TX CORP \[2mm] & 7082 & X & 1 & PA & AEINA,HLTH,INS CO \[2mm] & 8480 & X & 1 & NY & AEINA,HLTH,INS CO OF NY \[2mm] & 9524 & X & 1 & IA & AEINA,HLTH,OF IA INC \[2mm] & 9575 & X & 1 & MI & AEINA,HLTH,OF MI INC \[2mm] & 1808 & X & 1 & OH & AEINA,HLTH,OF OH INC \[2mm] & 9547 & X & 1 & UT & AFINA,HLTH,OF UTANINC \[2mm] & 6084 & L & 1 & CT & AEINA,LIFE INS CO \[2mm] & 17852 & X & 1 & MN & ALINA,HLTH,& AFINA,HLTH,PLANINC \[2mm] & 1694 & X & 1 & MN & ALINA,HLTH,& AEINA,INS CO \[2mm] & 1221 & L & 1 & TN & AMERICAN,CONTINTAL INS CO \[2mm] & 1088 & X & 1 & AZ & BANNER,HLTH,& ASTNA,HLTH,INS CO \[2mm] & 168 & X & 1 & AZ & BANNER,HLTH,& AEINA,HLTH,PLANINC \[2mm] & 68500 & L & 1 & TN & CONTINTAL,LIFE INS CO BENTwood \[2mm] & 81973 & X & 1 & MO & CoverNITY,HLTH,& LIFE INS CO \[2mm] & 74160 & X & 1 & IL & CoverNITY,HLTH,CARE OF IL INC \[2mm] & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & 
\ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & \ & & & & & & & $
How to reproduce the bug
(No steps provided.)
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.6.x
Device mode
cuda
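Recognition had not finished in this run. Increase max_time; the output only counts as complete once \end{tabular} appears at the end. We recommend running table recognition on a machine with CUDA. Did you set device to cuda? Please share a screenshot of the log showing the recognition time. Normally a single table finishes within 100 seconds on CUDA.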
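To make the completeness criterion above concrete, here is a minimal sketch of a check on the recognized LaTeX string. The function name `is_table_complete` is hypothetical and not part of magic-pdf; it only tests whether a closing `\end{tabular}` follows the last `\begin{tabular}`.

```python
def is_table_complete(latex: str) -> bool:
    # Hypothetical helper, not part of magic-pdf: treat the output as
    # complete only if a closing \end{tabular} comes after the last
    # \begin{tabular}. A run cut off by max_time has no closing tag.
    begin = latex.rfind(r"\begin{tabular}")
    end = latex.rfind(r"\end{tabular}")
    return end != -1 and end > begin


truncated = r"\begin{tabular}{cc} a & b \\ & & \\ & &"
finished = r"\begin{tabular}{cc} a & b \\ \end{tabular}"
print(is_table_complete(truncated))
print(is_table_complete(finished))
```

If the check fails, the run was probably cut off by the time limit, so raising max_time and rerunning would be the first thing to try.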
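It is running on a GPU, a 400 W A100, which should be powerful enough, yet recognizing a single table still takes this long. I checked nvidia-smi and GPU utilization was at 96%. Could you help test this document?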
I converted your PDF to PNG and ran table recognition on it. The LaTeX output is below; you can try it yourself with this demo.
\begin{tabular}{p{6cm}ccccl}\\\hline \multicolumn{5}{c}{\bf Data Year 2023} \\ [-0.25cm]\multicolumn{5}{c}{\bf December 21, 2023} \\ [0.2cm]\hline \multicolumn{5}{c}{\bf Group name} \\ [0.2cm]\hline \\ [-0.25cm]\multirow{2}{5cm}{\bf GroUP NamE} & \multirow{1}{*}{\bf Group} & \multirow{1}{*}{\bf CO NO} & \multirow{1}{*}{\bf STM} & \multicolumn{1}{c}{\bf STATIS} & \multirow{1}{*}{\bf ST} & \multicolumn{1}{c}{\bf COMPANY NamE} \\ [0.2cm]\\\hline \\ [-0.25cm]CVS GRP & 1 & 1827 & X & 1 & PA & AEINA HLTH ASRER PA INC \\ [0.2cm]\\& 95938 & X & 1 & CT & AEINA HLTH INC CT Corp \\ [0.2cm]& 99088 & X & 1 & FL & AEINA HLTH INC FL Corp \\ [0.2cm]& 95994 & X & 1 & GA & AEINA HLTH INC GA Corp \\ [0.2cm]& 95517 & X & 1 & ME & AEINA HLTH INC ME Corp \\ [0.2cm]& 95287 & X & 1 & NJ & AEINA HLTH INC NY Corp \\ [0.2cm]& 95234 & X & 1 & NY & AFINA HLTH INC NY \\ [0.2cm]& 95109 & X & 1 & PA & ATINA HLTH Inc PA Corp \\ [0.2cm]& 98390 & X & 1 & TX & AEINA HLTH Inc TX Corp \\ [0.2cm]& 72032 & X & 1 & PA & AEINA HLTH NSC CO \\ [0.2cm]& 84450 & X & 1 & NY & AFINA HLTH NSC CO OF NY \\ [0.2cm]& 9524 & X & 1 & IA & AEINA HLTH OF IA INC \\ [0.2cm]& 95756 & X & 1 & MI & AEINA HLTH OF MI INC \\ [0.2cm]& 11805 & X & 1 & OH & AEINA HLTH OF OH INC \\ [0.2cm]& 98407 & X & 1 & UT & AEINA HLTH OF UTH INC \\ [0.2cm]& 6064 & L & 1 & CT & AEINA LIFE INS CO \\ [0.2cm]& 17332 & X & 1 & MN & ALINA HLTH \& AEINA HLTH PLAN INC \\ [0.2cm]& 16194 & X & 1 & MN & ALINA HLTH \& AEINA HLTH PLAN INC \\ [0.2cm]& 16824 & L & 1 & TN & ALINA HLTH \& AEINA ILS CO \\ [0.2cm]& 12321 & L & 1 & TN & AMERican COMP 1,1NS CO \\ [0.2cm]& 16088 & X & 1 & AZ & BANNER HLTH \& AEINA HLTH NSC CO \\ [0.2cm]& 16099 & X & 1 & AZ & BANNER HLTH \& AERNA HLTH PLAN INC \\ [0.2cm]& 68800 & L & 1 & TN & COMPA NTALA LIFE INS CO BRENTWOOD \\ [0.2cm]& 81978 & X & 1 & MO & COMPNTRY HLTH \& LIFE INS CO \\ [0.2cm]& 74160 & X & 1 & IL & CoverNTRY HLTH CARE OF IL INC \\ [0.2cm]\\\hline \\ [-0.25cm]\end{tabular}
Converting this LaTeX to Markdown or HTML still gives strange results.
Does extracting a single table really cost this much? A document typically contains 3-5 tables, so this parsing time is a bit scary.
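The model currently outputs LaTeX only, so the quality of the Markdown or HTML conversion depends entirely on the pypandoc library; it may fail to convert long or complex tables.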
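For reference, the conversion step could look roughly like the sketch below. It assumes pypandoc and a pandoc binary are installed; the wrapper name `latex_table_to_html` is hypothetical, and returning `None` on failure is just one way to surface the tables pandoc cannot handle.

```python
from typing import Optional


def latex_table_to_html(latex: str) -> Optional[str]:
    # Hypothetical wrapper: convert a LaTeX tabular to HTML via pypandoc.
    # Returns None if pypandoc or pandoc is unavailable, or if pandoc
    # fails to parse the input.
    try:
        import pypandoc  # imported lazily so the sketch loads without it
        return pypandoc.convert_text(latex, "html", format="latex")
    except (ImportError, OSError, RuntimeError):
        return None


sample = r"\begin{tabular}{cc} a & b \\ c & d \\ \end{tabular}"
html = latex_table_to_html(sample)
print("conversion failed" if html is None else html)
```

When the result is `None` or the HTML looks wrong, keeping the raw LaTeX alongside it at least preserves the recognized content.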
Sorry for the misunderstanding in the previous description. Based on the results in this issue, it should be considered a failure case. However, after testing the table image cropped from the PDF with our code, most of the resulting LaTeX is correct, and it took 40 s on an A100 GPU. To improve inference speed, we plan to release an accelerated version with TensorRT support.