pdfminer.six

Same sentence is printed three times for a specific PDF file when using pdf2txt

Open ehtec opened this issue 3 years ago • 2 comments

Bug report

When I use pdf2txt on a specific PDF file, I get some sentences printed out three times:

high  volatility  of  their  newly  traded  tokens.  By immediately  allowing  all  new  projects  to 
high  volatility  of  their  newly  traded  tokens.  By immediately  allowing  all  new  projects  to 
high  volatility  of  their  newly  traded tokens.  By  immediately  allowing  all  new  projects  to 
come  on  board  METASEER,  this  inevitably  creates  a  utility  for  the  new  project  that 
come  on  board  METASEER,  this  inevitably  creates  a  utility  for  the  new  proj
come  on  board  METASEER,  this  inevitably  creates  a  utility  for  the  new  proj
onboards with us. The new projects will have to create a small liquidity pool and will be 
onboards with us. The new projects will have to create a small liquidity pool and will be 
onboards with us. The new projects will have to create a small liquidity pool and will be 
in the form of (project X/METAS).  
in the form of (project X/METAS). 

friendly  Hybrid  Options 
METASEER  aims  to  provide  its  users  with  a  seamless  user-friendly  Hybrid  Options 
METASEER  aims  to  provide  its  users  with  a  seamless  user
trading platform along with competitive t
ransaction fees and a broader range of markets 
trading platform along with competitive transaction fees and a broader range of markets 
that can be traded. In the future, METASEER will also integrate derivative assets from 
that can be traded. In the future, METASEER will also integrate derivative assets from 
that can be traded. In the future, METASEER will also integrate derivative assets from 
traditional finance such as S&P 500 (SPX), Dow Jones Industrial (DJI), Gold and Brent 
traditional finance such as S&P 500 (SPX), Dow Jones Industrial (DJI), Gold and Brent 
traditional finance such as S&P 500 (SPX), Dow Jones Industrial (DJI), Gold and Brent 
tocks.  Among  the  advantages  of  METASEER 
Crude  Oil,  FX  majors  and  high  caps  Stocks.  Among  the  advantages  of  METASEER 
Crude  Oil,  FX  majors  and  high  caps  S
include:  

a.  Being  decentralized, 

prevent  unnecessary  disruptions  and 
Being  decentralized,  METASEER  prevent  unnecessary  disruptions  and 
interventions by a centralized platform 
interventions by a centralized platform

The command I use:

./pdf2txt.py filename.pdf

Expected output: each sentence is printed only once, since it occurs only once in the PDF.

Note: I also tried to extract it using a different method (see https://git.ehtec.co/research/pie-chart-ocr/-/blob/32-read-text-and-bounding-boxes-from-pdfs-where-text-is-stored-as-text-and-not-as-image/piechartocr/pdf_extractor.py), which produced the same result. Interestingly, the bounding boxes are not the same for all copies of an affected row:

INFO:root:unsorted_output: ('success and equitable rewards. As such, METASEER Hybrid options will allow users to', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 441.3773976489703, 520.8010265932544, 451.7554934977303))
INFO:root:unsorted_output: ('success and equitable rewards. As such, METASEER Hybrid options will allow users to', 10, 'MABLVX+Helvetica', None, 5, (113.51995459199999, 441.3773976489703, 520.8008537406266, 451.7554934977303))
INFO:root:unsorted_output: ('success and equitable rewards. As such, METASEER Hybrid options will allow users to', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 441.3773976489703, 520.8010269041473, 451.7554934977303))
INFO:root:unsorted_output: ('trade options in any direction either Call or Put with a variety of choices such as In th', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 428.53740278497037, 512.4025985781506, 438.91549863373035))
INFO:root:unsorted_output: ('options in any direction either Call or Put with a variety of choices such as In the', 10, 'MABLVX+Helvetica', None, 5, (140.879943648, 428.53740278497037, 520.8007448395429, 438.91549863373035))
INFO:root:unsorted_output: ('options in any direction either Call or Put with a variety of choices such as In th', 10, 'MABLVX+Helvetica', None, 5, (141.119943552, 428.53740278497037, 512.4017732162187, 438.91549863373035))
INFO:root:unsorted_output: ('Money (ITM), At the Money (ATM) or Out of the Money (OTM) based on dynamic strike', 10, 'MABLVX+Helvetica', None, 5, (113.51995459199999, 415.6974079209703, 520.8007382747468, 426.0755037697303))
INFO:root:unsorted_output: ('Money (ITM), At the Money (ATM) or Out of the Money (OTM) based on dynamic strike', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 415.6974079209703, 520.8009108413529, 426.0755037697303))
INFO:root:unsorted_output: ('Money (ITM), At the Money (ATM) or Out of the Money (OTM) based on dynamic strike', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 415.6974079209703, 520.8010859181643, 426.0755037697303))
INFO:root:unsorted_output: ('prices.', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 402.97741300897036, 147.1209531182187, 413.35550885773034))
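As a possible stopgap on the caller's side (not a fix in pdfminer.six itself), the copies could be collapsed after extraction by treating lines as duplicates when they have the same text, are on the same page, and their bounding boxes agree within a small tolerance. The sketch below is only an illustration: `drop_near_duplicates`, the `(text, page, bbox)` tuple layout, and the tolerance of 1.0 are assumptions, not part of any library. It uses values from the log above and does not handle the partially differing copies (such as the 'trade options ...' rows), which would need a fuzzier comparison.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Line = Tuple[str, int, BBox]              # (text, page number, bounding box); hypothetical layout


def drop_near_duplicates(lines: List[Line], tol: float = 1.0) -> List[Line]:
    """Keep the first of any lines with equal text and page whose bboxes differ by at most tol."""
    kept: List[Line] = []
    for text, page, bbox in lines:
        is_duplicate = any(
            text == k_text
            and page == k_page
            and all(abs(a - b) <= tol for a, b in zip(bbox, k_bbox))
            for k_text, k_page, k_bbox in kept
        )
        if not is_duplicate:
            kept.append((text, page, bbox))
    return kept


# Values copied from the log lines above (font size and font name omitted).
row = "success and equitable rewards. As such, METASEER Hybrid options will allow users to"
sample: List[Line] = [
    (row, 5, (113.759954496, 441.3773976489703, 520.8010265932544, 451.7554934977303)),
    (row, 5, (113.51995459199999, 441.3773976489703, 520.8008537406266, 451.7554934977303)),
    (row, 5, (113.759954496, 441.3773976489703, 520.8010269041473, 451.7554934977303)),
    ("prices.", 5, (113.759954496, 402.97741300897036, 147.1209531182187, 413.35550885773034)),
]

deduped = drop_near_duplicates(sample)
for text, page, bbox in deduped:
    print(page, text)
```

The tolerance is a judgment call: the copies logged here differ by roughly 0.24 units in x0, so 1.0 covers them, but a larger value risks merging lines that are genuinely repeated close together.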

I could not reproduce the problem with another PDF file. I also converted the file to text using an online service (https://www.pdf2go.com), which worked without the issue, so it must be a problem with this library and not with the PDF file.

This is the affected file:

METASEER_Whitepaper_v7.7.pdf

Both the python3-pdfminer package from Debian 11 and pdfminer.six from PyPI are affected.

ehtec, Jan 14 '22 00:01

Update: I found another file affected by a similar issue:

WHITEPAPER.pdf

When using my own code, which also outputs positions, I see that entries are duplicated again. This time, however, the copies are not directly below each other; instead, the same items appear on different pages (although each is visible on only one):

INFO:root:unsorted_output: ('V . FLOWCOM BUSINESS MODEL: SUSTAINABILITY', 16, 'YSSKMV+VLAbelPro-Bold', None, 8, (-1187.8862, 728.0488, -834.3022, 744.0488))
INFO:root:unsorted_output: ('VII. FLOWCOM R&D', 16, 'YSSKMV+VLAbelPro-Bold', None, 8, (42.6646, 728.0488, 179.96060000000006, 744.0488))
INFO:root:unsorted_output: ('VIII. ROADMAP', 16, 'YSSKMV+VLAbelPro-Bold', None, 8, (657.9404, 728.0488, 763.0763999999999, 744.0488))
INFO:root:unsorted_output: ('V . FLOWCOM BUSINESS MODEL: SUSTAINABILITY', 16, 'YSSKMV+VLAbelPro-Bold', None, 9, (-1803.1616, 728.0488, -1449.5775999999998, 744.0488))
INFO:root:unsorted_output: ('VII. FLOWCOM R&D', 16, 'YSSKMV+VLAbelPro-Bold', None, 9, (-572.6108, 728.0488, -435.3148, 744.0488))
INFO:root:unsorted_output: ('VIII. ROADMAP', 16, 'YSSKMV+VLAbelPro-Bold', None, 9, (42.665, 728.0488, 147.80100000000002, 744.0488))

The issue appears whether or not I sort the entries by y coordinate.
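In the log above, the copies that are not visible have x coordinates that are negative or far to the right, which suggests they lie outside the page area. A rough way to filter them, sketched under assumptions, is to keep only items whose bounding box intersects the page box. The function names, the `(text, page, bbox)` tuple layout, and the page box of 615 x 792 units are hypothetical: the width is only guessed from the roughly 615-unit offset between copies in the log, and the real value would have to come from the page's own dimensions.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Item = Tuple[str, int, BBox]              # (text, page number, bounding box); hypothetical layout


def is_on_page(bbox: BBox, page_box: BBox) -> bool:
    """Return True if bbox overlaps page_box at all."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_box
    return x0 < px1 and x1 > px0 and y0 < py1 and y1 > py0


def keep_visible(items: List[Item], page_box: BBox) -> List[Item]:
    """Drop items whose bounding box lies entirely outside page_box."""
    return [item for item in items if is_on_page(item[2], page_box)]


# Assumed page box; not taken from the PDF itself.
PAGE_BOX: BBox = (0.0, 0.0, 615.0, 792.0)

# Values copied from the log lines above (font size and font name omitted).
items: List[Item] = [
    ("V . FLOWCOM BUSINESS MODEL: SUSTAINABILITY", 8, (-1187.8862, 728.0488, -834.3022, 744.0488)),
    ("VII. FLOWCOM R&D", 8, (42.6646, 728.0488, 179.96060000000006, 744.0488)),
    ("VIII. ROADMAP", 8, (657.9404, 728.0488, 763.0763999999999, 744.0488)),
    ("V . FLOWCOM BUSINESS MODEL: SUSTAINABILITY", 9, (-1803.1616, 728.0488, -1449.5775999999998, 744.0488)),
    ("VII. FLOWCOM R&D", 9, (-572.6108, 728.0488, -435.3148, 744.0488)),
    ("VIII. ROADMAP", 9, (42.665, 728.0488, 147.80100000000002, 744.0488)),
]

visible = keep_visible(items, PAGE_BOX)
for text, page, _ in visible:
    print(page, text)
```

This only hides the symptom for callers that track positions; it does not explain why the same items are reported on several pages.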

ehtec, Jan 14 '22 01:01

I can reproduce this issue.

pietermarsman, Feb 12 '22 13:02