pdfminer.six
Same sentence is printed three times for a specific PDF file when using pdf2txt
Bug report
When I use pdf2txt on a specific PDF file, some sentences are printed three times:
high volatility of their newly traded tokens. By immediately allowing all new projects to
high volatility of their newly traded tokens. By immediately allowing all new projects to
high volatility of their newly traded tokens. By immediately allowing all new projects to
come on board METASEER, this inevitably creates a utility for the new project that
come on board METASEER, this inevitably creates a utility for the new proj
come on board METASEER, this inevitably creates a utility for the new proj
onboards with us. The new projects will have to create a small liquidity pool and will be
onboards with us. The new projects will have to create a small liquidity pool and will be
onboards with us. The new projects will have to create a small liquidity pool and will be
in the form of (project X/METAS).
in the form of (project X/METAS).
friendly Hybrid Options
METASEER aims to provide its users with a seamless user-friendly Hybrid Options
METASEER aims to provide its users with a seamless user
trading platform along with competitive t
ransaction fees and a broader range of markets
trading platform along with competitive transaction fees and a broader range of markets
that can be traded. In the future, METASEER will also integrate derivative assets from
that can be traded. In the future, METASEER will also integrate derivative assets from
that can be traded. In the future, METASEER will also integrate derivative assets from
traditional finance such as S&P 500 (SPX), Dow Jones Industrial (DJI), Gold and Brent
traditional finance such as S&P 500 (SPX), Dow Jones Industrial (DJI), Gold and Brent
traditional finance such as S&P 500 (SPX), Dow Jones Industrial (DJI), Gold and Brent
tocks. Among the advantages of METASEER
Crude Oil, FX majors and high caps Stocks. Among the advantages of METASEER
Crude Oil, FX majors and high caps S
include:
a. Being decentralized,
prevent unnecessary disruptions and
Being decentralized, METASEER prevent unnecessary disruptions and
interventions by a centralized platform
interventions by a centralized platform
The command I use:
./pdf2txt.py filename.pdf
Expected output: Print the sentences only once, as they occur only once.
Note: I also tried to extract it using a different method (see https://git.ehtec.co/research/pie-chart-ocr/-/blob/32-read-text-and-bounding-boxes-from-pdfs-where-text-is-stored-as-text-and-not-as-image/piechartocr/pdf_extractor.py), which produced the same result. Interestingly, the bounding boxes are not the same for all copies of an affected row:
INFO:root:unsorted_output: ('success and equitable rewards. As such, METASEER Hybrid options will allow users to', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 441.3773976489703, 520.8010265932544, 451.7554934977303))
INFO:root:unsorted_output: ('success and equitable rewards. As such, METASEER Hybrid options will allow users to', 10, 'MABLVX+Helvetica', None, 5, (113.51995459199999, 441.3773976489703, 520.8008537406266, 451.7554934977303))
INFO:root:unsorted_output: ('success and equitable rewards. As such, METASEER Hybrid options will allow users to', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 441.3773976489703, 520.8010269041473, 451.7554934977303))
INFO:root:unsorted_output: ('trade options in any direction either Call or Put with a variety of choices such as In th', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 428.53740278497037, 512.4025985781506, 438.91549863373035))
INFO:root:unsorted_output: ('options in any direction either Call or Put with a variety of choices such as In the', 10, 'MABLVX+Helvetica', None, 5, (140.879943648, 428.53740278497037, 520.8007448395429, 438.91549863373035))
INFO:root:unsorted_output: ('options in any direction either Call or Put with a variety of choices such as In th', 10, 'MABLVX+Helvetica', None, 5, (141.119943552, 428.53740278497037, 512.4017732162187, 438.91549863373035))
INFO:root:unsorted_output: ('Money (ITM), At the Money (ATM) or Out of the Money (OTM) based on dynamic strike', 10, 'MABLVX+Helvetica', None, 5, (113.51995459199999, 415.6974079209703, 520.8007382747468, 426.0755037697303))
INFO:root:unsorted_output: ('Money (ITM), At the Money (ATM) or Out of the Money (OTM) based on dynamic strike', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 415.6974079209703, 520.8009108413529, 426.0755037697303))
INFO:root:unsorted_output: ('Money (ITM), At the Money (ATM) or Out of the Money (OTM) based on dynamic strike', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 415.6974079209703, 520.8010859181643, 426.0755037697303))
INFO:root:unsorted_output: ('prices.', 10, 'MABLVX+Helvetica', None, 5, (113.759954496, 402.97741300897036, 147.1209531182187, 413.35550885773034))
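As a possible workaround on the caller's side (not a fix for the library itself), entries that occupy nearly the same spot on the page could be collapsed after extraction. The sketch below assumes each entry has been reduced to a `(text, bbox)` pair with `bbox = (x0, y0, x1, y1)`, as in the log above. The function names and the 0.9 threshold are hypothetical, and keeping the longest text is only a heuristic: it does not repair a line whose copies are all truncated.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Entry = Tuple[str, BBox]


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area divided by the area of the smaller of the two boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (w * h) / smaller if smaller > 0 else 0.0


def drop_overlapping_duplicates(
    entries: List[Entry], threshold: float = 0.9  # assumed value, tune per document
) -> List[Entry]:
    """Collapse entries whose boxes overlap almost completely.

    Entries are expected to come from a single page. When two entries
    overlap, the one with the longer text is kept.
    """
    kept: List[Entry] = []
    for text, bbox in entries:
        for i, (kept_text, kept_bbox) in enumerate(kept):
            if overlap_ratio(bbox, kept_bbox) >= threshold:
                # Same spot on the page: prefer the more complete text.
                if len(text) > len(kept_text):
                    kept[i] = (text, bbox)
                break
        else:
            kept.append((text, bbox))
    return kept
```

Applied to the page 5 entries logged above, this would leave one entry per visual row, because the copies share their y coordinates and differ only slightly in x.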
I could not reproduce the problem with another PDF file. I also converted the file to text using an online service (https://www.pdf2go.com), which worked without the issue, so it must be a problem with this library and not with the PDF file.
This is the affected file:
Both the python3-pdfminer package from Debian 11 and pdfminer.six from PyPI are affected.
Update: I found another file affected by a similar issue:
When using my own code, which also outputs positions, I see that entries are duplicated again. This time, however, the copies are not directly below each other. Instead, the same items appear on different pages (although each is visible on only one):
INFO:root:unsorted_output: ('V . FLOWCOM BUSINESS MODEL: SUSTAINABILITY', 16, 'YSSKMV+VLAbelPro-Bold', None, 8, (-1187.8862, 728.0488, -834.3022, 744.0488))
INFO:root:unsorted_output: ('VII. FLOWCOM R&D', 16, 'YSSKMV+VLAbelPro-Bold', None, 8, (42.6646, 728.0488, 179.96060000000006, 744.0488))
INFO:root:unsorted_output: ('VIII. ROADMAP', 16, 'YSSKMV+VLAbelPro-Bold', None, 8, (657.9404, 728.0488, 763.0763999999999, 744.0488))
INFO:root:unsorted_output: ('V . FLOWCOM BUSINESS MODEL: SUSTAINABILITY', 16, 'YSSKMV+VLAbelPro-Bold', None, 9, (-1803.1616, 728.0488, -1449.5775999999998, 744.0488))
INFO:root:unsorted_output: ('VII. FLOWCOM R&D', 16, 'YSSKMV+VLAbelPro-Bold', None, 9, (-572.6108, 728.0488, -435.3148, 744.0488))
INFO:root:unsorted_output: ('VIII. ROADMAP', 16, 'YSSKMV+VLAbelPro-Bold', None, 9, (42.665, 728.0488, 147.80100000000002, 744.0488))
The issue appears regardless of whether I sort the entries by their y coordinates.
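In the log above, the copies on the wrong page have x coordinates that are negative or beyond the right edge, so one possible way to hide them is to discard entries whose bounding box lies mostly outside the page. The following is a minimal sketch under assumptions: the page box is passed in by the caller (in pdfminer.six it should be available as the `bbox` attribute of an `LTPage`), the helper names are hypothetical, and the page size used in the usage note is a guess, not a value taken from the file.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def visible_fraction(bbox: BBox, page: BBox) -> float:
    """Fraction of the entry's area that lies inside the page box."""
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page
    w = min(x1, px1) - max(x0, px0)
    h = min(y1, py1) - max(y0, py0)
    area = (x1 - x0) * (y1 - y0)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def is_on_page(bbox: BBox, page: BBox, min_visible: float = 0.5) -> bool:
    """True if at least min_visible (assumed default 0.5) of the box is on the page."""
    return visible_fraction(bbox, page) >= min_visible
```

With an assumed page box of `(0, 0, 615.2756, 800)`, only 'VII. FLOWCOM R&D' would pass the check on page 8 and only 'VIII. ROADMAP' on page 9.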
I can reproduce this issue.