pdfminer
PDF Miner returns different results every time
I have noticed an issue with pdfminer: it returns different results each time I run it on the same PDF document. This is my code:
import requests
from io import BytesIO
from pdfminer import high_level

def pdf_sublink_extraction(pdf_links, sleep):
    associatedTextList = []
    for pdf_link in pdf_links:
        print("pdf link", pdf_link, '\n')
        try:
            response = requests.get(pdf_link)
            print('response', response, '\n')
            with BytesIO(response.content) as data:
                num_of_pages = len(list(high_level.extract_pages(data)))
                full_pdf_text = high_level.extract_text(data, password='', page_numbers=None, maxpages=5, codec='utf-8', caching=True, laparams=None)
                full_pdf_text = full_pdf_text.replace('\n\n\n\n', '\n').strip()
        except:
            full_pdf_text = "PDF File: " + pdf_link + "\n\nUnable to parse PDF file!"
    return full_pdf_text

print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))
print()
print()
print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))
I compared the two outputs with this tool: https://www.diffchecker.com/diff
They differ between runs. The differences are in the numbers on some lines.
Is that a bug, or am I doing something wrong?
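For anyone trying to reproduce this, the comparison can also be done locally instead of through a web tool. Below is a minimal sketch using the standard library's difflib. The helper name diff_extractions and the sample strings are made up for illustration; in practice the two arguments would be the strings returned by two calls to high_level.extract_text on the same PDF.

```python
import difflib


def diff_extractions(first_text, second_text):
    """Return the unified-diff lines between two extracted texts.

    An empty list means the two runs produced identical output.
    (Hypothetical helper, not part of pdfminer.)
    """
    return list(
        difflib.unified_diff(
            first_text.splitlines(),
            second_text.splitlines(),
            fromfile="run_1",
            tofile="run_2",
            lineterm="",
        )
    )


# Placeholder strings standing in for two extraction results.
run_1 = "heading\nvalue 12 34"
run_2 = "heading\nvalue 34 12"

for line in diff_extractions(run_1, run_2):
    print(line)
```

Pasting the lines that start with - or + into the issue makes it easier to see which numbers change between runs.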
If you are running a Python version older than 3.7, you might get non-deterministic behavior, because dictionary ordering is not guaranteed before 3.7. See https://stackoverflow.com/questions/14956313/why-is-dictionary-ordering-non-deterministic
Try upgrading to 3.7 and see whether it runs more consistently.
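A quick way to check which side of that boundary an interpreter is on is sketched below. The function name dict_order_guaranteed is hypothetical. It only reports whether the language guarantees insertion-ordered dicts; it does not confirm that dict ordering is the cause of the differences seen here.

```python
import sys


def dict_order_guaranteed(version_info=sys.version_info):
    """Return True if this Python guarantees insertion-ordered dicts (3.7+)."""
    return tuple(version_info[:2]) >= (3, 7)


print("Python", sys.version.split()[0])
print("dict insertion order guaranteed:", dict_order_guaranteed())
```

If this prints False, upgrading the interpreter and rerunning the extraction twice is a cheap first test.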