pdfminer PDF Miner returns different results every time

Open · aleksandar-devedzic opened this issue 3 years ago · 1 comment

I have noticed an issue with PDFMiner: it returns different results each time I run it on the same PDF document. This is my code:

import requests
from io import BytesIO
from pdfminer import high_level

def pdf_sublink_extraction(pdf_links, sleep):

    associatedTextList = []
    for pdf_link in pdf_links:
        print("pdf link", pdf_link, '\n')
        try:
            response = requests.get(pdf_link)
            print('response', response, '\n')
            with BytesIO(response.content) as data:

                num_of_pages = len(list(high_level.extract_pages(data)))

                full_pdf_text = high_level.extract_text(data, password='', page_numbers = None, maxpages = 5, codec='utf-8', caching=True, laparams=None)
                full_pdf_text = full_pdf_text.replace('\n\n\n\n', '\n').strip()

        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            full_pdf_text = "PDF File: " + pdf_link + "\n\nUnable to parse PDF file!"

    return full_pdf_text

print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))
print()
print()
print(pdf_sublink_extraction(['https://www.buelach.ch/fileadmin/files/documents/Finanzen/2016_2020_finanzplan.pdf'], 0))

I compared the two outputs with this tool: https://www.diffchecker.com/diff

The outputs differ. The differences are in the numbers on some lines.

Is that a bug, or am I doing something wrong?
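The same comparison can be done locally instead of pasting into an online tool. This is only a minimal sketch using the standard library's `difflib`; the helper name `differing_lines` and the labels `run_1` and `run_2` are made up for illustration and are not part of pdfminer.

```python
import difflib
from typing import List


def differing_lines(first: str, second: str) -> List[str]:
    """Return unified-diff lines for two extracted texts.

    The list is empty when both texts are identical.
    """
    return list(
        difflib.unified_diff(
            first.splitlines(),
            second.splitlines(),
            fromfile="run_1",  # hypothetical label for the first extraction
            tofile="run_2",    # hypothetical label for the second extraction
            lineterm="",
        )
    )
```

Passing the two strings returned by the two `pdf_sublink_extraction` calls and printing each returned line should show exactly which lines, and which numbers, changed between runs.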

Apr 16 '21 08:04 aleksandar-devedzic

If you are running a Python version older than 3.7, you might get non-deterministic behavior, because dictionary ordering is not guaranteed in those versions: https://stackoverflow.com/questions/14956313/why-is-dictionary-ordering-non-deterministic

Try upgrading to Python 3.7 or later and see whether the results become consistent.
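One way to test this suggestion is to check the interpreter version and then run the extraction several times on the same downloaded bytes, comparing a hash of each result. The sketch below is an assumption-laden illustration, not part of pdfminer: the helper names are hypothetical, and `extract` stands for any function that turns PDF bytes into text.

```python
import hashlib
import sys
from typing import Callable, List


def dict_order_guaranteed() -> bool:
    """True on Python 3.7+, where dict insertion order is part of the language."""
    return sys.version_info >= (3, 7)


def extraction_digests(
    extract: Callable[[bytes], str], pdf_bytes: bytes, runs: int = 3
) -> List[str]:
    """Run `extract` repeatedly on the same bytes; return one SHA-256 digest per run."""
    return [
        hashlib.sha256(extract(pdf_bytes).encode("utf-8")).hexdigest()
        for _ in range(runs)
    ]


def is_deterministic(
    extract: Callable[[bytes], str], pdf_bytes: bytes, runs: int = 3
) -> bool:
    """True if every run produced byte-identical text."""
    return len(set(extraction_digests(extract, pdf_bytes, runs))) == 1


# Possible use with the code from the question (requires pdfminer and a download):
#   from io import BytesIO
#   from pdfminer import high_level
#   extract = lambda b: high_level.extract_text(BytesIO(b), maxpages=5)
#   print(dict_order_guaranteed(), is_deterministic(extract, response.content))
```

Downloading the file once and reusing the same bytes for every run also rules out the server returning a different file as the cause of the differences.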

Sep 03 '21 18:09 kriffe
