
Memory issues on very large PDFs

SpencerNorris opened this issue 4 years ago • 20 comments

I'm currently trying to extract a ~28,000 page PDF (not a typo) and am running up against memory limits when I process it in a loop.

import pandas as pd
import pdfplumber
from os import path

#Read in data
pdf = pdfplumber.open("data/my.pdf")

#Create settings for extraction
table_settings = {
    "vertical_strategy": "text", #No lines on table
    "horizontal_strategy": "text", #No lines on table
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
COLUMNS = [
    'Work Date',
    'Employee Number',
    'Pay Type',
    'Hours',
    'Account Number',
    'Hourly Rate',
    'Gross',
    'Job Code',
    'Activity Code'
]
#Begin extracting pages one at a time
for page_num, page in enumerate(pdf.pages):
    try:
        #Pull the table
        table = page.extract_table(table_settings)
        #Drop the first row
        table = table[1:]
        #Read into DataFrame
        df = pd.DataFrame(table, columns=COLUMNS)
        #Output to CSV
        df.to_csv(path.join('data', 'output', str(page_num) + '.csv'))
    #There's bound to be a billion issues with the data
    except Exception as ex:
        print(f"Error on page {page_num}.")
        print(ex)

I'm handling this one page at a time because if this bombs out at any point, then I lose all my work.

As the loop runs, memory consumption keeps growing until it hits about 5 GB (about all the memory I have left on my machine).

I suspect a memory leak, but I'm not sure; I'd have figured that memory would be released as the loop iterates.
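
For what it's worth, a minimal way to confirm the per-page growth is to log the process's resident memory inside the loop. This is just a sketch and assumes the third-party psutil package:

import pdfplumber
import psutil

process = psutil.Process()  # the current process

with pdfplumber.open("data/my.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        page.extract_text()  # or extract_table(...), i.e. whatever per-page work you do
        # Print resident memory after each page to see whether it keeps climbing
        print(f"page {i}: rss = {process.memory_info().rss / 1024 / 1024:.1f} MiB")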

SpencerNorris avatar Mar 19 '20 21:03 SpencerNorris

Hi @SpencerNorris, and thanks for pushing the limits of pdfplumber! While it's possible there's a memory leak in pdfplumber itself, it's hard to debug this issue without the PDF in hand. It's also possible that the issue stems from pdfminer.six, which underlies this tool. If you try running your PDF through pdfminer.six's pdf2txt, do you run into the same memory issues?
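
For reference, pdfminer.six's high-level API exercises roughly the same code path as the pdf2txt command-line tool, so a quick Python-side test could look like this (a minimal sketch):

from pdfminer.high_level import extract_text

# Parse and lay out every page, returning the extracted text, while you watch memory usage
text = extract_text("data/my.pdf")
print(len(text))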

jsvine avatar Mar 30 '20 01:03 jsvine

To get around this, I open and close the PDF file each time.

with pdfplumber.open("data/my.pdf") as pdf:
    num_of_pages = len(pdf.pages)

for page_number in range(num_of_pages):
    with pdfplumber.open("data/my.pdf") as pdf:
        page = pdf.pages[page_number]
        pass  # process the page here

Garbage collection doesn't seem to happen until the PDF is closed. I'm not sure what causes the issue, but maybe this can help somebody figure it out.

For context, a 41-page PDF (quite complicated, as it was a CAD drawing) would peak at 9.5 GB of memory. With the above solution, peak usage was around 450-500 MB.

rosswf avatar Jul 01 '20 09:07 rosswf

I had the same issue (memory leak) with pdfplumber whilst extracting data from an 8,000 page PDF document. I tried implementing @rosswf's solution with garbage collection and it worked, but the time taken was extremely long: approximately 48 hours and 200 MB of memory to extract all the text. However, using pdftotext worked extremely well. It took me approximately 1 minute and 49 MB to extract all the data.
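
For reference, the pdftotext route can look like this. This sketch assumes the pdftotext Python package (a Poppler wrapper); the poppler-utils command-line tool of the same name works just as well:

import pdftotext

with open("data/my.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)        # parses the whole document up front
    text = "\n\n".join(pdf)       # pages are exposed as a sequence of strings

print(len(text))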

royashoya avatar Oct 16 '20 09:10 royashoya

Yes, this is a problem, and one I'd like to fix. Based on a bit of exploration, it seems that the memory issues might stem from within pdfminer.six, possibly in PDFPageInterpreter.execute(...) or PDFContentParser. I haven't found the time to dig deeper yet, but hope to. Closing this issue in favor of the more recently active one here: https://github.com/jsvine/pdfplumber/issues/263

I'll aim to update this thread if I find a solution, and certainly welcome any insight from people following this discussion.

jsvine avatar Oct 20 '20 00:10 jsvine

There is an issue in either pdfminer or pdfplumber.

If you want to get around this, use the following code snippet. I took the code from @rosswf and improved it to read a batch of pages instead of a single page at a time, which makes it much faster.

import math

import pdfplumber

def split(a, n):
    # Split sequence a into n roughly equal consecutive chunks
    k, m = divmod(len(a), n)
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

file_path = "data/my.pdf"  # path to your PDF

with pdfplumber.open(file_path) as pdf:
    total_pages = len(pdf.pages)

# Magic number: 500 --> change this as per your memory limits, i.e. how big a batch you can read without the memory error
page_ranges = list(split(range(total_pages), math.ceil(total_pages / 500)))

# print(f"page ranges -> {page_ranges}")
for page_range in page_ranges:
    with pdfplumber.open(file_path) as pdf:
        for page_number in page_range:
            pg = pdf.pages[page_number]
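
The inner loop above is just a skeleton. Continuing from the snippet above, here is the same batch loop with the per-page work filled in, using extract_text() as a stand-in for whatever you actually do with each page:

for page_range in page_ranges:
    with pdfplumber.open(file_path) as pdf:
        for page_number in page_range:
            pg = pdf.pages[page_number]
            text = pg.extract_text()
            # ... do something with text ...
# Whatever has accumulated for a batch appears to be released once its `with` block
# exits (per @rosswf's observation above), so peak memory scales with the batch size
# rather than the document size.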

gauravshegokar avatar Feb 09 '21 00:02 gauravshegokar

Thanks for re-flagging this. Based on some testing, I think there's a more straightforward solution — one which does not require you to open and close the PDF multiple times:

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        del page._objects
        del page._layout

If this approach works, I'll aim to get a more convenient page-closing method into the next release.

jsvine avatar Feb 10 '21 04:02 jsvine

Update: Hah, I forgot that pdfplumber already has an (undocumented) way of doing this :)

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        page.flush_cache()

jsvine avatar Feb 10 '21 14:02 jsvine

I've had memory issues as well. I've tried without success to find the possible leak, but in case it helps, here are three runs of the same workflow with different approaches. The workflow consists of downloading, cropping, and extracting text from about 15-20 PDFs, each with 1 to 36 pages.

  1. Extracting text from one PDF at a time (each spike is a PDF; the massive memory-use spike is the PDF with 36 pages, with memory increasing for each page).

  2. Extracting text one page at a time (treating each page as if it were an independent PDF), and afterwards concatenating the extracted text strings from each page of a given document.

  3. Same as 2, but clearing pdfplumber's _decimalize lru_cache after each page extraction (that is, merely adding ONE extra line of code at the end of each extraction: pdfplumber.utils._decimalize.cache_clear(); see the sketch after this list). This had an almost negligible negative time impact (at least for my workflow), but kept memory much lower and well under control during the whole workflow.
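
A simplified sketch of that one extra line in approach 3, shown here in a plain per-page loop rather than the exact per-page-as-independent-PDF setup described above (_decimalize is a private helper that only exists in pdfplumber versions of that era):

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        pdfplumber.utils._decimalize.cache_clear()  # the one extra line per extraction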

acortad avatar Sep 25 '21 14:09 acortad

I found that after extracting text, the lru_cache was somehow not being cleared, causing memory to keep filling up until it eventually ran out. After some playing around, I found that the following code helped me. In the code below, I am clearing both the page cache and the lru_cache.

with pdfplumber.open("path-to-pdf/my.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text(layout=True)
        page.flush_cache()
        # This was the fn where the cache is implemented
        page.get_text_layout.cache_clear()

PS: I am currently using pdfplumber version 0.7.1. I hope this helps someone.

navkirat avatar Jul 12 '22 02:07 navkirat

Hi @navkirat, and thanks for flagging. Are you able to share the PDF that triggered the memory issues?

jsvine avatar Jul 16 '22 20:07 jsvine

Uhh, I tried all of the above and it seems nothing is working for me. I used memory_profiler, and the profile shows a spike at lines 27 and 28 which doesn't come down. Also, if I run it multiple times, memory usage just keeps on climbing. I'm making an async web request to fetch a 392-page PDF and trying to extract one particular page of it.

ParthJai avatar Oct 11 '22 01:10 ParthJai

Having a similar problem. @ParthJai, have you found any solution?

AnuraagKhare avatar Feb 26 '24 13:02 AnuraagKhare

What is the size of the PDF file?

navkirat avatar Feb 26 '24 13:02 navkirat

What is the size of the PDF file?

I am processing close to 60 files consecutively, each no more than 1 MB in size. The memory usage keeps increasing even after my extract_text_from_pdf() function returns the extracted text.

AnuraagKhare avatar Feb 26 '24 13:02 AnuraagKhare

Thanks for flagging @AnuraagKhare. Can you share a PDF that, when processed repeatedly, reproduces the issue? (Or all 60 files, but I figured 1 will be simpler.)
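
In the meantime, one thing worth ruling out is whether the pdfplumber-side caches are accumulating across files. Here is a sketch of a per-file loop that closes each PDF before the next is opened, flushes the per-page cache, and forces a garbage-collection pass between files (the extract_text_from_pdf() name is just borrowed from the comment above; the actual implementation there may differ):

import gc

import pdfplumber

def extract_text_from_pdf(path):
    parts = []
    with pdfplumber.open(path) as pdf:   # each file is closed before the next one is opened
        for page in pdf.pages:
            parts.append(page.extract_text() or "")
            page.flush_cache()           # drop pdfplumber's cached objects for this page
    gc.collect()                         # encourage collection between files
    return "\n".join(parts)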

jsvine avatar Mar 02 '24 15:03 jsvine

Update: Hah, I forgot that pdfplumber already has an (undocumented) way of doing this :)

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        page.flush_cache()

Does this really work? It doesn't seem to, for me...

xsank avatar Mar 20 '24 09:03 xsank

@xsank .flush_cache() refers, specifically, to objects that pdfplumber itself has explicitly cached. Unfortunately, I haven't found a way to free up the memory that pdfminer.six is allocating.
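
One blunt workaround, which is not something pdfplumber or pdfminer.six provides but a general pattern for leaky libraries, is to run the extraction in short-lived worker processes so the operating system reclaims the memory when each worker exits. A sketch (the 100-page chunk size is arbitrary):

from multiprocessing import Pool

import pdfplumber

PATH = "data/my.pdf"

def extract_chunk(page_range):
    first, last = page_range
    out = []
    with pdfplumber.open(PATH) as pdf:
        for page in pdf.pages[first:last]:
            out.append(page.extract_text() or "")
    return out

if __name__ == "__main__":
    with pdfplumber.open(PATH) as pdf:
        n = len(pdf.pages)
    chunks = [(i, min(i + 100, n)) for i in range(0, n, 100)]
    # maxtasksperchild=1 + chunksize=1: each chunk runs in a fresh process, so any
    # memory pdfminer.six holds on to is returned to the OS when that process exits.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        pages_text = [t for chunk in pool.map(extract_chunk, chunks, chunksize=1) for t in chunk]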

jsvine avatar Mar 22 '24 21:03 jsvine

Did anyone find a solution for the pdfminer.six memory problem?

yoavkedem1 avatar Apr 10 '24 10:04 yoavkedem1

Hey, just thought I would contribute here, since I just ran into this issue...

I have a 211 page PDF that I am testing pdfplumber with (great tool, by the way, thank you @jsvine!), and after looking at various solutions, I have found the following:

Original run (no memory management code):

Time taken: 59.85609221458435
Filename: pdf_loader.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     30.9 MiB     30.9 MiB           1   @profile
     7                                         def parse_pdf(file_path):
     8     30.9 MiB      0.0 MiB           1       start = time.time()
     9    765.4 MiB      0.3 MiB           2       with pdfplumber.open(file_path) as pdf:
    10    765.1 MiB      1.3 MiB         212           for page in pdf.pages:
    11    765.1 MiB    732.8 MiB         211               temp = page.extract_text()
    12    765.1 MiB      0.0 MiB         211               print(f"--------- PAGE {page.page_number} ({len(temp)}) ----------")
    13    765.4 MiB      0.0 MiB           1       print(f"Time taken: {time.time() - start}")

Adding in two lines of memory management code:

Time taken: 105.56745266914368
Filename: pdf_loader.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     30.7 MiB     30.7 MiB           1   @profile
     7                                         def parse_pdf(file_path):
     8     30.7 MiB      0.0 MiB           1       start = time.time()
     9     43.8 MiB      0.1 MiB           2       with pdfplumber.open(file_path) as pdf:
    10     43.8 MiB      1.1 MiB         212           for page in pdf.pages:
    11     43.8 MiB     11.2 MiB         211               temp = page.extract_text()
    12     43.8 MiB     -0.3 MiB         211               print(f"--------- PAGE {page.page_number} ({len(temp)}) ----------")
    13     43.8 MiB     -0.3 MiB         211               page.flush_cache()
    14     43.8 MiB     -0.3 MiB         211               page.get_textmap.cache_clear()
    15
    16     43.8 MiB      0.0 MiB           1       print(f"Time taken: {time.time() - start}")

As you can see, when I add the following code to clean up the caches:

page.flush_cache()
page.get_textmap.cache_clear()

The memory usage is actually pretty reasonable!

There is a significant time difference (~45 seconds), but this is a fair trade-off for reducing the memory footprint to something manageable. Also, since I am performing a lot of post-processing on each page (on the order of minutes per page), this is negligible for me.

Note: You need BOTH the .flush_cache() and .get_textmap.cache_clear() for this to be effective.
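
For anyone who wants to copy-paste, the second (memory-managed) run boils down to the following, reconstructed from the profiler listing above with the memory_profiler @profile decorator left off:

import time

import pdfplumber

def parse_pdf(file_path):
    start = time.time()
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            temp = page.extract_text()
            print(f"--------- PAGE {page.page_number} ({len(temp)}) ----------")
            page.flush_cache()               # drop pdfplumber's cached page objects
            page.get_textmap.cache_clear()   # drop the cached textmap behind extract_text()
    print(f"Time taken: {time.time() - start}")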

[edit] On subsequent re-runs, it looks like the time is about the same between the two, so I am not sure where that extra 45 seconds came from originally.

aronweiler avatar May 08 '24 05:05 aronweiler