pdfplumber
Memory issues on very large PDFs
I'm currently trying to extract a ~28,000 page PDF (not a typo) and am running up against memory limits when I run in a loop.
import pandas as pd
import pdfplumber
from os import path

# Read in data
pdf = pdfplumber.open("data/my.pdf")

# Create settings for extraction
table_settings = {
    "vertical_strategy": "text",    # No lines on table
    "horizontal_strategy": "text",  # No lines on table
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}

COLUMNS = [
    'Work Date',
    'Employee Number',
    'Pay Type',
    'Hours',
    'Account Number',
    'Hourly Rate',
    'Gross',
    'Job Code',
    'Activity Code',
]

# Begin extracting pages one at a time
page_num = 0
for page in pdf.pages:
    try:
        # Pull the table
        table = page.extract_table(table_settings)
        # Drop the first row
        table = table[1:]
        # Read into DataFrame
        df = pd.DataFrame(table, columns=COLUMNS)
        # Output to CSV
        df.to_csv(path.join('data', 'output', str(page_num) + '.csv'))
        page_num += 1
    # There's bound to be a billion issues with the data
    except Exception as ex:
        print("Error on page ", page_num, ".")
        print(ex)
        page_num += 1
I'm handling this one page at a time because if this bombs out at any point, then I lose all my work.
As the loop runs, memory consumption keeps growing until it hits about 5 GB (roughly all the space I have left on my machine).
I suspect a memory leak, but I'm not sure; I'd have figured that memory would be released as the loop iterates.
Hi @SpencerNorris, and thanks for pushing the limits of pdfplumber! While it's possible there's a memory leak in pdfplumber itself, it's hard to debug this issue without the PDF in hand. It's also possible that the issue stems from pdfminer.six, which underlies this tool. If you try running your PDF through pdfminer.six's pdf2txt, do you run into the same memory issues?
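To run that comparison quickly, something like this should show whether memory climbs the same way (a minimal sketch using pdfminer.six's high-level API; pdf2txt.py on the command line goes through roughly the same machinery):

# Minimal sketch, assuming a recent pdfminer.six: extract text directly and
# watch whether memory grows the same way as with pdfplumber.
from pdfminer.high_level import extract_text

text = extract_text("data/my.pdf")
print(len(text))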
To get around this, I open and close the PDF file each time.
with pdfplumber.open("data/my.pdf") as pdf:
    num_of_pages = len(pdf.pages)

for page_number in range(num_of_pages):
    with pdfplumber.open("data/my.pdf") as pdf:
        page = pdf.pages[page_number]
        pass
Garbage collection doesn't seem to happen until the PDF is closed. I'm not sure what causes the issue, but maybe this can help somebody figure it out.
For context, a 41-page PDF (quite complicated, as it was a CAD drawing) would peak at 9.5 GB of memory. With the above solution, peak usage was around 450-500 MB.
I had the same issue (memory leak) with pdfplumber while extracting data from an 8,000-page PDF document. I tried implementing @rosswf's solution with garbage collection and it worked, but the time taken was extremely long: approximately 48 hours, peaking at around 200 MB of memory, to extract all the text. However, using pdftotext worked extremely well: approximately 1 minute and 49 MB to extract all the data.
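(Assuming pdftotext here means the poppler-utils command-line tool, the equivalent call from Python is roughly:)

# Assumption: "pdftotext" refers to the poppler-utils CLI. This writes the
# extracted text of data/my.pdf to data/my.txt in one pass.
import subprocess

subprocess.run(["pdftotext", "data/my.pdf", "data/my.txt"], check=True)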
Yes, this is a problem, and one I'd like to fix. Based on a bit of exploration, it seems that the memory issues might stem from within pdfminer.six, possibly in PDFPageInterpreter.execute(...) or PDFContentParser. I haven't found the time to dig deeper yet, but hope to. Closing this issue in favor of the more recently active one here: https://github.com/jsvine/pdfplumber/issues/263
I'll aim to update this thread if I find a solution, and certainly welcome any insight from people following this discussion.
There is an issue in either pdfminer or pdfplumber.
If you want to get around it, use the following code snippet. I took the code from @rosswf and improved it to read a batch of pages instead of a single page at a time, which makes it much faster.
import math

import pdfplumber

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

with pdfplumber.open(file_path) as pdf:
    total_pages = len(pdf.pages)

# Magic number - 500 --> change this as per your memory limits,
# i.e. how many pages you can read without hitting the memory error
page_ranges = list(split(range(total_pages), math.ceil(total_pages / 500)))
# print(f"page ranges -> {page_ranges}")

for page_range in page_ranges:
    with pdfplumber.open(file_path) as pdf:
        for page_number in list(page_range):
            pg = pdf.pages[page_number]
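The inner loop above only grabs the page object; in practice you would do your per-page work there. A sketch, with extract_text() standing in for whatever processing you actually need:

# Sketch only: extract_text() stands in for your real per-page processing, and
# results are collected before each batch's PDF handle is closed.
texts = []
for page_range in page_ranges:
    with pdfplumber.open(file_path) as pdf:
        for page_number in page_range:
            texts.append(pdf.pages[page_number].extract_text() or "")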
Thanks for re-flagging this. Based on some testing, I think there's a more straightforward solution — one which does not require you to open and close the PDF multiple times:
with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        del page._objects
        del page._layout
If this approach works, I'll aim to get a more convenient page-closing method into the next release.
Update: Hah, I forgot that pdfplumber already has an (undocumented) way of doing this :)
with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        page.flush_cache()
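Applied to the table-extraction loop from the original post, that would look roughly like this:

# Rough sketch: the original per-page CSV loop with flush_cache() added;
# table_settings and COLUMNS are as defined in the original post.
import pandas as pd
import pdfplumber
from os import path

with pdfplumber.open("data/my.pdf") as pdf:
    for page_num, page in enumerate(pdf.pages):
        try:
            table = page.extract_table(table_settings)
            df = pd.DataFrame(table[1:], columns=COLUMNS)
            df.to_csv(path.join('data', 'output', str(page_num) + '.csv'))
        except Exception as ex:
            print("Error on page", page_num, ":", ex)
        finally:
            page.flush_cache()  # release pdfplumber's cached objects for this page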
I've had memory issues as well. I've tried, without success, to find the possible leak, but in case it helps, here are three runs of the same workflow with different approaches. The workflow consists of downloading, cropping, and extracting text from about 15-20 PDFs, each with 1 to 36 pages.
1. Extracting text from one PDF at a time (each spike is a PDF; the massive memory spike is the 36-page PDF, increasing with each page).
2. Extracting text one page at a time (treating each page as if it were an independent PDF), and afterwards concatenating the extracted text strings from each page of a specific document.
3. Same as 2, but clearing pdfplumber's _decimalize lru_cache after each page extraction (that is, merely adding ONE extra line of code at the end of each extraction: pdfplumber.utils._decimalize.cache_clear()). This had an almost negligible negative time impact (at least for my workflow), but kept memory much lower and well under control during the whole workflow.
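For concreteness, the per-page part of approach 3 looked roughly like this (the filename is a stand-in, and _decimalize is an internal helper, so this only applies to pdfplumber versions that still have it):

# Sketch of approach 3. "document.pdf" is a stand-in filename; _decimalize is an
# internal pdfplumber helper, so this only applies to versions that still have it.
import pdfplumber

page_texts = []
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        page_texts.append(page.extract_text() or "")
        pdfplumber.utils._decimalize.cache_clear()  # the one extra line
full_text = "\n".join(page_texts)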
I found that after extracting text, the lru_cache was somehow not being cleared, causing memory to keep filling up until it eventually ran out. After some playing around, I found the following code helped me. In the code below, I am clearing both the page cache and the lru cache.
with pdfplumber.open("path-to-pdf/my.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text(layout=True)
        page.flush_cache()
        # This was the fn where the cache is implemented
        page.get_text_layout.cache_clear()
PS: I am currently using pdfplumber version 0.7.1. I hope this helps someone.
Hi @navkirat, and thanks for flagging. Are you able to share the PDF that triggered the memory issues?
Uhh, I tried all of the above and nothing seems to be working for me. I used memory_profiler and this is what I got: [screenshot: https://user-images.githubusercontent.com/42300701/194978592-cd9eef31-0d26-42a9-af77-51d1b85c4776.png]
As you can see, there is a spike at lines 27 and 28 which doesn't come down. Also, if I run it multiple times, memory usage just keeps climbing.
I'm making an async web request to fetch a 392-page PDF and trying to extract one particular page of it.
Having a similar problem. Have you found any solution?
What is the size of the PDF file?
I am processing close to 60 files consecutively, each not more than 1 MB in size. The memory usage keeps increasing even after the extract_text_from_pdf() function returns the extracted text.
Thanks for flagging @AnuraagKhare. Can you share a PDF that, when processed repeatedly, reproduces the issue? (Or all 60 files, but I figured 1 will be simpler.)
Update: Hah, I forgot that pdfplumber already has an (undocumented) way of doing this :)

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        page.flush_cache()
Does this really work? It seems not...
@xsank .flush_cache() refers, specifically, to objects that pdfplumber itself has explicitly cached. Unfortunately, I haven't found a way to free up the memory that pdfminer.six is allocating.
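One blunt workaround, separate from anything pdfplumber or pdfminer.six can do internally, is to run the extraction in short-lived worker processes so that whatever memory they hold is returned to the OS when each worker exits. A sketch (file path is illustrative, and max_tasks_per_child requires Python 3.11+):

# Sketch of process isolation: each page is extracted in a worker process, and
# workers are recycled periodically so any leaked memory is reclaimed on exit.
from concurrent.futures import ProcessPoolExecutor

import pdfplumber

def extract_page_text(args):
    path, page_number = args
    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_number].extract_text()

if __name__ == "__main__":
    path = "data/my.pdf"  # illustrative path
    with pdfplumber.open(path) as pdf:
        n_pages = len(pdf.pages)
    with ProcessPoolExecutor(max_workers=2, max_tasks_per_child=50) as pool:
        texts = list(pool.map(extract_page_text, [(path, i) for i in range(n_pages)]))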
Did anyone find a solution for the pdfminer.six memory problem?
Hey, just thought I would contribute here, since I just ran into this issue...
I have a 211-page PDF that I am testing pdfplumber with (great tool, by the way, thank you @jsvine!), and after looking at various solutions, I have found the following:
Original run (no memory management code):
Time taken: 59.85609221458435
Filename: pdf_loader.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     30.9 MiB     30.9 MiB           1   @profile
     7                                         def parse_pdf(file_path):
     8     30.9 MiB      0.0 MiB           1       start = time.time()
     9    765.4 MiB      0.3 MiB           2       with pdfplumber.open(file_path) as pdf:
    10    765.1 MiB      1.3 MiB         212           for page in pdf.pages:
    11    765.1 MiB    732.8 MiB         211               temp = page.extract_text()
    12    765.1 MiB      0.0 MiB         211               print(f"--------- PAGE {page.page_number} ({len(temp)}) ----------")
    13    765.4 MiB      0.0 MiB           1       print(f"Time taken: {time.time() - start}")
Adding in two lines of memory management code:
Time taken: 105.56745266914368
Filename: pdf_loader.py
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     30.7 MiB     30.7 MiB           1   @profile
     7                                         def parse_pdf(file_path):
     8     30.7 MiB      0.0 MiB           1       start = time.time()
     9     43.8 MiB      0.1 MiB           2       with pdfplumber.open(file_path) as pdf:
    10     43.8 MiB      1.1 MiB         212           for page in pdf.pages:
    11     43.8 MiB     11.2 MiB         211               temp = page.extract_text()
    12     43.8 MiB     -0.3 MiB         211               print(f"--------- PAGE {page.page_number} ({len(temp)}) ----------")
    13     43.8 MiB     -0.3 MiB         211               page.flush_cache()
    14     43.8 MiB     -0.3 MiB         211               page.get_textmap.cache_clear()
    15
    16     43.8 MiB      0.0 MiB           1       print(f"Time taken: {time.time() - start}")
As you can see, when I add the following code to clean up the caches:
page.flush_cache()
page.get_textmap.cache_clear()
The memory usage is actually pretty reasonable!
There is a significant time difference (~45 seconds), but this is a fair trade-off for reducing the memory footprint to something manageable. Also, since I am performing a lot of post-processing on each page (on the order of minutes per page), this is negligible for me.
Note: You need BOTH .flush_cache() and .get_textmap.cache_clear() for this to be effective.
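Pulled out of the profiler listing, the loop is just (assuming a pdfplumber version where Page.get_textmap is an lru_cache-decorated method):

# The per-page loop from the profiled run above; assumes a pdfplumber version
# where Page.get_textmap is an lru_cache-decorated method.
import pdfplumber

file_path = "my_211_page.pdf"  # stand-in path
with pdfplumber.open(file_path) as pdf:
    for page in pdf.pages:
        temp = page.extract_text()
        # ... do per-page work with `temp` here ...
        page.flush_cache()               # drop pdfplumber's cached page objects
        page.get_textmap.cache_clear()   # drop the cached textmap as well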
[edit] On subsequent re-runs, it looks like the time is about the same between the two, so I am not sure where that extra 45 seconds came from originally.