Potential memory leak with extract_words function
Describe the bug
There seems to be a memory leak when running the extract_words function. I've explored past issues that suggested calling page.close() or page.get_textmap.cache_clear(), but running those doesn't solve my issue.
I'm doing: words = page.extract_words( x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"] )
on every page of different PDF files, and the memory keeps increasing.
Code to reproduce the problem
words = page.extract_words( x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"] )
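For completeness, a minimal sketch of the kind of loop that reproduces the growth for me (the file paths are placeholders; any set of multi-page PDFs should do):

```python
import pdfplumber

pdf_paths = ["a.pdf", "b.pdf"]  # placeholder paths

for path in pdf_paths:
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(
                x_tolerance=1,
                y_tolerance=1,
                extra_attrs=["fontname", "size"],
            )
            # Workarounds suggested in past issues; they do not release the memory for me.
            page.get_textmap.cache_clear()
            page.close()
```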
Expected behavior
Memory should not constantly increase (or should at least be released at some point) after every call to extract_words.
Actual behavior
The memory keeps increasing over time and is never released. I've run profiling with the memory-profiler package, and it all points towards the extract_words function not properly releasing memory.
Line Mem usage Increment Occurrences Line Contents
=============================================================
511 348.7 MiB 348.7 MiB 1 @profile
512 def extract_words(self, **kwargs: Any) -> T_obj_list:
513 349.1 MiB 0.4 MiB 1 return utils.extract_words(self.chars, **kwargs)
Next execution:
Line Mem usage Increment Occurrences Line Contents
=============================================================
511 349.4 MiB 349.4 MiB 1 @profile
512 def extract_words(self, **kwargs: Any) -> T_obj_list:
513 350.1 MiB 0.6 MiB 1 return utils.extract_words(self.chars, **kwargs)
As you can see, the memory is never released.
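For reference, the per-line tables above come from temporarily decorating pdfplumber's Page.extract_words with memory_profiler's @profile. A complementary way to watch the overall growth (a rough sketch, with a placeholder path) is memory_profiler's memory_usage helper:

```python
import pdfplumber
from memory_profiler import memory_usage

def run(path: str) -> None:
    # Same extraction loop as above, wrapped so memory_usage can sample it.
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page.extract_words(
                x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
            )

# Prints RSS samples (in MiB) taken while run() executes; repeated calls
# show the baseline creeping upward instead of returning to its starting point.
print(memory_usage((run, ("example.pdf",), {})))
print(memory_usage((run, ("example.pdf",), {})))
```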
Environment
- pdfplumber version: 0.11.4
- Python version: 3.10
- OS: Windows and Linux (tested on both)
Thank you for flagging, @maximeBAY. I believe this might be related to other memory issues flagged elsewhere. But just to check: do you see similar memory growth if you replace page.extract_words(...) with just len(page.chars), or only when using .extract_words(...)? A rough sketch of that check is below.
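Something along these lines (a sketch with a placeholder path) would distinguish the two cases:

```python
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        # If memory also grows here, the issue is in materializing page.chars,
        # not in utils.extract_words itself.
        print(len(page.chars))
```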