Potential memory leak with extract_words function
Describe the bug
There seems to be a memory leak when running the extract_words function. I've explored past issues that suggested calling page.close() or page.get_textmap.cache_clear(), but running those doesn't solve my issue.
I'm doing: words = page.extract_words( x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"] )
on every page of different PDF files, and the memory keeps increasing.
Code to reproduce the problem
words = page.extract_words( x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"] )
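For completeness, a minimal sketch of the kind of loop that reproduces the growth for me (the file paths are placeholders; any set of multi-page PDFs should do):

```python
import pdfplumber

pdf_paths = ["a.pdf", "b.pdf"]  # placeholder paths

for path in pdf_paths:
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(
                x_tolerance=1,
                y_tolerance=1,
                extra_attrs=["fontname", "size"],
            )
            # Workarounds suggested in past issues; they do not release the memory for me.
            page.get_textmap.cache_clear()
            page.close()
```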
Expected behavior
Memory should not constantly increase (or should at least be released at some point) after every call to extract_words.
Actual behavior
The memory keeps increasing over time and is never released. I've run profiling with the memory-profiler package, and it all points towards the extract_words function not properly releasing memory.
Line Mem usage Increment Occurrences Line Contents
=============================================================
511 348.7 MiB 348.7 MiB 1 @profile
512 def extract_words(self, **kwargs: Any) -> T_obj_list:
513 349.1 MiB 0.4 MiB 1 return utils.extract_words(self.chars, **kwargs)
Next execution:
Line Mem usage Increment Occurrences Line Contents
=============================================================
511 349.4 MiB 349.4 MiB 1 @profile
512 def extract_words(self, **kwargs: Any) -> T_obj_list:
513 350.1 MiB 0.6 MiB 1 return utils.extract_words(self.chars, **kwargs)
As you can see, the memory is never released.
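For reference, the per-line tables above come from temporarily decorating pdfplumber's Page.extract_words with memory_profiler's @profile. A complementary way to watch the overall growth (a rough sketch, with a placeholder path) is memory_profiler's memory_usage helper:

```python
import pdfplumber
from memory_profiler import memory_usage

def run(path: str) -> None:
    # Same extraction loop as above, wrapped so memory_usage can sample it.
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            page.extract_words(
                x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
            )

# Prints RSS samples (in MiB) taken while run() executes; repeated calls
# show the baseline creeping upward instead of returning to its starting point.
print(memory_usage((run, ("example.pdf",), {})))
print(memory_usage((run, ("example.pdf",), {})))
```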
Environment
- pdfplumber version: 0.11.4
- Python version: 3.10
- OS: Windows and Linux (tested on both)
Thank you for flagging, @maximeBAY. I believe this might be related to other memory issues flagged elsewhere. But just to check: do you see similar memory growth if you replace page.extract_words(...) with just len(page.chars), or only when using .extract_words(...)? A rough sketch of that check is below.
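Something along these lines (a sketch with a placeholder path) would distinguish the two cases:

```python
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        # If memory also grows here, the issue is in materializing page.chars,
        # not in utils.extract_words itself.
        print(len(page.chars))
```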