
Memory Leak

Open harinderbachhal opened this issue 9 years ago • 36 comments

Hi, I am using this to get PDF text. It works well, but there is a memory leak in it. I am using it as follows:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent($getdata[0]);
$pages = $pdf->getPages();
foreach ($pages as $page) {
    echo $page->getText();
}
unset($parser);
unset($pdf);
unset($page);
unset($pages);
echo '<h1>Memory =', round(memory_get_usage() / 1000), ' KB</h1><br>';

Memory =63226 KB

How do I fix this? How can I release the memory that is being used?

harinderbachhal avatar May 18 '16 14:05 harinderbachhal

It's really hard to fix such an issue. There are many circular references between objects which block garbage collection. Usually we use "__destruct" to break such references by setting properties to "null" or unsetting them, to help the garbage collector.
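For illustration, a minimal sketch of that pattern (the class and property names are made up, not PdfParser's real ones):

    class Document
    {
        private $pages = array();

        public function addPage(Page $page)
        {
            $page->setDocument($this); // back-reference => circular reference
            $this->pages[] = $page;
        }

        public function __destruct()
        {
            // Break the cycle so the garbage collector can reclaim both objects.
            foreach ($this->pages as $page) {
                $page->setDocument(null);
            }
            $this->pages = array();
        }
    }

    class Page
    {
        private $document;

        public function setDocument($document)
        {
            $this->document = $document;
        }
    }

Note that with a genuine cycle, __destruct may only run once the cycle collector kicks in or the script ends, so an explicit gc_collect_cycles() call can still be needed.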

smalot avatar May 18 '16 15:05 smalot

Where do I use __destruct in this library? Only in Parser.php, or do I have to read the full library and make changes?

harinderbachhal avatar May 18 '16 15:05 harinderbachhal

I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP 7 (or PHP >= 5.3), you can use gc_collect_cycles(); to call the garbage collector in order to delete objects with circular references. The memory usage goes back to normal for me after this call.

ghost avatar Aug 23 '16 17:08 ghost

Having the same issue. Any solution for this?

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile(asset($pdf_file)); // [ Stuck with "Allowed memory size of 1073741824 bytes exhausted (tried to allocate 26811 bytes)" message here. ]
$pages = $pdf->getPages(); 

vishaldevrepublic avatar Aug 31 '16 11:08 vishaldevrepublic

Thank you @citionzeno, your suggestion works for me.

yapsr avatar Sep 19 '16 15:09 yapsr

Glad it helped. It works if you are trying to parse multiple small files. For a single large file, however, like @vishaldevrepublic's case, I don't know what to do. Some deep work inside the library might be needed.

ghost avatar Sep 19 '16 23:09 ghost

I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP 7 (or PHP >= 5.3), you can use gc_collect_cycles(); to call the garbage collector in order to delete objects with circular references. The memory usage goes back to normal for me after this call.

Where do I have to do that garbage collection in the code, please?

amineharoun avatar Dec 20 '19 15:12 amineharoun

This suggestion only works for parsing multiple small files. In this case, you can call gc_collect_cycles() after parsing each file, and before parsing the next one. This trick however does not provide a solution to the case where you want to parse a single large file.
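A minimal sketch of that workaround (the file list here is hypothetical):

    $files = array('a.pdf', 'b.pdf', 'c.pdf'); // many small PDFs

    foreach ($files as $file) {
        $parser = new \Smalot\PdfParser\Parser();
        $pdf = $parser->parseFile($file);

        foreach ($pdf->getPages() as $page) {
            echo $page->getText();
        }

        // Drop our references, then collect the circular ones left behind
        // before parsing the next file.
        unset($pdf, $parser);
        gc_collect_cycles();

        echo round(memory_get_usage() / 1000) . " KB\n";
    }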

fm89 avatar Dec 20 '19 15:12 fm89

Thanks for the reply. Can you tell me how I can parse a big file (60MB) that has 90 pages? My script crashes with a 503, and my CPU reaches 100%.

amineharoun avatar Dec 20 '19 16:12 amineharoun

This package does not appear to be a good solution for large files. See #169 also.

fm89 avatar Dec 20 '19 17:12 fm89

Can you tell me other solutions to use, please? (PHP)

amineharoun avatar Dec 20 '19 17:12 amineharoun

Hi, we have a memory leak with specific PDFs, like: Allowed memory size of 134217728 bytes exhausted (tried to allocate 2097160 bytes) in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 197

If I open the affected PDF in macOS's Preview.app and export it there with "Export as PDF", the error is gone on the next try. I could not find the difference between the PDFs so far. The exported one is even bigger, but seems to lack any Adobe Acrobat metadata (in hex editor view).

Jurek-Raben avatar Dec 17 '20 16:12 Jurek-Raben

Can anyone provide a PDF file which causes this problem please? It must be free of charge (and other obligations) and will be part of our test environment.

An alternative would be the content of the $param you gave to $parser->parseContent($param).

k00ni avatar Dec 30 '20 10:12 k00ni

Is it related to #372?

k00ni avatar Dec 30 '20 10:12 k00ni

@k00ni I have the same issue as @Jurek-Raben: there is a memory leak (memory exhausted error in Font.php, line 189) if I parse a file with Khmer language (the official language of Cambodia) characters inside, or a certain file with a big map image inside (12MB size). Looking at the metadata, they were both created by Microsoft Word 2016; if I re-save the PDFs with Preview, then the parser works as expected.

Tried the solution in #372, but that did not resolve the issue.

Sadly I also can't share the PDFs publicly.

nkoporec avatar Jan 04 '21 11:01 nkoporec

Cannot share the PDF either, if it will be publicly used...

Jurek-Raben avatar Jan 07 '21 03:01 Jurek-Raben

If you can spare the time you could try to find a minimal example (=string) to trigger the problem. Please take the following code:

$parser = new \Smalot\PdfParser\Parser();

// load PDF file content
$data = file_get_contents('/path/to/pdf');

// give PDF content to function and parse it
$pdf = $parser->parseContent($data); // <= should trigger the leak

If your PDF triggers the leak, try to reduce the content of $data as much as possible. After you get it down to a reasonable length (that's up to you), post it here. We will use it in our tests to reproduce the problem.

Here is a good example how that could look: https://github.com/smalot/pdfparser/issues/372#issuecomment-754025473

k00ni avatar Jan 07 '21 08:01 k00ni

In my scripts, I parse a lot of PDFs and after a while, the out of memory error occurs. Continuing the script with the last PDF from the previous batch, the error occurs some PDFs later. So in my view, the error is not reproducible with a single PDF.

yapsr avatar Jan 07 '21 09:01 yapsr

You are right, I remember the post from @ghost (https://github.com/smalot/pdfparser/issues/104#issuecomment-241813856). He mentioned something about memory, which is never freed.

So in my view, the error is not reproducible with a single PDF.

I just wanna make sure there is no infinite loop or something which causes this problem.

@Jurek-Raben said:

Hi, we have a memory leak with specific PDFs, like: Allowed memory size of 134217728 bytes exhausted (tried to allocate 2097160 bytes) in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 197

Font.php around line 197 seems fine. In my opinion it could be caused by either an infinite loop (or recursion) or memory, which is used but never freed.

@yapsr Do you read your PDFs in one or multiple script runs?

k00ni avatar Jan 07 '21 11:01 k00ni

@yapsr Do you read your PDFs in one or multiple script runs?

I try to read multiple small PDFs (200kB in size) with a background image (4961x7016px PNG) and some text in a single script, but the script always crashes after a seemingly random number of PDF reads in "vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357" with this message:

" exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)' in /home/user/project/vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357"

There I find this code:


        /**
         * FlateDecode
         * Decompresses data encoded using the zlib/deflate compression method, reproducing the original text or binary data.
         * @param $data (string) Data to decode.
         * @return Decoded data string.
         * @since 1.0.000 (2011-05-23)
         * @public static
         */
        public static function decodeFilterFlateDecode($data) {   
                // initialize string to return
                $decoded = @gzuncompress($data);
                if ($decoded === false) {
                        self::Error('decodeFilterFlateDecode: invalid code');
                }
                return $decoded;
        }

So it might have something to do with gzuncompress().
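As a rough illustration of why that would matter (this is not code from the issue, just a demonstration of gzuncompress()): highly compressible data, such as a large flat background image, deflates to almost nothing, so a small PDF stream can still expand to a huge string in memory.

    // Needs a memory_limit well above 300M to run.
    $raw = str_repeat("\0", 100 * 1024 * 1024);   // ~100 MB of identical bytes
    $compressed = gzcompress($raw);               // shrinks to a tiny fraction of that
    printf("compressed: %d bytes\n", strlen($compressed));

    $decoded = gzuncompress($compressed);         // expands back to ~100 MB
    printf("decoded: %d bytes\n", strlen($decoded));
    printf("peak usage: %d MB\n", round(memory_get_peak_usage() / 1048576));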

The server runs PHP 5.6.27-1+deb.sury.org~trusty+1 (cli)

Here is some relevant part of my log files:

[2020-12-11 11:11:01] production.DEBUG: ParsePDF::select() : Exporting 21 files... [] []
[2020-12-11 11:11:01] production.DEBUG: ParsePDF::handle() : handling file 12345664, using 12MB [] []
[2020-12-11 11:11:01] production.DEBUG: ParsePDF::handle() : handling file 12345674, using 126MB [] []
[2020-12-11 11:11:02] production.DEBUG: ParsePDF::handle() : handling file 12345669, using 237MB [] []
[2020-12-11 11:11:02] production.DEBUG: ParsePDF::handle() : handling file 12345684, using 349MB [] []
[2020-12-11 11:11:03] production.DEBUG: ParsePDF::handle() : handling file 12345696, using 349MB [] []
[2020-12-11 11:11:04] production.DEBUG: ParsePDF::handle() : handling file 12345665, using 460MB [] []
[2020-12-11 11:11:04] production.ERROR: exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)'...

The script seems to add about 111MB of memory usage per PDF file.

When converting the PNG background image to BMP format manually, it turns out to be about 104MB in size. So that does look related to the memory leak.

Hope this helps locating the problem.

yapsr avatar Jan 07 '21 16:01 yapsr

Which version of PDFParser do you use?

You mentioned vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357, but we removed TCPDF a few versions ago. Can you try again with our latest version 0.18.0 please? I remember that I removed the @ in $decoded = @gzuncompress($data); to allow error reporting.

k00ni avatar Jan 08 '21 07:01 k00ni

Oops. We were using pdfparser v0.9.25. However, removing tcpdf and updating to v0.18.1, I still get the memory usage error. Adding the gc_collect_cycles() workaround prevents the memory limit exception.

yapsr avatar Jan 12 '21 22:01 yapsr

Does something speak against gc_collect_cycles() after each parseFile call? Besides finding the root cause of this problem of course.

CC @Connum @j0k3r

k00ni avatar Jan 13 '21 08:01 k00ni

Using gc_collect_cycles() is just the workaround. Without a proper script + pdf to reproduce the leak, I think we won't be able to properly fix it ...

j0k3r avatar Jan 13 '21 08:01 j0k3r

I managed to create a PDF file to reproduce the issue:

document_with_text_and_png_image.pdf

    $file = 'document_with_text_and_png_image.pdf';
    $loops = 10;
    for ($i = 0; $i < $loops; $i++) {
        $parser = new \Smalot\PdfParser\Parser(); // v0.18.1
        $pdf = $parser->parseFile($file);
        echo memory_get_usage() . PHP_EOL;
    }

Memory usage increases by more than 100MB per loop. The PDF file and its included image are relatively small (111kB). The only cause I can think of is the PNG image, which is really large in byte size when uncompressed.

yapsr avatar Jan 13 '21 12:01 yapsr

I've run the script 10 times using Blackfire, but I don't really know how to see where the leak is: https://blackfire.io/profiles/806c9126-a571-4472-8af6-664a8e34a5b7/graph (might only be available 24h I think)

But yeah, it leaks a lot:

$ php try.php
105778808
210245032
314711224
419177416
523643608
628109800
732575992
837050376
941516568
1045982760
$

j0k3r avatar Jan 13 '21 13:01 j0k3r

If it is indeed the image, any idea how to work around it? As there's no OCR functionality built into the library, it could simply discard images completely - but I'm not deep enough into the core of the library to see how and where that could be managed, or if it is a feasible approach at all.

Connum avatar Jan 13 '21 13:01 Connum

@smalot do you have an idea how to fix this? It "soft" blocks #383.

k00ni avatar Mar 09 '21 13:03 k00ni

I am having a memory leak problem analyzing a single PDF. Of the hundreds of PDFs I have, some of them cause a memory leak that explodes at the memory limit (even small PDFs with less than 1 MB). In my case this happens because internally some code enters an infinite loop. The infinite loop happens when the method getXrefData calls the method decodeXref, and the method decodeXref calls the method getXrefData back (getXrefData is called first) - /Smalot/PdfParser/RawData/RawDataParser.php.

So I am assuming we are talking about two problems.

  1. Testing a bunch of files causes some memory leak that may or may not reach the memory limit.
  2. Some individual PDFs can themselves cause a memory leak that will grow until the memory explodes (this is my problem).

Unfortunately I can't provide any of my problematic PDFs because they are private data, but I thought this info could help the devs in some way.
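In case it helps: a hypothetical sketch (not the library's actual code) of how that kind of getXrefData()/decodeXref() mutual recursion can be guarded, by remembering which xref offsets were already visited:

    class XrefWalker
    {
        /** @var array<int, bool> xref offsets already processed */
        private $visitedOffsets = array();

        public function getXrefData($pdfData, $offset)
        {
            if (isset($this->visitedOffsets[$offset])) {
                return array(); // already seen this offset: stop instead of recursing forever
            }
            $this->visitedOffsets[$offset] = true;

            // ... locate the xref section at $offset, then delegate:
            return $this->decodeXref($pdfData, $offset);
        }

        private function decodeXref($pdfData, $offset)
        {
            // ... parse entries; a /Prev entry pointing back to an already
            // visited offset would otherwise re-enter getXrefData() endlessly.
            $prevOffset = 0; // placeholder for a parsed /Prev value
            return $prevOffset > 0 ? $this->getXrefData($pdfData, $prevOffset) : array();
        }
    }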

llagerlof avatar Jun 18 '21 15:06 llagerlof

@llagerlof Are you willing to dive a little bit into the code to get us some debug information?

Please paste the parameter values of the first getXrefData call, which triggers an infinite loop. Parameters should be as small as possible, just enough to trigger the memory leak.

Thanks in advance.

k00ni avatar Jun 21 '21 07:06 k00ni