javacpp-presets

Memory consumption with tesseract

Open bskorka opened this issue 1 year ago • 5 comments

Hey! While running our app in a container, we've observed that the PDF parser app often exceeds the memory limit we've set in Kubernetes. I started tweaking the settings, but I still have trouble understanding why the memory consumption is so high and why it doesn't decrease over time.

The code looks like this:

public String ocrPdfAndExtractText(PDDocument pdDocument) {
        PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
        StringBuilder out = new StringBuilder();

        Instant start = Instant.now();
        try (TessBaseAPI tessApi = new TessBaseAPI()) {
            tessApi.Init(tessdataPath, "nor");
            for (int page = 0; page < pdDocument.getNumberOfPages(); page++) {
                log.debug("OCR page {} of {}", page + 1, pdDocument.getNumberOfPages());
                File temp = convertPageToImage(pdfRenderer, page);

                PIX image = pixRead(temp.getAbsolutePath());

                tessApi.SetImage(image);
                BytePointer outText = tessApi.GetUTF8Text();
                out.append(outText.getString());

                outText.deallocate();
                pixDestroy(image);
                temp.delete();
            }
            tessApi.End();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        log.debug("OCR of PDF file took {} ms", Duration.between(start, Instant.now()).toMillis());

        return out.toString();
    }

    private File convertPageToImage(PDFRenderer pdfRenderer, int page) throws IOException {
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
        File temp = File.createTempFile("tempfile_" + page, ".png");
        ImageIO.write(bim, "png", temp);
        return temp;
    }
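
For a sense of scale, here is a rough sketch of how large a single rendered page is once decoded. It assumes an A4 page (210 x 297 mm) and 4 bytes per pixel, which is typical for an int-packed RGB raster; neither figure comes from the post, and `PageRasterEstimate` is a hypothetical helper, not part of PDFBox or Tesseract.

```java
/** Rough estimate of the uncompressed size of one rendered page. */
final class PageRasterEstimate {

    /** Pixels along one edge for a paper dimension in millimetres at the given DPI. */
    static long pixels(double millimetres, int dpi) {
        return Math.round(millimetres / 25.4 * dpi);
    }

    /** Uncompressed raster size in bytes for the given page size, DPI and pixel width. */
    static long rasterBytes(double widthMm, double heightMm, int dpi, int bytesPerPixel) {
        return pixels(widthMm, dpi) * pixels(heightMm, dpi) * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumptions: A4 paper and 4 bytes per pixel; 300 DPI is taken from the post.
        long bytes = rasterBytes(210, 297, 300, 4);
        System.out.printf("%d x %d px, %.1f MiB per uncompressed copy%n",
                pixels(210, 300), pixels(297, 300), bytes / (1024.0 * 1024.0));
    }
}
```

Under those assumptions one page is roughly 33 MiB uncompressed, and during OCR at least two copies may exist at once: the BufferedImage on the Java heap and the decoded PIX in native memory.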

While testing locally I am running the app with VM settings:

VM settings:
    Min. Heap Size: 300.00M
    Max. Heap Size: 768.00M
    Using VM: OpenJDK 64-Bit Server VM

The app starts out at ~450MB in the process manager. After running 4 or 5 OCR jobs on several files, Activity Monitor on Mac shows ~1.6GB, about double the maximum heap size. After running several more it goes up to 2GB, while the heap is at a maximum of 768MB and real memory usage is around 400MB.

If I run GC manually through the profiler, it only reclaims memory from the heap; the process stays at 1.6-2GB.

The output from jcmd at the start:

Total: reserved=2040MB, committed=674MB

-                 Java Heap (reserved=768MB, committed=456MB)
                            (mmap: reserved=768MB, committed=456MB)
 
-                     Class (reserved=1025MB, committed=11MB)
                            (classes #15745 +1)
                            (  instance classes #14769 +1, array classes #976)
                            (malloc=1MB #30343 +9)
                            (mmap: reserved=1024MB, committed=9MB)
                           : (  Metadata)
                            (    reserved=56MB, committed=56MB)
                            (    used=56MB)
                            (    waste=0MB =0.40%)
                           : (  Class space)
                            (    reserved=1024MB, committed=9MB)
                            (    used=9MB)
                            (    waste=0MB =2.39%)
 
-                    Thread (reserved=42MB, committed=42MB)
                            (thread #0)
                            (stack: reserved=42MB, committed=42MB)
 
-                      Code (reserved=49MB, committed=12MB)
                            (malloc=1MB #7227 +3)
                            (mmap: reserved=48MB, committed=11MB)
 
-                        GC (reserved=59MB, committed=58MB)
                            (malloc=31MB #142)
                            (mmap: reserved=28MB, committed=27MB)
 
-                     Other (reserved=4MB, committed=4MB)
                            (malloc=4MB #29)
 
-                    Symbol (reserved=16MB, committed=16MB)
                            (malloc=14MB #387379 +11)
                            (arena=2MB #1)
 
-    Native Memory Tracking (reserved=7MB, committed=7MB)
                            (tracking overhead=7MB)
 
-        Shared class space (reserved=12MB, committed=12MB)
                            (mmap: reserved=12MB, committed=12MB)
 
-                 Metaspace (reserved=56MB, committed=56MB)
                            (mmap: reserved=56MB, committed=56MB)

And after running some of the OCR processes:

Total: reserved=2083MB +43MB, committed=1032MB +358MB

-                 Java Heap (reserved=768MB, committed=768MB +312MB)
                            (mmap: reserved=768MB, committed=768MB +312MB)
 
-                     Class (reserved=1026MB, committed=13MB +2MB)
                            (classes #18477 +2733)
                            (  instance classes #17400 +2632, array classes #1077 +101)
                            (malloc=2MB #36877 +6543)
                            (mmap: reserved=1024MB, committed=11MB +2MB)
                           : (  Metadata)
                            (    reserved=72MB +16MB, committed=68MB +12MB)
                            (    used=67MB +12MB)
                            (    waste=0MB =0.48%)
                           : (  Class space)
                            (    reserved=1024MB, committed=11MB +2MB)
                            (    used=11MB +2MB)
                            (    waste=0MB =1.93%)
 
-                    Thread (reserved=56MB +14MB, committed=56MB +14MB)
                            (thread #0)
                            (stack: reserved=56MB +14MB, committed=56MB +14MB)
 
-                      Code (reserved=50MB, committed=16MB +4MB)
                            (malloc=1MB #10169 +2945)
                            (mmap: reserved=48MB, committed=15MB +4MB)
 
-                        GC (reserved=59MB, committed=59MB +1MB)
                            (malloc=31MB #142)
                            (mmap: reserved=28MB, committed=28MB +1MB)
 
-                     Other (reserved=13MB +9MB, committed=13MB +9MB)
                            (malloc=13MB +9MB #43 +14)
 
-                    Symbol (reserved=18MB +3MB, committed=18MB +3MB)
                            (malloc=16MB +2MB #455962 +68594)
                            (arena=2MB #1)
 
-    Native Memory Tracking (reserved=8MB +1MB, committed=8MB +1MB)
                            (tracking overhead=8MB +1MB)
 
-        Shared class space (reserved=12MB, committed=12MB)
                            (mmap: reserved=12MB, committed=12MB)
 
-                 Metaspace (reserved=72MB +16MB, committed=68MB +12MB)
                            (mmap: reserved=72MB +16MB, committed=68MB +12MB)

It does not look like it should grow that rapidly - the off-heap memory looks fine in my opinion.
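
One way to read these numbers: Native Memory Tracking only accounts for memory the JVM itself allocates, so memory that a native library such as Tesseract or Leptonica obtains through its own allocator does not appear in the report. The sketch below subtracts the NMT committed total from the process size seen by the OS. `NmtGap` is a hypothetical helper, and 1600 MB is only an approximation of the ~1.6GB mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Compares an NMT "Total:" line with the process size reported by the OS. */
final class NmtGap {
    private static final Pattern COMMITTED = Pattern.compile("committed=(\\d+)MB");

    /** Extracts the first committed=<n>MB value from an NMT summary line. */
    static long committedMb(String nmtLine) {
        Matcher m = COMMITTED.matcher(nmtLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no committed= field in: " + nmtLine);
        }
        return Long.parseLong(m.group(1));
    }

    /** Process memory (in MB) that the NMT total does not account for. */
    static long untrackedMb(long processMb, String nmtTotalLine) {
        return processMb - committedMb(nmtTotalLine);
    }

    public static void main(String[] args) {
        String total = "Total: reserved=2083MB +43MB, committed=1032MB +358MB";
        // 1600 MB stands in for the ~1.6GB shown by the OS (approximate).
        System.out.println("Not tracked by NMT: ~" + untrackedMb(1600, total) + " MB");
    }
}
```

The result is only indicative, since committed memory is not necessarily resident, but a gap of several hundred MB would point at native allocations made outside the JVM rather than at the heap or metaspace.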

Should I deallocate or close something else? Is there something still allocated that GC is not cleaning up? Do you see any obvious mistakes here?

bskorka, Jun 13 '23 22:06

Try to use PointerScope, it's designed for that: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/

saudet, Jun 13 '23 23:06

Hey! I've wrapped my code in try (PointerScope scope = new PointerScope()) { ... }, and I believe it saves a little memory, but in a case like the one above, with a heap of around ~700MB, the total memory of the JVM process is around 1.7GB. Is that typical memory consumption with Tesseract?

bskorka, Jun 14 '23 09:06

Please try to set the "org.bytedeco.javacpp.nopointergc" system property to "true".
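
A system property like this is usually passed on the command line as -Dorg.bytedeco.javacpp.nopointergc=true. As a minimal sketch, it can also be set in code, on the assumption that JavaCPP reads the property once when its classes are first loaded, so the call has to run before any JavaCPP class is touched. `NoPointerGcConfig` is a hypothetical class name.

```java
/** Sets the JavaCPP property from code; must run before JavaCPP classes load (assumption). */
final class NoPointerGcConfig {
    static final String KEY = "org.bytedeco.javacpp.nopointergc";

    /** Asks JavaCPP not to rely on the garbage collector for native deallocation. */
    static void disablePointerGc() {
        System.setProperty(KEY, "true");
    }

    public static void main(String[] args) {
        disablePointerGc();
        System.out.println(KEY + "=" + System.getProperty(KEY));
        // The application's first JavaCPP/Tesseract call would go after this point.
    }
}
```

With this setting, native memory should be released only through explicit calls such as deallocate(), close(), or a surrounding PointerScope, so any pointer that is not released explicitly stays allocated.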

saudet, Jun 14 '23 12:06

I also tried that, but the memory consumption is unchanged. Could Tesseract simply need ~1GB of memory outside the heap?

bskorka, Jun 19 '23 06:06

I guess it's possible it could use that much off-heap memory, but there's nothing in JavaCPP or Tesseract that would use that much heap memory, no. What does your profiler say is using all that memory?
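
If no profiler is at hand, a quick in-process check can at least separate heap from JVM-managed non-heap memory. The sketch below uses the standard java.lang.management API; `JvmMemorySnapshot` is a hypothetical helper. Note that the non-heap figure covers JVM areas such as metaspace and the code cache, not memory allocated by native libraries.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/** Logs heap and JVM-managed non-heap usage, e.g. before and after an OCR run. */
final class JvmMemorySnapshot {

    /** Converts bytes to whole mebibytes. */
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Returns a one-line summary of current heap and non-heap usage. */
    static String describe() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        MemoryUsage nonHeap = bean.getNonHeapMemoryUsage();
        return String.format(
                "heap used=%dMB committed=%dMB, non-heap used=%dMB committed=%dMB",
                toMb(heap.getUsed()), toMb(heap.getCommitted()),
                toMb(nonHeap.getUsed()), toMb(nonHeap.getCommitted()));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If both figures stay flat across OCR runs while the process size reported by the OS keeps growing, the growth is most likely in native allocations that the JVM does not manage.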

saudet, Jun 21 '23 04:06