javacpp-presets
Memory consumption with tesseract
Hey! While running our app in a container, we've observed that the PDF parser often exceeds the memory limit we've set in Kubernetes. I started tweaking the settings, but I still have trouble understanding why the memory consumption is so high and why it doesn't decrease over time.
The code looks like this:
public String ocrPdfAndExtractText(PDDocument pdDocument) {
    PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
    StringBuilder out = new StringBuilder();
    Instant start = Instant.now();
    try (TessBaseAPI tessApi = new TessBaseAPI()) {
        tessApi.Init(tessdataPath, "nor");
        for (int page = 0; page < pdDocument.getNumberOfPages(); page++) {
            log.debug("OCR page {} of {}", page + 1, pdDocument.getNumberOfPages());
            File temp = convertPageToImage(pdfRenderer, page);
            PIX image = pixRead(temp.getAbsolutePath());
            tessApi.SetImage(image);
            BytePointer outText = tessApi.GetUTF8Text();
            out.append(outText.getString());
            outText.deallocate();
            pixDestroy(image);
            temp.delete();
        }
        tessApi.End();
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    log.debug("OCR of PDF file took {} ms", Duration.between(start, Instant.now()).toMillis());
    return out.toString();
}

private File convertPageToImage(PDFRenderer pdfRenderer, int page) throws IOException {
    BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
    File temp = File.createTempFile("tempfile_" + page, ".png");
    ImageIO.write(bim, "png", temp);
    return temp;
}
While testing locally, I run the app with these VM settings:
Min. Heap Size: 300.00M
Max. Heap Size: 768.00M
Using VM: OpenJDK 64-Bit Server VM
The app starts out at ~450 MB in the process manager. After 4 or 5 OCR runs over several files, Activity Monitor on macOS shows ~1.6 GB, which is double the maximum heap size. After several more runs it climbs to 2 GB, even though the heap maximum is 768 MB and actual heap usage is around 400 MB.
If I run GC manually through the profiler, it only reclaims memory from the heap; the process is still at 1.6-2 GB.
The jcmd output at startup:
Total: reserved=2040MB, committed=674MB
- Java Heap (reserved=768MB, committed=456MB)
(mmap: reserved=768MB, committed=456MB)
- Class (reserved=1025MB, committed=11MB)
(classes #15745 +1)
( instance classes #14769 +1, array classes #976)
(malloc=1MB #30343 +9)
(mmap: reserved=1024MB, committed=9MB)
( Metadata)
( reserved=56MB, committed=56MB)
( used=56MB)
( waste=0MB =0.40%)
( Class space)
( reserved=1024MB, committed=9MB)
( used=9MB)
( waste=0MB =2.39%)
- Thread (reserved=42MB, committed=42MB)
(thread #0)
(stack: reserved=42MB, committed=42MB)
- Code (reserved=49MB, committed=12MB)
(malloc=1MB #7227 +3)
(mmap: reserved=48MB, committed=11MB)
- GC (reserved=59MB, committed=58MB)
(malloc=31MB #142)
(mmap: reserved=28MB, committed=27MB)
- Other (reserved=4MB, committed=4MB)
(malloc=4MB #29)
- Symbol (reserved=16MB, committed=16MB)
(malloc=14MB #387379 +11)
(arena=2MB #1)
- Native Memory Tracking (reserved=7MB, committed=7MB)
(tracking overhead=7MB)
- Shared class space (reserved=12MB, committed=12MB)
(mmap: reserved=12MB, committed=12MB)
- Metaspace (reserved=56MB, committed=56MB)
(mmap: reserved=56MB, committed=56MB)
And after running some of the OCR processes:
Total: reserved=2083MB +43MB, committed=1032MB +358MB
- Java Heap (reserved=768MB, committed=768MB +312MB)
(mmap: reserved=768MB, committed=768MB +312MB)
- Class (reserved=1026MB, committed=13MB +2MB)
(classes #18477 +2733)
( instance classes #17400 +2632, array classes #1077 +101)
(malloc=2MB #36877 +6543)
(mmap: reserved=1024MB, committed=11MB +2MB)
( Metadata)
( reserved=72MB +16MB, committed=68MB +12MB)
( used=67MB +12MB)
( waste=0MB =0.48%)
( Class space)
( reserved=1024MB, committed=11MB +2MB)
( used=11MB +2MB)
( waste=0MB =1.93%)
- Thread (reserved=56MB +14MB, committed=56MB +14MB)
(thread #0)
(stack: reserved=56MB +14MB, committed=56MB +14MB)
- Code (reserved=50MB, committed=16MB +4MB)
(malloc=1MB #10169 +2945)
(mmap: reserved=48MB, committed=15MB +4MB)
- GC (reserved=59MB, committed=59MB +1MB)
(malloc=31MB #142)
(mmap: reserved=28MB, committed=28MB +1MB)
- Other (reserved=13MB +9MB, committed=13MB +9MB)
(malloc=13MB +9MB #43 +14)
- Symbol (reserved=18MB +3MB, committed=18MB +3MB)
(malloc=16MB +2MB #455962 +68594)
(arena=2MB #1)
- Native Memory Tracking (reserved=8MB +1MB, committed=8MB +1MB)
(tracking overhead=8MB +1MB)
- Shared class space (reserved=12MB, committed=12MB)
(mmap: reserved=12MB, committed=12MB)
- Metaspace (reserved=72MB +16MB, committed=68MB +12MB)
(mmap: reserved=72MB +16MB, committed=68MB +12MB)
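For anyone reproducing these numbers: the `+NNMB` diff columns above come from Native Memory Tracking. A typical way to capture them (the launch command and PID are placeholders for your own setup):

```shell
# Enable NMT when starting the JVM (summary mode has low overhead)
java -XX:NativeMemoryTracking=summary -jar app.jar

# Record a baseline after startup, then diff it after some OCR runs
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff
```

Worth noting: NMT only tracks the JVM's own allocations. Memory malloc'd directly by native libraries like Tesseract and Leptonica does not appear in these dumps at all, which would be consistent with RSS growing while the NMT total stays roughly flat.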
It doesn't look like it should grow that rapidly; the off-heap memory reported here looks fine to me.
Should I deallocate or close something else? Is there something allocated that GC is not cleaning up? Do you see any obvious mistakes here?
Try using PointerScope, it's designed for that: http://bytedeco.org/news/2018/07/17/bytedeco-as-distribution/
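A per-page scope keeps the pointers allocated inside the loop from outliving their iteration. A minimal sketch of what that could look like in the code above (same imports as the original; `PointerScope` lives in `org.bytedeco.javacpp` and registers every `Pointer` instantiated while it is open, deallocating them on close, even if an exception is thrown):

```java
import org.bytedeco.javacpp.PointerScope;

for (int page = 0; page < pdDocument.getNumberOfPages(); page++) {
    // Everything wrapped in a Pointer during this iteration is attached
    // to the scope and released when the try block exits.
    try (PointerScope scope = new PointerScope()) {
        File temp = convertPageToImage(pdfRenderer, page);
        PIX image = pixRead(temp.getAbsolutePath());
        tessApi.SetImage(image);
        BytePointer outText = tessApi.GetUTF8Text();
        out.append(outText.getString());
        // The PIX is owned by Leptonica, so freeing it explicitly is
        // still the safe choice even with a scope open.
        pixDestroy(image);
        temp.delete();
    }
}
```

This is a sketch, not a drop-in fix: whether the scope alone releases a given object depends on whether its wrapper carries a deallocator, so keeping the explicit `pixDestroy()` call costs nothing and avoids relying on that.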
Hey!
I've surrounded my code with try (PointerScope scope = new PointerScope()) { ... }, and I believe it saves a little memory, but in a case like the above, with the heap around ~700 MB, the total memory of the JVM process is still around 1.7 GB.
Is that common memory consumption here with Tesseract?
Please try setting the "org.bytedeco.javacpp.nopointergc" system property to "true".
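For reference, JavaCPP reads that property when the Pointer class is initialized, so the command line is the safest place to set it (the launch command below is a placeholder):

```shell
# Set the property before any JavaCPP class loads, rather than via
# System.setProperty() somewhere in application code.
java -Dorg.bytedeco.javacpp.nopointergc=true -jar app.jar
```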
I tried that too, but the memory consumption is unchanged. Could Tesseract simply need ~1 GB of memory outside the heap?
I guess it's possible it could use that much off-heap memory, but there's nothing in JavaCPP or Tesseract that would use that much heap memory, no. What does your profiler say is using all that memory?
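One cheap data point before reaching for a profiler: JavaCPP keeps counters for the native memory it knows about, so logging them between OCR runs shows whether the growth is in JavaCPP-tracked allocations or somewhere it can't see (e.g. Tesseract's own mallocs or allocator fragmentation). A minimal sketch, assuming only the javacpp artifact on the classpath:

```java
import org.bytedeco.javacpp.Pointer;

public class NativeMemoryLog {
    public static void main(String[] args) {
        // Bytes of native memory currently tracked by JavaCPP deallocators.
        System.out.println("JavaCPP-tracked bytes: " + Pointer.totalBytes());
        // Resident set size of the whole process, as measured by JavaCPP.
        System.out.println("Physical (RSS) bytes:  " + Pointer.physicalBytes());
    }
}
```

If `totalBytes()` stays small while `physicalBytes()` keeps climbing, the growth is happening outside anything JavaCPP manages.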