sbb_binarization OCR-D processor is leaky

When processing a document of 1.5k medium-sized pages (1-2 MP each), I observe a slow but steady increase in RSS from 4 GB to 14 GB after 1.2k pages, at which point the process is killed by the OS (Killed).

I do not see any Python bindings accessible to the input file loop which could accumulate such data without ever being GCed.
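One way to back up that observation is to measure whether Python-level allocations grow at all across the page loop. The following is only a minimal sketch using the standard-library `tracemalloc` module; `python_heap_growth`, `process_page`, and `pages` are hypothetical names for illustration, not part of sbb_binarization or OCR-D.

```python
import gc
import tracemalloc


def python_heap_growth(process_page, pages, warmup=1):
    """Return the net growth in bytes of Python-tracked allocations
    across the loop, measured after `warmup` pages have been processed.

    `process_page` and `pages` are placeholders for the real per-page
    call and the input file list.
    """
    tracemalloc.start()
    baseline = None
    try:
        for i, page in enumerate(pages):
            process_page(page)
            if i + 1 == warmup:
                # Take the baseline only after one-time setup costs
                # (imports, caches, model loading) have been paid.
                gc.collect()
                baseline, _ = tracemalloc.get_traced_memory()
        gc.collect()
        current, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    if baseline is None:  # fewer pages than the warm-up length
        baseline = current
    return current - baseline
```

`tracemalloc` only sees memory requested through Python's own allocators. If this number stays roughly flat while RSS keeps climbing, the growth is most likely in native allocations (for example inside TensorFlow or CUDA), which this check cannot attribute further.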

I am on CUDA 11.8

Has anybody seen this before?

Mar 01 '24 14:03 bertsky

I've seen these kinds of memory leaks happen with TF 1, but AFAICR not with TF 2. (See https://github.com/qurator-spk/sbb_column_classifier - I think just upgrading fixed it, but maybe the "TF best practices" were necessary too.)

Apr 25 '24 18:04 mikegerber

What I describe happens on TF 2.13.1, which should be fully supported.

This issue is a show-stopper for me, as with OCR-D, it's not even possible to keep the results already produced (since they are only persisted in the METS at the end of the loop).

@mikegerber what do you mean by TF Best Practices – some particular document perhaps?

Apr 29 '24 12:04 bertsky

> @mikegerber what do you mean by TF Best Practices – some particular document perhaps?

The things I did in sbb_column_classifier to make it process ~ 20 million pages:

1a. Updating to TF2
1b. IIRC, using TF graph execution and TF functions (JIT?)
2. Dealing with flow problems due to the interweaved CPU processing. (I would probably look into using some kind of bounded queues now, but solved it using semaphores at the time.)

I'm not sure if I did 1b to fix any memory leaks, may have just been for better performance.
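For item 2, the bounded-queue idea could look roughly like the sketch below. It is not the code from sbb_column_classifier (which used semaphores); `run_bounded`, `preprocess`, and `predict` are hypothetical names, and the sketch assumes CPU preprocessing runs in one producer thread while prediction runs in the main thread.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def run_bounded(pages, preprocess, predict, maxsize=4):
    """Run preprocess() in a producer thread and predict() in the caller,
    with at most `maxsize` preprocessed pages waiting in memory."""
    q = queue.Queue(maxsize=maxsize)
    results = []

    def producer():
        try:
            for page in pages:
                # put() blocks while the queue is full, so the CPU side
                # cannot run arbitrarily far ahead of the predictor.
                q.put(preprocess(page))
        finally:
            # Always signal completion, even if preprocess() raised.
            q.put(_DONE)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        item = q.get()
        if item is _DONE:
            break
        results.append(predict(item))
    worker.join()
    return results
```

The `maxsize` bound is the point of the design: it caps how many decoded images are held at once, regardless of how much faster preprocessing is than prediction. Note that an exception in `preprocess` only ends the run early in this sketch; real code would need to propagate it.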

May 07 '24 11:05 mikegerber
