tesserocr v2.4.0 is leaking handles

Open chrisoro opened this issue 6 years ago • 10 comments

I can observe on more than one PC (Linux and Windows) that calling image_to_text leaves open handles that are never released.

Just create a while True loop for image_to_text and observe the handle count.
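
One way to observe the handle count from inside the process is sketched below. This is only a sketch and not part of tesserocr: count_open_handles and handle_growth are hypothetical helper names, the Linux branch reads /proc/self/fd, and the fallback assumes the third-party psutil package is installed (which would be needed on Windows).

```python
import os


def count_open_handles():
    """Return the number of handles/file descriptors open in this process."""
    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        # Linux: one entry per open descriptor. The listing itself briefly
        # opens one descriptor, so compare differences, not absolute values.
        return len(os.listdir(fd_dir))
    # Assumption: psutil is installed on platforms without /proc.
    import psutil
    proc = psutil.Process()
    if hasattr(proc, "num_handles"):  # only available on Windows
        return proc.num_handles()
    return proc.num_fds()  # other UNIX platforms


def handle_growth(func, iterations=100):
    """Call func repeatedly and return how many handles were gained."""
    before = count_open_handles()
    for _ in range(iterations):
        func()
    return count_open_handles() - before
```

For this report, func could be something like lambda: tesserocr.image_to_text(IMAGE). A result that keeps growing as iterations increases would point to a leak.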

Same goes for tesserocr v2.3.1 based on tesseract 4. In tesserocr v2.3.1 based on tesseract 3 it works fine.

Is that the wrapper or tesseract 4 itself?

chrisoro Jul 26 '19 19:07

This function hasn't changed for a while iirc and I see the function already does all necessary cleanup. Maybe there needs to be some extra cleanup with tesseract 4? Any input on this is appreciated.

Did you try testing v2.4.0 with tesseract 3?

sirfz Jul 26 '19 20:07

I used the wheels provided by @simonflueckiger and he doesn't have that combination.

I use your project to basically interface with tesseract in Python without the hassle of doing process/C calls myself. I found out that during long runs of my software, the performance decreases drastically and sometimes it hangs. That led me to observe ~1-2 new handles on each image_to_text call, and since I do lots of OCR it resulted in 500k open handles after a few hours.

I traced it back to your function but didn't investigate further. Have you changed anything vital from 2.3.1 to 2.4? If not, it must be tesseract.

Would be nice if you can confirm my findings since I don't want to create fuss with something that might be my fault (wrong usage?).

What I did find out is that the code itself seems fine and that it depends on the tessdata you use. If I use the tessdata for 3.x it works fine, but if I use the 4.0 tessdata, handles start leaking.

import PIL.Image
import tesserocr

FILE_PATH = 'ADD_RANDOM_IMAGE'
IMAGE = PIL.Image.open(FILE_PATH)

while True:
	tesserocr.image_to_text(IMAGE, lang="deu_best")
	tesserocr.file_to_text(FILE_PATH, lang="deu_best")

This does leak. After changing the lang to a 3.x traineddata, it doesn't leak. Same results for English. So in short:

2.4 with tesseract 4 and traineddata 4 -> leak
2.4 with tesseract 4 and traineddata 3 -> no leak
2.3.1 with tesseract 4 and traineddata 4 -> leak
2.3.1 with tesseract 4 and traineddata 3 -> no leak
2.3.1 with tesseract 3 and traineddata 3 -> no leak

So I guess it's the tesseract/tessdata combination that somehow leaks?!

chrisoro Jul 27 '19 08:07

Perhaps it's some feature available with traineddata v4 that's causing this. However, this isn't the most efficient way of using tesserocr, because image_to_text and file_to_text instantiate a new PyTessBaseAPI instance per call, which adds a lot of overhead. A better approach would be something like:

import tesserocr
from PIL import Image

api = tesserocr.PyTessBaseAPI(lang='deu_best')

for filename in images_list:  # images_list: any iterable of image file paths
    image = Image.open(filename)
    api.SetImage(image)
    # you can use api.SetImageFile(filename) instead if you don't need the image in Python
    api.GetUTF8Text()

It'd be interesting to see if you get the same behavior using this code.

sirfz Jul 29 '19 14:07

Interesting.. I guess there seems to be an issue in the PyTessBaseAPI init? Using tessdata v4:

Leak:

while True:
	api = tesserocr.PyTessBaseAPI(lang='deu')
	api.SetImageFile(FILE_PATH)
	api.GetUTF8Text()

No leak:

api = tesserocr.PyTessBaseAPI(lang='deu')
api.SetImageFile(FILE_PATH)
while True:
	api.GetUTF8Text()

Also, performance-wise there is a huge difference. 100 runs on some random high-res image:
My code: 1.191847005s
Your code: 0.1189847005s

But this seems to be due to some kind of caching? Reducing the runs to 1 I get:
My code: 1.2133466000000002
Your code: 0.9077446000000002
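
A comparison like this can be reproduced with a small timing helper. This is a sketch only: time_runs is a hypothetical name, and the two functions being timed are left to the caller.

```python
import time


def time_runs(func, runs=100):
    """Return the total wall-clock seconds taken by `runs` calls of func."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    return time.perf_counter() - start
```

Timing a function that constructs a PyTessBaseAPI on every call against one that reuses a single instance, once with runs=1 and once with runs=100, shows how much of the total is paid per call.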

chrisoro Jul 29 '19 16:07

In your first example, you need to destroy the api (which is the behavior in the *_to_text helper functions):

while True:
    with tesserocr.PyTessBaseAPI(lang='deu') as api:
        api.SetImageFile(FILE_PATH)
        api.GetUTF8Text()

If you don't use the context manager then you'll have to manually call api.End() to properly finalize it.
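
For the manual route, a try/finally block gives the same guarantee as the context manager, namely that End() runs even if recognition raises. A minimal sketch: ocr_file and its api_factory parameter are hypothetical and exist only so the pattern can be exercised without Tesseract installed; by default it uses tesserocr.PyTessBaseAPI.

```python
def ocr_file(path, lang="deu", api_factory=None):
    """Run OCR on one file and always finalize the API afterwards."""
    if api_factory is None:
        # Imported lazily so the sketch can be tested with a stand-in class.
        import tesserocr
        api_factory = tesserocr.PyTessBaseAPI
    api = api_factory(lang=lang)
    try:
        api.SetImageFile(path)
        return api.GetUTF8Text()
    finally:
        # Equivalent to leaving the `with` block: release the instance.
        api.End()
```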

sirfz Jul 29 '19 16:07

OK, I understand. But the first example behaves the same (leaking) as the *_to_text helpers. Maybe the code there isn't working properly? Can you reproduce my results?

Anyways, thanks for the help, my issue is solved by just not using the helper and instead using your proposed solution.
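
The workaround described here, one long-lived API object instead of one per call, could be wrapped roughly like this. It is a sketch under assumptions: ReusableOcr and api_factory are hypothetical names, not part of tesserocr, and the factory parameter is only there so the class can be tried without Tesseract installed.

```python
class ReusableOcr:
    """Create the Tesseract API once and reuse it for every image."""

    def __init__(self, lang="deu", api_factory=None):
        if api_factory is None:
            # Imported lazily so a stand-in class can be injected instead.
            import tesserocr
            api_factory = tesserocr.PyTessBaseAPI
        self._api = api_factory(lang=lang)

    def file_to_text(self, path):
        """Recognize one image file with the shared API instance."""
        self._api.SetImageFile(path)
        return self._api.GetUTF8Text()

    def close(self):
        """Finalize the shared instance when the program is done with OCR."""
        self._api.End()
```

Because the instance is created once, any per-initialization handle growth is paid a single time rather than on every OCR call.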

I'm still curious why the helper leaks when using tessdata v4.

chrisoro Jul 29 '19 17:07

Leaking is to be expected if the api is not finalized within the loop. You'll have to run it the way I wrote it in my last comment (unless you already did and saw the same results).

sirfz Jul 29 '19 17:07

It seems that the problem is in api.End(), since it doesn't free up any handles, neither when called manually nor when using the context manager.

Can you please run your code https://github.com/sirfz/tesserocr/issues/188#issuecomment-516074118 using some kind of high-res image with any v4 traineddata?

I don't even need an image to reproduce the handle leak; just entering and exiting the context is enough:

while True:
  with tesserocr.PyTessBaseAPI(lang='deu'):
    pass

chrisoro Jul 29 '19 17:07

@sirfz can you reproduce the issue by running the infinite loop? Just use tesseract 4, tesserocr 2.4 and tessdata 4 with any lang (I checked eng and deu).

chrisoro Aug 07 '19 11:08

Hey! Is there any update on this? I'm facing exactly the same behaviour with the following code:

from tesserocr import PyTessBaseAPI

while True:
    api = PyTessBaseAPI()
    api.End()

The handle count keeps increasing. They all seem to be mutex handles. I've been trying to figure this out for quite a while. To me it seems like the Python wrapper around the DLL isn't freeing anything before the interpreter exits. I also tried triggering the garbage collector manually.

finkformatics Nov 17 '21 16:11