tesserocr
tesserocr v2.4.0 is leaking handles
I can observe on more than one PC (Linux and Windows) that calling image_to_text leaves open handles that are never released.
Just create a while True loop for image_to_text and observe the handle count.
Same goes for tesserocr v2.3.1 based on tesseract 4. In tesserocr v2.3.1 based on tesseract 3 it works fine.
Is that the wrapper or tesseract 4 itself?
This function hasn't changed for a while iirc and I see the function already does all necessary cleanup. Maybe there needs to be some extra cleanup with tesseract 4? Any input on this is appreciated.
Did you try testing v2.4.0 with tesseract 3?
I used the wheels provided by @simonflueckiger and he doesn't have that combination.
I use your project to basically interface with tesseract in Python without the hassle of doing process/C calls myself. I found out that during long runs of my software, the performance decreases drastically and sometimes it hangs. That led me to observing ~1-2 new handles on each image_to_text call, and since I do lots of OCR it resulted in 500k open handles after a few hours.
I traced it back to your function but didn't investigate further. Have you changed anything vital from 2.3.1 to 2.4? If not, it must be tesseract.
Would be nice if you can confirm my findings since I don't want to create fuss with something that might be my fault (wrong usage?).
What I did find out is that it seems the code itself is fine but it's up to the tessdata that you use. If I use the tessdata for 3.x it works fine but if I use 4.0 tessdata handles start leaking.
import PIL.Image
import tesserocr
FILE_PATH = 'ADD_RANDOM_IMAGE'
IMAGE = PIL.Image.open(FILE_PATH)
while True:
    tesserocr.image_to_text(IMAGE, lang="deu_best")
    tesserocr.file_to_text(FILE_PATH, lang="deu_best")
This does leak. Changing the lang to a 3.x traineddata, it doesn't leak. Same results for English. So in short:
2.4 with tesseract 4 and traineddata 4 -> leak
2.4 with tesseract 4 and traineddata 3 -> no leak
2.3.1 with tesseract 4 and traineddata 4 -> leak
2.3.1 with tesseract 4 and traineddata 3 -> no leak
2.3.1 with tesseract 3 and traineddata 3 -> no leak
So I guess it's the tesseract/tessdata combination that somehow leaks?!
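One way to watch the handle count around a loop like the one above is sketched here. This is only a minimal sketch and not part of tesserocr: count_open_handles and handle_growth are hypothetical helper names, and the counting assumes Linux, where /proc/self/fd lists the process's open descriptors. Whether the leaked handles show up there as file descriptors is itself an assumption; on Windows a different counter (for example psutil's Process.num_handles()) would be needed.

```python
import os


def count_open_handles():
    """Return the number of open file descriptors of this process.

    Assumption: Linux, where /proc/self/fd has one entry per descriptor.
    """
    return len(os.listdir("/proc/self/fd"))


def handle_growth(func, iterations=100):
    """Call func() repeatedly and return how many descriptors were gained."""
    before = count_open_handles()
    for _ in range(iterations):
        func()
    return count_open_handles() - before
```

Passing a callable that wraps tesserocr.image_to_text(IMAGE, lang="deu_best") would then give the growth per batch of calls; a number that keeps rising with the iteration count would point to a leak.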
Perhaps it's some feature available with traineddata v4 that's causing this. However, this isn't the most efficient way of using tesserocr, because image_to_text and file_to_text instantiate a new PyTessBaseAPI instance per call, which adds a lot of overhead. A better approach would be something like:
import tesserocr
from PIL import Image
api = tesserocr.PyTessBaseAPI(lang='deu_best')
for filename in images_list:
    image = Image.open(filename)
    api.SetImage(image)
    # you can use api.SetImageFile(filename) instead if you don't need the image in Python
    api.GetUTF8Text()
It'd be interesting to see if you get the same behavior using this code.
Interesting.. I guess there seems to be an issue in the PyTessBaseAPI init? Using tessdata v4:
Leak:
while True:
    api = tesserocr.PyTessBaseAPI(lang='deu')
    api.SetImageFile(FILE_PATH)
    api.GetUTF8Text()
No leak:
api = tesserocr.PyTessBaseAPI(lang='deu')
api.SetImageFile(FILE_PATH)
while True:
    api.GetUTF8Text()
Also, performance-wise there is a huge difference. 100 runs on some random high-res image:
My code: 1.191847005s
Your code: 0.1189847005s
But this seems to be due to some kind of caching? Reducing the runs to 1 I get:
My code: 1.2133466000000002
Your code: 0.9077446000000002
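A comparison like this can be taken with time.perf_counter. The following is a minimal sketch only; time_runs is a hypothetical helper, and the callable passed in stands for either variant (constructing the API per call, or reusing one instance).

```python
import time


def time_runs(func, runs=100):
    """Return the total wall-clock seconds spent on `runs` calls of func()."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    return time.perf_counter() - start
```

Timing each variant once with runs=1 and once with runs=100 gives the two sets of figures being compared.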
In your first example, you need to destroy the api (which is the behavior in the *_to_text helper functions):
while True:
    with tesserocr.PyTessBaseAPI(lang='deu') as api:
        api.SetImageFile(FILE_PATH)
        api.GetUTF8Text()
If you don't use the context manager then you'll have to manually call api.End() to properly finalize it.
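For the manual route, a try/finally keeps End() from being skipped when the OCR call raises. A minimal sketch: with_api is a hypothetical helper, not a tesserocr function, and it only assumes that the object returned by make_api has an End() method, as PyTessBaseAPI does.

```python
def with_api(make_api, work):
    """Create an API object, run work(api), and always call api.End()."""
    api = make_api()
    try:
        return work(api)
    finally:
        # Runs whether work() returned normally or raised.
        api.End()
```

Here make_api would be something like lambda: tesserocr.PyTessBaseAPI(lang='deu'), and work a function that calls api.SetImageFile(FILE_PATH) followed by api.GetUTF8Text().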
OK, I understand. But the first example behaves the same (leaking) as the *_to_text helpers. Maybe the code there isn't working properly? Can you reproduce my results?
Anyways, thanks for the help, my issue is solved by just not using the helper and instead using your proposed solution.
I'm still curious why the helper leaks when using tessdata v4.
Leaking is to be expected if the api is not finalized within the loop, you'll have to run it the way I wrote it in my last comment (unless you already did and saw the same results).
It seems that the problem is in api.End(), since it doesn't free up any handles, neither when called manually nor when using the context manager.
Can you please run your code https://github.com/sirfz/tesserocr/issues/188#issuecomment-516074118 using some kind of high-res image with any v4 traineddata?
I don't even need an image to reproduce the handle leak; just enter/exit is enough:
while True:
    with tesserocr.PyTessBaseAPI(lang='deu'):
        pass
@sirfz can you reproduce the issue by running the infinite loop? Just use tesseract4, tesserocr 2.4 and tessdata4. any lang (I checked eng, deu)
Hey! Is there any update on this? I'm facing exactly the same behaviour with the following code:
while True:
    api = PyTessBaseAPI()
    api.End()
The handle count keeps increasing. They all seem to be mutex handles. I've been trying to figure this out for quite a while. To me it seems like the Python wrapper around the DLL isn't freeing anything before the interpreter exits. I also tried manually triggering the garbage collector each time.