tesserocr

Parallel requests increase time

oanamocean opened this issue on May 19, 2020 · 2 comments

Hey, I have an API that uses this code to predict text from different images, but I'm having trouble understanding why the performance is so bad when I run multiple requests in parallel.

    import logging

    from tesserocr import PSM, RIL, PyTessBaseAPI, iterate_level


    def recognize_text(image, lang, psm=PSM.SINGLE_LINE):
        recognized_text = []
        with PyTessBaseAPI(psm=psm, lang=lang) as api:
            api.SetPageSegMode(psm)  # redundant: psm is already set via the constructor
            api.SetImage(image)
            # SetSourceResolution must come after SetImage, otherwise Tesseract ignores it.
            api.SetSourceResolution(300)
            api.SetVariable("tessedit_do_invert", "0")
            api.Recognize()
            ri = api.GetIterator()
            for r in iterate_level(ri, RIL.TEXTLINE):
                try:
                    recognized_text.append(r.GetUTF8Text(RIL.TEXTLINE))
                except RuntimeError as exception:
                    logging.exception(exception)
        return recognized_text

If I run one request, the time is around 2 seconds, but if I start running 10 requests at the same time it goes up to 40 seconds. I've read a lot about how to optimise and get better times, and I've tried different Tesseract variables and configurations, but I still couldn't find a solution. I've also set OMP_THREAD_LIMIT to 1, but it's not enough.

Any ideas about this?

oanamocean · May 19 '20 15:05

You're initializing the API for each request, which probably adds significant overhead. Try initializing a pool of PyTessBaseAPI instances, reusing them across request threads, and see if that improves the run time.
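A minimal sketch of that idea (not from the library docs: the pool size, the "eng" language, and the borrow_api / recognize_text_pooled names are illustrative assumptions) creates the instances once at startup and checks them out of a queue.Queue per request:

    import queue
    from contextlib import contextmanager

    from tesserocr import PSM, PyTessBaseAPI

    POOL_SIZE = 4  # illustrative: roughly the number of worker threads

    # Initialize the (comparatively expensive) API objects once at startup.
    _api_pool = queue.Queue()
    for _ in range(POOL_SIZE):
        _api_pool.put(PyTessBaseAPI(psm=PSM.SINGLE_LINE, lang="eng"))

    @contextmanager
    def borrow_api():
        """Check an API instance out of the pool and return it when done."""
        api = _api_pool.get()
        try:
            yield api
        finally:
            api.Clear()  # free image/result data but keep the loaded model
            _api_pool.put(api)

    def recognize_text_pooled(image):
        with borrow_api() as api:
            api.SetImage(image)
            api.SetSourceResolution(300)
            return api.GetUTF8Text()

If no instance is free, queue.Queue.get() blocks, which also caps how many recognitions run concurrently.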

sirfz · May 19 '20 16:05

Also, because of the GIL, I recommend using multiprocessing instead of multithreading.

As for the details, it depends on whether you want to do batch processing (e.g. over a bunch of files) or on-demand processing (e.g. in a server). For the former, see this example; for the latter, I recommend something based on mp.Queue and mp.Process.
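The comment doesn't spell out that setup; one possible sketch (the ocr_worker name, worker count, language, and file names are illustrative assumptions) gives each worker process its own PyTessBaseAPI and feeds it image paths through an mp.Queue:

    import multiprocessing as mp

    from tesserocr import PSM, PyTessBaseAPI

    def ocr_worker(jobs, results):
        # One API instance per process, created once and reused for every job.
        with PyTessBaseAPI(psm=PSM.AUTO, lang="eng") as api:
            while True:
                path = jobs.get()
                if path is None:  # sentinel: no more work
                    break
                api.SetImageFile(path)
                results.put((path, api.GetUTF8Text()))

    if __name__ == "__main__":
        paths = ["page1.png", "page2.png"]  # illustrative file names
        jobs, results = mp.Queue(), mp.Queue()
        workers = [mp.Process(target=ocr_worker, args=(jobs, results))
                   for _ in range(2)]
        for w in workers:
            w.start()
        for path in paths:
            jobs.put(path)
        for _ in workers:
            jobs.put(None)  # one sentinel per worker
        # Drain the results before joining to avoid blocking on a full queue.
        for _ in paths:
            print(results.get())
        for w in workers:
            w.join()

In a server, the worker processes would instead stay alive and the request handlers would keep putting jobs on the queue.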

bertsky · Jul 02 '21 19:07