
Significant number of timeouts when using threading with the Grobid Docker service

Open · matthieu-perso opened this issue · 2 comments

Configuration

  • Using the Docker service both locally (2017 Mac, 8 GB RAM) and as a GCP Cloud Run instance (4 GB RAM, 80 threads)

Problem

  • In both cases, I tried to speed up training by implementing a very basic thread pool that calls the service.
  • Both locally and in the cloud, about 20% of my threads time out with the standard Grobid message [TIMEOUT] PDF to XML conversion timed out.
  • Even with a low number of workers (5), I still get a significant number of timeouts.
  • I'd assume my machines are powerful enough to run the software, so it shouldn't be a capacity limit, but my knowledge here is admittedly limited.

What would be the reason the service times out so fast? Are there any workarounds to ensure that all requests complete?

Code (for the local instance; the cloud version is identical except for the URL and token)

import concurrent.futures
import glob
import time

import requests

start_time = time.time()


def requesting(path, index):
    '''Sends one PDF to the GROBID service and returns the TEI XML.'''
    cloud_token = ""  # empty for the local instance; set for the cloud one
    headers = {
        'Authorization': f"bearer {cloud_token}"}

    # Open the PDF in a context manager so the file handle is always closed.
    with open(path, 'rb') as pdf:
        files = {'input': pdf}
        response = requests.post('http://localhost:8070/api/processFulltextDocument', files=files, headers=headers)
    return response.text, index


def main():
    filelist = glob.glob('./download/unpacked/**/*.pdf', recursive=True)

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as thread_pool:
        futures = [
            thread_pool.submit(requesting, path, index)
            for index, path in enumerate(filelist)
        ]

        # Write each result as soon as its future completes.
        for future in concurrent.futures.as_completed(futures):
            data, index = future.result()
            with open(f'thread_{index}.xml', 'w') as f:
                f.write(data)


if __name__ == '__main__':
    main()
    print("--- %s seconds ---" % (time.time() - start_time))

matthieu-perso · Aug 08 '22

Hello @MatthieuMoullecDev!

Thank you for your interest in Grobid and for opening the issue.

You can use the Grobid Python client, which is very well tested and has scaled to 12M PDFs. Without managing server availability (503 responses), you will certainly get these timeouts, but the Python client manages them for you.
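
For reference, a minimal sketch of that client usage (assuming the grobid_client_python package's API; config.json holds the server URL and client-side settings, and the input/output paths are placeholders taken from the script above):

from grobid_client.grobid_client import GrobidClient

# config.json points the client at the Grobid server.
client = GrobidClient(config_path="./config.json")

# Process every PDF under the input folder and write TEI XML to the output
# folder; n is the number of concurrent workers on the client side.
client.process(
    "processFulltextDocument",
    "./download/unpacked",
    output="./tei_out",
    n=10,
)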

Then the main adaptation to avoid timeouts is in the server settings. You can have a look at the FAQ entry on the topic here. Two important aspects, judging from your description, are the amount of RAM and the number of threads. The thread settings in the client and in the Grobid server need to be aligned with the actual number of threads available on the server.
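
To illustrate that alignment point with plain requests (a hedged sketch of the idea, not the actual client implementation; SERVER_CONCURRENCY, request_with_backoff, and the timeout value are illustrative assumptions): size the client's thread pool to the server's configured concurrency, and treat a 503 response as back-pressure to retry rather than as a failure.

import time
import requests

# Illustrative assumption: set this to the concurrency configured on the
# Grobid server, so client workers never outnumber server threads.
SERVER_CONCURRENCY = 10

def request_with_backoff(path, max_retries=5):
    '''Posts one PDF; backs off and retries when the server answers 503,
    its signal that all processing threads are currently busy.'''
    for attempt in range(max_retries):
        with open(path, 'rb') as pdf:
            response = requests.post(
                'http://localhost:8070/api/processFulltextDocument',
                files={'input': pdf},
                timeout=300,  # generous read timeout for large PDFs
            )
        if response.status_code != 503:
            return response.text
        time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError(f"server still busy after {max_retries} tries: {path}")

A ThreadPoolExecutor(max_workers=SERVER_CONCURRENCY) wrapped around this function would then match the client load to what the server can actually accept.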

kermitt2 · Aug 08 '22

Hey Patrice,

Thanks for your quick and helpful reply!

I saw the Python client but was struggling with an error that I managed to debug (write-up here). I will give it a go.

Thanks for the link to the production FAQ; I will follow those guidelines and go from there.

matthieu-perso · Aug 09 '22