Workers have no effect on inference time
📚 The doc issue
Changing minWorkers and maxWorkers from 1 to 2 or 3 makes no difference in inference time: all three worker counts take the same amount of time to handle 1000 requests sent from multiple threads. Is this a bug, or have I made a mistake in my config.properties?
My config.properties:
inference_address=http://0.0.0.0:8000
management_address=http://0.0.0.0:8001
metrics_address=http://0.0.0.0:8002
grpc_inference_port=7000
grpc_management_port=7001
cpu_launcher_enable=true
cpu_launcher_args=--use_logical_core
number_of_gpu=0
models={"my_tc": {"1.0": {"marName": "my_text_classifier.mar","minWorkers": 2,"maxWorkers": 2,"batchSize": 16,"maxBatchDelay": 20,"deviceType": "cpu"}}}
My code:
import threading
from time import time

import requests

def temp():
    # Send a single inference request to the my_tc endpoint.
    a = requests.post("http://127.0.0.1:8000/predictions/my_tc", data="boxset")
    print(a.text)
    print(a.status_code)

# Fire 100 requests concurrently and measure how long they all take.
list_thread = [threading.Thread(target=temp) for i in range(100)]
start = time()
for thread in list_thread:
    thread.start()
for thread in list_thread:
    thread.join()
print(time() - start)
Suggest a potential alternative/fix
No response
Hi @ToanLyHoa, you can refer to this PR, which improved CPU performance in our nightly benchmark: https://github.com/pytorch/serve/pull/2166
You can use TorchServe's benchmarking tool to configure num_workers and benchmark the performance.
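For reference, the current worker configuration can also be checked and changed at runtime through the management API. A minimal sketch in Python, assuming the model name my_tc and the management address (port 8001) from the config above:

import requests

# Management address from config.properties above (management_address, port 8001).
MGMT = "http://127.0.0.1:8001"

# Describe the model: the response lists every running worker together with
# the effective batchSize, maxBatchDelay, minWorkers and maxWorkers.
print(requests.get(f"{MGMT}/models/my_tc").json())

# Scale to 4 workers; synchronous=true makes the call block until they are up.
resp = requests.put(
    f"{MGMT}/models/my_tc",
    params={"min_worker": 4, "max_worker": 4, "synchronous": "true"},
)
print(resp.status_code, resp.text)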
Hi @agunapal, can you answer this question for me? If my model serves 100 requests in 1 second with workers = 1, does that mean it should serve 100 requests in 0.5 seconds with workers = 2? Is my understanding correct?
@agunapal I tried sending 1000 requests with workers = 1 and batchSize = 16 for my_text_classifier, and it returned status code 503 saying: "Model "my_tc" has no worker to serve inference request. Please use scale workers API to add workers." When I set workers = 16 and batchSize = 1, there were no more 503s. However, when I send 100 requests, the speed with workers = 1, batchSize = 16 is the same as with workers = 16, batchSize = 1, and even with workers = 16, batchSize = 16. I thought the speed should differ by roughly 16x when I change the workers and batch size. Can you tell me what I am misunderstanding about TorchServe?
@ToanLyHoa This is a bit more complicated. Writing a custom tool to measure this requires a good understanding of TorchServe. For example, for processing requests the frontend has a queue size of 100 by default, so depending on the model's processing time, if you send 1000 requests concurrently, 900 of them can be dropped. You need to design the client taking this into account. There is nothing preventing multiple workers in TorchServe from processing requests simultaneously; you will notice this more prominently in a multi-GPU setup. Depending on what CPU you are using and how the OS schedules these processes, you may or may not see the perf improvement. You can refer to this blog on how to improve performance on Intel CPUs: https://pytorch.org/tutorials/intermediate/torchserve_with_ipex.html
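One way to design such a client is to cap the number of requests in flight so the frontend queue (100 by default) never overflows, and to count status codes so dropped requests are visible. A minimal sketch, assuming the same endpoint and payload as the code above:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8000/predictions/my_tc"
TOTAL_REQUESTS = 1000
MAX_IN_FLIGHT = 50  # stay below the default frontend queue size of 100

def send_one(_):
    # Return the HTTP status code so 200s and 503s can be counted afterwards.
    return requests.post(URL, data="boxset").status_code

start = time.time()
with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
    codes = list(pool.map(send_one, range(TOTAL_REQUESTS)))
elapsed = time.time() - start

counts = {code: codes.count(code) for code in set(codes)}
print(f"{TOTAL_REQUESTS} requests in {elapsed:.2f}s "
      f"({TOTAL_REQUESTS / elapsed:.1f} req/s), status codes: {counts}")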
Finally, I would recommend using this benchmarking tool that we have. You can see an example here https://github.com/pytorch/serve/tree/master/examples/benchmarking/resnet50
You can set the number of workers and batch_size and see the effect this has on throughput/latency.
Also, I noticed that you have maxBatchDelay set to 20; I would increase this. You can also add prints in your handler to see how many requests are being batched.
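For example, a minimal sketch of such a print, assuming a handler built on TorchServe's BaseHandler (the my_text_classifier example ships its own handler, so in practice the same print would go at the top of that handler's handle or preprocess method):

from ts.torch_handler.base_handler import BaseHandler

class BatchSizeLoggingHandler(BaseHandler):
    """Same behaviour as BaseHandler, plus a print of the batch size."""

    def handle(self, data, context):
        # `data` holds every request the frontend grouped into this batch,
        # so its length is the batch size this worker actually received.
        print(f"batch size seen by worker: {len(data)}")
        return super().handle(data, context)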