Do GPU and CPU block each other?
📚 Documentation
I did not find any documentation about the processes/threads used for GPU and CPU work. During GPU inference, can we continue CPU pre/post-processing of other objects asynchronously?
If yes, how do you send data between processes (queue/file/socket)? Maybe you can provide a link to the code. If not, do you have a plan to support it?
When you author a Python handler file, which we call the backend, it's spawned as a process by the Java part of the codebase, which we call the frontend. The frontend and backend communicate via sockets.
I believe your question is about whether we pipeline preprocessing when an inference is slow. I'm not sure we do, but maybe @lxning @HamidShojanazeri or @maaquib know.
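As a rough sketch, the handler file the backend worker loads just needs to expose an entry point that the worker calls for each batch of requests it receives over the socket (the load_model/preprocess/postprocess helpers below are placeholders for illustration, not TorchServe APIs):

# my_handler.py -- minimal sketch of a module-level entry point
_model = None

def handle(data, context):
    global _model
    if _model is None:
        _model = load_model(context)   # placeholder: load weights once per worker process
    if data is None:
        return None
    inputs = preprocess(data)          # placeholder: decode the request payloads (CPU work)
    outputs = _model(inputs)           # forward pass (GPU work if available)
    return postprocess(outputs)        # placeholder: return something JSON-serializable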
The way we scale is by increasing the number of workers in config.properties. Imagine each worker as a different process running the same handler code, so it's embarrassingly parallel: one worker can be doing preprocessing while another is doing inference.
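For example, in config.properties (illustrative values, adjust for your hardware):

# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
# each worker is a separate backend process running the same handler code
default_workers_per_model=4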
If you're looking for source code you can browse, you can learn more here: https://github.com/pytorch/serve/blob/master/docs/internals.md
@msaroufim Imagine that you have a video payload and you want to run inference on each frame. The decoding can be performed on the CPU, and as soon as each frame (or batch of frames) gets decoded, we feed it to the GPU for a forward pass. This way we overlap CPU preprocessing with GPU model execution, which can substantially reduce latency and allows processing videos of arbitrary length.
How (and if) can we do it with TorchServe today?
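For reference, outside of TorchServe the overlap I have in mind looks roughly like this (a sketch only, assuming OpenCV for decoding and a model that accepts raw frame tensors):

import queue
import threading

import cv2
import torch

def decode_frames(video_path, frame_queue):
    # CPU-side producer: decode frames and hand them to the GPU consumer
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_queue.put(frame)
    frame_queue.put(None)  # sentinel: no more frames
    cap.release()

def run_inference(model, video_path, device="cuda"):
    frame_queue = queue.Queue(maxsize=8)  # small buffer between CPU decode and GPU inference
    producer = threading.Thread(target=decode_frames, args=(video_path, frame_queue))
    producer.start()

    outputs = []
    with torch.no_grad():
        while True:
            frame = frame_queue.get()
            if frame is None:
                break
            # HWC uint8 frame -> NCHW float tensor on the GPU
            tensor = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0)
            outputs.append(model(tensor.to(device, non_blocking=True)))
    producer.join()
    return outputs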
@alar0330 based on my experience, the GPU process does not block the CPU workers. I did not find the code to prove it, but I ran several tests to convince myself. The GPU does need its own CPU process and it occupies a core, but if you have several cores, the other cores will continue CPU-bound preprocessing tasks simultaneously.
I guess your question is a bit different, though. If you want to process video, you have to decode it on the client side and use TorchServe only for images. If you send the whole video, then you will need to override the handle method, and the CPU and GPU will work synchronously. But I'm not sure.
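For example, a rough client-side sketch (assuming a model named "my_model" is already registered and accepts single JPEG images):

import cv2
import requests

def infer_video(video_path, url="http://localhost:8080/predictions/my_model"):
    cap = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Decode on the client, send each frame as a JPEG to TorchServe
        ok, buf = cv2.imencode(".jpg", frame)
        if not ok:
            continue
        resp = requests.post(url, data=buf.tobytes())
        results.append(resp.json())  # assumes the model returns JSON
    cap.release()
    return results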
I'm sorry for the delay @alar0330, but it sounds like you're asking for pipelined execution when doing heavyweight preprocessing. As of today I don't believe we support this, but we could do something like it when I finish #1546
@alar0330 if I'm not mistaken, you are sending the whole video as one request? If it's not streaming, then I think it should be doable in a custom handler. Does something like this help?
class CustomHandler:
    def initialize(self, context):
        # Load the model once when the worker starts
        self.model = load_model()

    def frame_process(self, video):
        processed_frame = process(video)
        return processed_frame

    def preprocess(self, request):
        video = decode(request)
        return video

    def inference(self, video):
        inferences = []
        number_of_frames = metadata(video)
        for i in range(number_of_frames):  # or we could make a buffer here
            frame = self.frame_process(video)  # or spawn multiple processes to process the video frames, not sure if there is any perf hit here
            output = self.model(frame)
            inferences.append(output)
        return inferences
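You'd then package a handler like this with torch-model-archiver (passing it via --handler) and register the resulting .mar as usual; postprocess could return the aggregated per-frame outputs as the response.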