Do GPU and CPU block each other?
📚 Documentation
I did not find any documentation about the processes/threads used for GPU and CPU work. During GPU inference, can we continue CPU pre/post-processing of other objects asynchronously?
If yes, how do you send data between processes (queue/file/socket)? Maybe you can provide a link to the code. If not, do you have a plan to support it?
When you author a Python handler file, which we call the backend, it's spawned as a process by the Java part of the codebase, which we call the frontend. The frontend and backend communicate via sockets.
I believe your question is about whether we pipeline preprocessing when an inference is slow. I'm not sure we do, but maybe @lxning @HamidShojanazeri or @maaquib know.
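As a rough sketch, the handler file the backend worker loads just needs to expose an entry point that the worker calls for each batch of requests it receives over the socket (the load_model/preprocess/postprocess helpers below are placeholders for illustration, not TorchServe APIs):

# my_handler.py -- minimal sketch of a module-level entry point
_model = None

def handle(data, context):
    global _model
    if _model is None:
        _model = load_model(context)   # placeholder: load weights once per worker process
    if data is None:
        return None
    inputs = preprocess(data)          # placeholder: decode the request payloads (CPU work)
    outputs = _model(inputs)           # forward pass (GPU work if available)
    return postprocess(outputs)        # placeholder: return something JSON-serializable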
The way we scale is by increasing the number of workers in config.properties. Imagine each worker as a different process running the same handler code, so it's embarrassingly parallel: one worker can be doing preprocessing while another is doing inference.
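For example, in config.properties (illustrative values, adjust for your hardware):

# config.properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
# each worker is a separate backend process running the same handler code
default_workers_per_model=4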
If you're looking for source code you can browse, you can learn more here: https://github.com/pytorch/serve/blob/master/docs/internals.md
@msaroufim Imagine that you have a video payload and you want to run inference on each frame. The decoding can be performed on the CPU, and as soon as each frame (or batch of frames) gets decoded, we feed it to the GPU for a forward pass. This way we overlap CPU preprocessing with GPU model execution, which can substantially reduce latency and allows processing videos of arbitrary length.
How (and if) can we do it with TorchServe today?
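For reference, outside of TorchServe the overlap I have in mind looks roughly like this (a sketch only, assuming OpenCV for decoding and a model that accepts raw frame tensors):

import queue
import threading

import cv2
import torch

def decode_frames(video_path, frame_queue):
    # CPU-side producer: decode frames and hand them to the GPU consumer
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame_queue.put(frame)
    frame_queue.put(None)  # sentinel: no more frames
    cap.release()

def run_inference(model, video_path, device="cuda"):
    frame_queue = queue.Queue(maxsize=8)  # small buffer between CPU decode and GPU inference
    producer = threading.Thread(target=decode_frames, args=(video_path, frame_queue))
    producer.start()

    outputs = []
    with torch.no_grad():
        while True:
            frame = frame_queue.get()
            if frame is None:
                break
            # HWC uint8 frame -> NCHW float tensor on the GPU
            tensor = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0)
            outputs.append(model(tensor.to(device, non_blocking=True)))
    producer.join()
    return outputs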
@alar0330 based on my experience, the GPU process does not block the CPU workers. I did not find the code to prove it, but I ran several tests to convince myself. The GPU does need its own CPU process and it occupies a core, but if you have several cores, the other cores will continue CPU-bound preprocessing tasks simultaneously.
I guess your question is a bit different, though. If you want to process video, you have to decode it on the client side and use TorchServe only for images. If you send the whole video, then you will need to override the handle method, and the CPU and GPU will work synchronously. But I'm not sure.
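For example, a rough client-side sketch (assuming a model named "my_model" is already registered and accepts single JPEG images):

import cv2
import requests

def infer_video(video_path, url="http://localhost:8080/predictions/my_model"):
    cap = cv2.VideoCapture(video_path)
    results = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Decode on the client, send each frame as a JPEG to TorchServe
        ok, buf = cv2.imencode(".jpg", frame)
        if not ok:
            continue
        resp = requests.post(url, data=buf.tobytes())
        results.append(resp.json())  # assumes the model returns JSON
    cap.release()
    return results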
I'm sorry for the delay @alar0330, but it sounds like you're asking for pipelined execution when doing heavyweight preprocessing. As of today I don't believe we support this, but we could do something like it when I finish #1546
@alar0330 if I'm not mistaken, you are sending the whole video as one request? If it's not streaming, then I think it should be doable in a custom handler. Does something like this help?
class CustomHandler:
    def initialize(self, context):
        # Load the model once when the worker starts
        self.model = load_model()

    def frame_process(self, video):
        processed_frame = process(video)
        return processed_frame

    def preprocess(self, request):
        video = decode(request)
        return video

    def inference(self, video):
        inferences = []
        number_of_frames = metadata(video)
        for i in range(number_of_frames):  # or we could make a buffer here
            frame = self.frame_process(video)  # or spawn multiple processes to process the video frames, not sure if there is any perf hit here
            output = self.model(frame)
            inferences.append(output)
        return inferences
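You'd then package a handler like this with torch-model-archiver (passing it via --handler) and register the resulting .mar as usual; postprocess could return the aggregated per-frame outputs as the response.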