Jay S
You were right. I just found out that it was a PyTorch issue. `import torch` has been causing the segfault this whole time. The base Docker image I was using...
I think if you use Llama it might work. I was able to make it work on V100 GPUs.
I see! Thanks for the responses, it totally makes sense. Is there a way to set up the timeout as well?
OK, when I tried this with a custom kernel, generation seems stable (even with 128 async requests; load-test sketch below), and I couldn't reproduce the error. However, I tried this...
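For reference, the 128-request load test looks roughly like this. It's a minimal sketch; the URL, the `/generate` endpoint, and the payload are placeholders for whatever your server actually exposes:

```python
import asyncio
import aiohttp

# Placeholders -- point these at the real server and request schema.
URL = "http://localhost:8000/generate"
PAYLOAD = {"prompt": "Hello", "max_tokens": 64}
NUM_REQUESTS = 128

async def one_request(session: aiohttp.ClientSession) -> int:
    # Send one generation request and drain the body so the
    # connection can be reused by the pool.
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()
        return resp.status

async def main() -> None:
    # Fire all requests concurrently and report non-200 statuses.
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *(one_request(session) for _ in range(NUM_REQUESTS))
        )
    failures = [s for s in statuses if s != 200]
    print(f"{len(statuses)} requests sent, {len(failures)} failures")

if __name__ == "__main__":
    asyncio.run(main())
```

A single shared session is deliberate: it keeps connection reuse realistic compared to spinning up a separate client per request.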