Pankaj Gupta
@johnynek curious what alternatives to options you considered. I don't think many of our users grok how the application of settings via naming flows through the graph, so I agree...
One solution could be to keep the model load thread around for the lifetime of the inference server process. The thread should exit to allow for graceful termination. This can...
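A minimal sketch of that idea: a dedicated loader thread that stays alive for the process lifetime, parked on a shutdown event so it can still exit cleanly. All names here (`ModelHolder`, `load_fn`) are hypothetical, not the actual server's API.

```python
import threading

class ModelHolder:
    """Keeps the model-load thread alive for the lifetime of the
    process, while still allowing graceful termination (hypothetical
    names; a sketch, not the real inference server)."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self.model = None
        self._loaded = threading.Event()
        self._shutdown = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # Load on this thread, then park instead of returning, so any
        # thread-local state created during load stays valid.
        self.model = self._load_fn()
        self._loaded.set()
        self._shutdown.wait()

    def wait_until_loaded(self, timeout=None):
        return self._loaded.wait(timeout)

    def shutdown(self):
        # Signal the parked thread to exit and wait for it; this is
        # the graceful-termination path.
        self._shutdown.set()
        self._thread.join()

holder = ModelHolder(lambda: {"weights": [1, 2, 3]})
holder.wait_until_loaded()
print(holder.model["weights"])  # [1, 2, 3]
holder.shutdown()
```

The key design choice is that `_run` never returns on its own; it blocks on `_shutdown.wait()`, so the thread (and anything pinned to it) lives exactly as long as the server process wants it to.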
Thanks, great to know that the engine supports fp32 LoRA. The model is indeed private; let me share details shortly using the OSS alternative.
Would appreciate any updates on this issue, thanks!
Thanks for the update.
I think the real issue is that once a custom mask is set on the wrapper, it keeps getting used even if a subsequent call doesn't pass one. It seems...
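To illustrate the suspected bug and one possible fix: if the wrapper caches the mask and only overwrites it when a new one is passed, a mask from an earlier call leaks into later calls. Resetting the cached mask on every call avoids that. The class and method names below are hypothetical, not the actual wrapper's API.

```python
class MaskWrapper:
    """Hypothetical wrapper showing how a custom mask set on one call
    can silently stick around for later calls."""

    def __init__(self):
        self._custom_mask = None

    def forward_buggy(self, x, custom_mask=None):
        # Only updates the cache when a mask is passed, so an earlier
        # mask is reused on calls that pass none.
        if custom_mask is not None:
            self._custom_mask = custom_mask
        return (x, self._custom_mask)

    def forward_fixed(self, x, custom_mask=None):
        # Reset the cached mask on every call: a call without a mask
        # really runs without one.
        self._custom_mask = custom_mask
        return (x, self._custom_mask)

buggy = MaskWrapper()
buggy.forward_buggy("a", custom_mask="mask1")
print(buggy.forward_buggy("b")[1])  # mask1 -- stale mask reused

fixed = MaskWrapper()
fixed.forward_fixed("a", custom_mask="mask1")
print(fixed.forward_fixed("b")[1])  # None -- mask cleared as expected
```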
We'll need to pass the right settings in https://github.com/basetenlabs/truss/blob/main/truss/truss_handle.py#L194 (see https://docs.docker.com/config/containers/resource_constraints/).
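For reference, a sketch of the kind of kwargs involved, assuming the container is started via docker-py's `containers.run()` (the helper name and limit values are illustrative, not what truss actually passes today):

```python
def resource_constraint_kwargs(mem_limit="512m", cpus=1.0):
    """Build resource-constraint kwargs accepted by docker-py's
    `containers.run()` (illustrative values, hypothetical helper)."""
    return {
        "mem_limit": mem_limit,                   # hard memory cap
        "nano_cpus": int(cpus * 1_000_000_000),   # CPUs in 1e-9 units
    }

# These would be splatted into the container-run call, e.g.:
#   client.containers.run(image, **resource_constraint_kwargs())
kwargs = resource_constraint_kwargs(mem_limit="1g", cpus=2.0)
print(kwargs["nano_cpus"])  # 2000000000
```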
> What's the expected behavior around the `use_gpu` flag and the new `accelerator` spec?

Good point, this needs more thought. It's unlikely that someone would have the same accelerator locally...