tim-a-davis
+1 Adding this in as an option seems like a no-brainer. The optimum-nvidia team has said they plan to support more models soon.
Hey @Narsil, thanks for the reply. As far as throughput goes, though, the Hugging Face blog claims speeds of 1200 tokens/second on 7-billion-parameter models. I...
Yes, I am also interested in getting support for MPT models. I would love to assist in any way I can.
> did you try --trust-remote-code while running the docker

It's very slow. This model is not supported for sharding at the moment in text-generation-inference.
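For reference, a launch with that flag would look roughly like the sketch below. The image tag, model id, port, and volume are placeholders I am assuming here, not values from this thread:

```bash
# Hypothetical TGI launch; model id, image tag, port, and volume are placeholders.
model=mosaicml/mpt-7b
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --trust-remote-code
```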
> Then try a rudimentary implementation of it: you can use Rust or JS as the router and Python for inference, copy the custom kernels from the repo, modify them...
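For the "Python for inference" part of that suggestion, a minimal sketch might look like the following. It assumes an MPT checkpoint loaded through transformers with trust_remote_code; the model id and generation settings are illustrative choices, not from the thread:

```python
# Minimal sketch of a Python inference backend for an MPT model.
# MODEL_ID and generation parameters are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mosaicml/mpt-7b"  # any MPT variant loads the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # MPT ships custom modeling code, so this is required
    device_map="auto",
)

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Run a single greedy generation pass and return the decoded text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("MosaicML's MPT models are"))
```

A separate router (in Rust or JS, as suggested above) would then batch incoming requests and call into this generation loop.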