tim-a-davis

Results: 5 comments by tim-a-davis

+1. Adding this as an option seems like a no-brainer. The optimum-nvidia team has said they plan to support more models soon.

Hey @Narsil, thanks for the reply. As far as throughput goes, though, the Hugging Face blog claims speeds of 1,200 tokens/second on 7-billion-parameter models. I...
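
For context, that figure comes from the Optimum-NVIDIA announcement. A minimal sketch of the drop-in usage described there, assuming a Llama-2 checkpoint (the library supported only a few architectures at the time, so the model ID and generation settings are illustrative, and the exact API may have changed since):

```python
# Sketch of Optimum-NVIDIA's drop-in replacement for transformers'
# AutoModelForCausalLM; use_fp8 enables the FP8 path behind the quoted
# throughput numbers. Model ID and prompt are illustrative.
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed; gated on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer(["How does FP8 speed up inference?"], return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```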

Yes, I am also interested in getting support for MPT models. I would love to assist in any way I can.

> Did you try --trust-remote-code while running the Docker container?

It's very slow. This model is not supported for sharding at the moment in text-generation-inference.
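
For reference, a sketch of passing that flag through the text-generation-inference container; the image tag, model, and volume path are assumptions, not a recommendation:

```bash
# --trust-remote-code is forwarded to the TGI launcher so it will load
# custom modeling code (e.g., MPT). Model and paths are illustrative.
model=mosaicml/mpt-7b
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --trust-remote-code
```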

> Then try a rudimentary implementation of it: you can use Rust or JS as the router and Python for inference, copy the custom kernels from the repo, modify them...
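
A minimal sketch of the "Python for inference" half of that split, assuming a FastAPI endpoint the router would call and an MPT-style checkpoint that needs trust_remote_code (model ID, route, and field names are all illustrative):

```python
# Minimal inference worker: the Rust/JS router would POST prompts here.
# trust_remote_code=True lets transformers load MPT's custom modeling code.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mosaicml/mpt-7b"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

app = FastAPI()

class GenerateRequest(BaseModel):
    inputs: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    ids = tokenizer(req.inputs, return_tensors="pt").to(model.device)
    output = model.generate(**ids, max_new_tokens=req.max_new_tokens)
    return {"generated_text": tokenizer.decode(output[0], skip_special_tokens=True)}
```

Run with `uvicorn server:app` (assuming the file is saved as server.py); the custom-kernel work the quoted comment mentions would then replace the stock attention inside the loaded model.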