Nicolas Patry


Hi @SinanAkkoyun , Whenever we get our hands on some H100 and can debug code for it, we will likely implement.

Thanks for the proposal, it's very kind. We don't need the financial support though. Cheers.

Everything has actually been working for quite a while. Closing this.

Closing this issue then! Thanks for sharing @zTaoplus

If that works, it's likely to kill throughput... Batching is how we get throughput.
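To see why batching matters, here is a toy sketch (not TGI code; the cost constants are hypothetical): each forward pass has a fixed cost regardless of batch size, because the model weights have to be read once per pass, plus a small marginal cost per sequence. Serving requests one by one pays the fixed cost for every request; batching amortizes it.

```python
# Toy illustration of batching vs throughput (hypothetical numbers, not TGI internals).

FIXED_COST_MS = 10.0   # assumed fixed cost of one forward pass (weight loading etc.)
PER_ITEM_MS = 0.5      # assumed marginal cost per sequence in the batch

def time_to_serve(n_requests: int, batch_size: int) -> float:
    """Total time (ms) to serve n_requests at a given batch size."""
    passes = -(-n_requests // batch_size)  # ceiling division: number of forward passes
    return passes * FIXED_COST_MS + n_requests * PER_ITEM_MS

one_by_one = time_to_serve(64, batch_size=1)   # 64 passes, fixed cost paid 64 times
batched = time_to_serve(64, batch_size=32)     # 2 passes, fixed cost paid twice
print(one_by_one, batched)
```

Under these assumed numbers, batching serves the same 64 requests more than ten times faster, which is the throughput the comment refers to.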

PR https://github.com/huggingface/text-generation-inference/pull/514 should help run MPT models on TGI. It doesn't use flash attention (yet) because that would require forking and extending the flash attention kernel.

> I kind of new to this, but what are TGI images?

We mean the Docker images recommended to run this project (they make everything smoother to use): https://github.com/huggingface/text-generation-inference#get-started

It's not a target and seems unlikely given the amount of CUDA-specific kernels, but I know barely anything about those right now.

AFAIK, embeddings usually use very different models and have very different properties. Including something here therefore doesn't make a whole lot of sense. `sentence-transformers` (https://www.sbert.net/) is the standard way to...

Code complexity for something related to embeddings should be... MUCH smaller (there's no decode, no past key values, no paged attention). I think flash attention would be the main asset...
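The structural difference can be sketched in a few lines (a hypothetical toy, not TGI internals): a generation server runs an autoregressive decode loop and must carry past key/values per sequence across steps, while an embedding server does one forward pass and returns a vector, with no loop and no cache.

```python
# Toy contrast between generation and embedding serving (hypothetical sketch).

def generate(prompt_tokens, steps, forward):
    """Autoregressive decode: one forward call per new token, carrying a cache
    (stand-in for past key/values)."""
    tokens = list(prompt_tokens)
    cache = None
    for _ in range(steps):
        next_token, cache = forward(tokens[-1], cache)
        tokens.append(next_token)
    return tokens

def embed(prompt_tokens, encode):
    """Embedding: a single forward pass, no decode loop, no cache to manage."""
    return encode(prompt_tokens)

# Dummy model functions so the sketch runs: "predict" the next integer,
# and "embed" as the mean of the token ids.
toy_forward = lambda tok, cache: (tok + 1, (cache or 0) + 1)
toy_encode = lambda toks: [sum(toks) / len(toks)]

print(generate([1, 2], steps=3, forward=toy_forward))  # [1, 2, 3, 4, 5]
print(embed([1, 2, 3], encode=toy_encode))             # [2.0]
```

All the serving machinery tied to the decode loop (KV-cache management, paged attention, continuous batching of in-flight sequences) simply has no counterpart in the embedding path, which is why the comment expects the embedding code to be much smaller.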