Nicolas Patry


Hi @SinanAkkoyun , Whenever we get our hands on some H100 and can debug code for it, we will likely implement.

Thanks for the proposal, it's very kind. We don't need the financial support though. Cheers.

Everything has actually been working for quite a while. Closing this.

Closing this issue then! Thanks for sharing @zTaoplus

If that works, it's likely to kill throughput... Batching is how we get throughput.
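To see why batching matters, here is a toy sketch (not TGI code; the cost constants are hypothetical): each forward pass has a fixed cost regardless of batch size, because the model weights have to be read once per pass, plus a small marginal cost per sequence. Serving requests one by one pays the fixed cost for every request; batching amortizes it.

```python
# Toy illustration of batching vs throughput (hypothetical numbers, not TGI internals).

FIXED_COST_MS = 10.0   # assumed fixed cost of one forward pass (weight loading etc.)
PER_ITEM_MS = 0.5      # assumed marginal cost per sequence in the batch

def time_to_serve(n_requests: int, batch_size: int) -> float:
    """Total time (ms) to serve n_requests at a given batch size."""
    passes = -(-n_requests // batch_size)  # ceiling division: number of forward passes
    return passes * FIXED_COST_MS + n_requests * PER_ITEM_MS

one_by_one = time_to_serve(64, batch_size=1)   # 64 passes, fixed cost paid 64 times
batched = time_to_serve(64, batch_size=32)     # 2 passes, fixed cost paid twice
print(one_by_one, batched)
```

Under these assumed numbers, batching serves the same 64 requests more than ten times faster, which is the throughput the comment refers to.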

PR https://github.com/huggingface/text-generation-inference/pull/514 should help run MPT models on TGI. It doesn't use flash attention (yet) because that would require forking and extending the flash attention kernel.

> I kind of new to this, but what are TGI images?

We mean the Docker images recommended to run this project (they make everything smoother to use): https://github.com/huggingface/text-generation-inference#get-started

It's not a target and seems unlikely given the amount of CUDA-specific kernels, but I know barely anything about those right now.

AFAIK, embeddings usually use very different models and have very different properties. Including something here therefore doesn't make a whole lot of sense. `sentence-transformers` (https://www.sbert.net/) is the standard way to...

Code complexity for something related to embeddings should be... MUCH smaller (there's no decode, no past key values, no paged attention). I think flash attention would be the main asset...
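The structural difference can be sketched in a few lines (a hypothetical toy, not TGI internals): a generation server runs an autoregressive decode loop and must carry past key/values per sequence across steps, while an embedding server does one forward pass and returns a vector, with no loop and no cache.

```python
# Toy contrast between generation and embedding serving (hypothetical sketch).

def generate(prompt_tokens, steps, forward):
    """Autoregressive decode: one forward call per new token, carrying a cache
    (stand-in for past key/values)."""
    tokens = list(prompt_tokens)
    cache = None
    for _ in range(steps):
        next_token, cache = forward(tokens[-1], cache)
        tokens.append(next_token)
    return tokens

def embed(prompt_tokens, encode):
    """Embedding: a single forward pass, no decode loop, no cache to manage."""
    return encode(prompt_tokens)

# Dummy model functions so the sketch runs: "predict" the next integer,
# and "embed" as the mean of the token ids.
toy_forward = lambda tok, cache: (tok + 1, (cache or 0) + 1)
toy_encode = lambda toks: [sum(toks) / len(toks)]

print(generate([1, 2], steps=3, forward=toy_forward))  # [1, 2, 3, 4, 5]
print(embed([1, 2, 3], encode=toy_encode))             # [2.0]
```

All the serving machinery tied to the decode loop (KV-cache management, paged attention, continuous batching of in-flight sequences) simply has no counterpart in the embedding path, which is why the comment expects the embedding code to be much smaller.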