ArnaudHureaux
@dcbark01 Is there an API/repo for embeddings similar to "huggingface/text-generation-inference"? I didn't find it.
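For context, a minimal sketch of what calling such an embedding server would look like, assuming a TEI-style `POST /embed` route on `localhost:8080` (the URL and payload shape are assumptions, not a confirmed API for any specific repo):

```python
# Sketch: querying a text-embedding server over HTTP.
# Assumes a TEI-style POST /embed route on localhost:8080 that takes
# {"inputs": ...} and returns one embedding vector per input.
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is deep learning?"},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()[0]  # first (and only) embedding vector
print(len(embedding), embedding[:5])
```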
No explanations? Could it be a lack of GPU memory? Or is it because I increased `--max-model-len` to 32768?
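A back-of-the-envelope sketch of why raising `--max-model-len` can indeed exhaust GPU memory; the dimensions below assume a Llama-2-7B-style model in fp16 (swap in your model's actual config):

```python
# KV-cache sizing: why a large --max-model-len can exhaust GPU memory.
# Assumed Llama-2-7B-style dimensions in fp16; adjust to your model.
num_layers    = 32     # transformer blocks
num_kv_heads  = 32     # KV heads (assumes no grouped-query attention)
head_dim      = 128    # per-head dimension
dtype_bytes   = 2      # fp16
max_model_len = 32768

# Each token stores one K and one V vector per layer and per head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_gib = bytes_per_token * max_model_len / 1024**3
print(f"KV cache for one {max_model_len}-token sequence: {kv_gib:.1f} GiB")
# -> ~16 GiB, on top of the ~13 GiB of fp16 weights, before any batching.
```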
Ah ok, thanks @Narsil. The authentication is required only for the download, right? And is it for every model? How can I know in advance if I...
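One way to check in advance whether a repo is gated (i.e. needs an accepted license plus an auth token to download), using `huggingface_hub`; the repo id below is just an example, and fully private repos will raise an error instead of answering:

```python
# Sketch: check whether a model repo is gated before trying to download it.
from huggingface_hub import model_info

info = model_info("meta-llama/Llama-2-7b-hf")  # example repo id
print(info.gated)  # False for open repos, "auto"/"manual" for gated ones
```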
Thanks a lot @Narsil
Will the number and the quality of the GPUs used influence `--max-concurrent-requests`? If I'm using 8 A100 GPUs, can I have a bigger `--max-concurrent-requests` than if...
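As I understand it, the flag itself is just a queue limit you set on the launcher; what more or bigger GPUs change is how large a batch the server can actually serve, so they make a higher limit affordable. A rough sketch for probing where a running TGI server starts rejecting requests (it answers 429 when over `--max-concurrent-requests`); the URL and worker count are assumptions:

```python
# Sketch: fire parallel requests at a running TGI server and count how
# many get through vs. how many hit the 429 concurrency limit.
# Assumes TGI at localhost:8080; payload follows TGI's /generate API.
import concurrent.futures
import requests

def one_request(i: int) -> int:
    r = requests.post(
        "http://localhost:8080/generate",
        json={"inputs": "Hello", "parameters": {"max_new_tokens": 16}},
        timeout=120,
    )
    return r.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    codes = list(pool.map(one_request, range(64)))

print({code: codes.count(code) for code in set(codes)})  # e.g. {200: 48, 429: 16}
```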
Thanks @Narsil for your answer, but I have 3 questions: 1. How can I downgrade protobuf? Where do I have to put those lines of code in the repo...
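For question 1, the downgrade doesn't go anywhere in the repo; it's a pip install in the environment the server runs in. A sketch, where the pinned version `3.20.3` is an assumption (use whatever version the error message asks for):

```python
# Sketch: pin protobuf to an older release from a script or notebook.
# Running `pip install "protobuf==3.20.3"` in a shell does the same thing.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "protobuf==3.20.3"])
```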
It is true that it's 2 time faster than TGI ?
> Not in latency (depends on the benchmark/hardware, but it is basically on par).
>
> PagedAttention seems to be nicer with respect to VRAM usage meaning it's better when...
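The distinction in the quote is measurable: a lone request shows latency, many at once show throughput, and VRAM savings mostly move the second number. A sketch, assuming a TGI-style `/generate` endpoint on `localhost:8080`:

```python
# Sketch: measure single-request latency, then aggregate throughput.
import concurrent.futures
import time
import requests

URL = "http://localhost:8080/generate"
PAYLOAD = {"inputs": "Hello", "parameters": {"max_new_tokens": 64}}

def call(_=None) -> None:
    requests.post(URL, json=PAYLOAD, timeout=300).raise_for_status()

t0 = time.perf_counter()
call()
print(f"single-request latency: {time.perf_counter() - t0:.2f}s")

n = 32  # concurrent clients (assumption; raise until throughput flattens)
t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(call, range(n)))
elapsed = time.perf_counter() - t0
print(f"{n} concurrent requests in {elapsed:.2f}s -> {n / elapsed:.1f} req/s")
```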
> This plus QLoRA (esp if it can be combined with higher-context attention fixes e.g. Landmark, FlashAttention, ALiBi) would be huge for all sorts of things, but esp. CoT/ToT reasoning,...
In my case, the answer was totally random, with messages like "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..." ?? I didn't have this behavior with other implementations, so I think that the...
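A debugging sketch for output like that: request greedy decoding so sampling settings can be ruled out first. If the output is still random tokens, the problem is more likely on the weights/sharding side than in the request. Assumes a TGI-style `/generate` endpoint:

```python
# Sketch: rule out sampling as the cause of garbage output by forcing
# greedy decoding on a prompt with an obvious continuation.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The capital of France is",
        "parameters": {"do_sample": False, "max_new_tokens": 8},
    },
    timeout=60,
)
print(resp.json()["generated_text"])  # anything but " Paris..." is suspicious
```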