ArnaudHureaux
@dcbark01 Is there an API/repo for embeddings similar to "huggingface/text-generation-inference"? I didn't find it.
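For context, a minimal sketch of what calling such an embedding server would look like, assuming a TEI-style `POST /embed` route on `localhost:8080` (the URL and payload shape are assumptions, not a confirmed API for any specific repo):

```python
# Sketch: querying a text-embedding server over HTTP.
# Assumes a TEI-style POST /embed route on localhost:8080 that takes
# {"inputs": ...} and returns one embedding vector per input.
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is deep learning?"},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()[0]  # first (and only) embedding vector
print(len(embedding), embedding[:5])
```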
No explanations? Could it be a lack of GPU memory? Or is it because I increased `--max-model-len` to 32768?
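A back-of-the-envelope sketch of why raising `--max-model-len` can indeed exhaust GPU memory; the dimensions below assume a Llama-2-7B-style model in fp16 (swap in your model's actual config):

```python
# KV-cache sizing: why a large --max-model-len can exhaust GPU memory.
# Assumed Llama-2-7B-style dimensions in fp16; adjust to your model.
num_layers    = 32     # transformer blocks
num_kv_heads  = 32     # KV heads (assumes no grouped-query attention)
head_dim      = 128    # per-head dimension
dtype_bytes   = 2      # fp16
max_model_len = 32768

# Each token stores one K and one V vector per layer and per head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_gib = bytes_per_token * max_model_len / 1024**3
print(f"KV cache for one {max_model_len}-token sequence: {kv_gib:.1f} GiB")
# -> ~16 GiB, on top of the ~13 GiB of fp16 weights, before any batching.
```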
Ah ok, thanks @Narsil. The authentication is required only for the download, right? And is it for every model? How can I know in advance if I...
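One way to check in advance whether a repo is gated (i.e. needs an accepted license plus an auth token to download), using `huggingface_hub`; the repo id below is just an example, and fully private repos will raise an error instead of answering:

```python
# Sketch: check whether a model repo is gated before trying to download it.
from huggingface_hub import model_info

info = model_info("meta-llama/Llama-2-7b-hf")  # example repo id
print(info.gated)  # False for open repos, "auto"/"manual" for gated ones
```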
Thanks a lot @Narsil
Will the number and the quality of the GPUs used influence `--max-concurrent-requests`? If I'm using 8 A100 GPUs, can I have a bigger `--max-concurrent-requests` than if...
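As I understand it, the flag itself is just a queue limit you set on the launcher; what more or bigger GPUs change is how large a batch the server can actually serve, so they make a higher limit affordable. A rough sketch for probing where a running TGI server starts rejecting requests (it answers 429 when over `--max-concurrent-requests`); the URL and worker count are assumptions:

```python
# Sketch: fire parallel requests at a running TGI server and count how
# many get through vs. how many hit the 429 concurrency limit.
# Assumes TGI at localhost:8080; payload follows TGI's /generate API.
import concurrent.futures
import requests

def one_request(i: int) -> int:
    r = requests.post(
        "http://localhost:8080/generate",
        json={"inputs": "Hello", "parameters": {"max_new_tokens": 16}},
        timeout=120,
    )
    return r.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    codes = list(pool.map(one_request, range(64)))

print({code: codes.count(code) for code in set(codes)})  # e.g. {200: 48, 429: 16}
```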
Thanks @Narsil for your answer, but I have 3 questions: 1. How can I downgrade protobuf? Where do I have to put those lines of code in the repo...
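For question 1, the downgrade doesn't go anywhere in the repo; it's a pip install in the environment the server runs in. A sketch, where the pinned version `3.20.3` is an assumption (use whatever version the error message asks for):

```python
# Sketch: pin protobuf to an older release from a script or notebook.
# Running `pip install "protobuf==3.20.3"` in a shell does the same thing.
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "protobuf==3.20.3"])
```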
It is true that it's 2 time faster than TGI ?
> Not in latency (depends on the benchmark/hardware, but it is basically on par).
>
> PagedAttention seems to be nicer with respect to VRAM usage meaning it's better when...
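The distinction in the quote is measurable: a lone request shows latency, many at once show throughput, and VRAM savings mostly move the second number. A sketch, assuming a TGI-style `/generate` endpoint on `localhost:8080`:

```python
# Sketch: measure single-request latency, then aggregate throughput.
import concurrent.futures
import time
import requests

URL = "http://localhost:8080/generate"
PAYLOAD = {"inputs": "Hello", "parameters": {"max_new_tokens": 64}}

def call(_=None) -> None:
    requests.post(URL, json=PAYLOAD, timeout=300).raise_for_status()

t0 = time.perf_counter()
call()
print(f"single-request latency: {time.perf_counter() - t0:.2f}s")

n = 32  # concurrent clients (assumption; raise until throughput flattens)
t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
    list(pool.map(call, range(n)))
elapsed = time.perf_counter() - t0
print(f"{n} concurrent requests in {elapsed:.2f}s -> {n / elapsed:.1f} req/s")
```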
> This plus QLoRA (esp if it can be combined with higher-context attention fixes e.g. Landmark, FlashAttention, ALiBi) would be huge for all sorts of things, but esp. CoT/ToT reasoning,...
In my case, the answer was totally random, with messages like "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ..." ?? I didn't have this behavior with other implementations, so I think that the...
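A debugging sketch for output like that: request greedy decoding so sampling settings can be ruled out first. If the output is still random tokens, the problem is more likely on the weights/sharding side than in the request. Assumes a TGI-style `/generate` endpoint:

```python
# Sketch: rule out sampling as the cause of garbage output by forcing
# greedy decoding on a prompt with an obvious continuation.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "The capital of France is",
        "parameters": {"do_sample": False, "max_new_tokens": 8},
    },
    timeout=60,
)
print(resp.json()["generated_text"])  # anything but " Paris..." is suspicious
```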