ArnaudHureaux

Results: 10 comments of ArnaudHureaux

@dcbark01 is there an API/repo for embeddings similar to "huggingface/text-generation-inference"? I didn't find it.
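For context, here is a minimal sketch of what querying a self-hosted embedding endpoint over HTTP could look like; the URL, the `/embed` route and the payload shape are assumptions, not a confirmed text-generation-inference API, so adapt them to whichever embedding server is actually used:

```python
# Hedged sketch: client call to an assumed self-hosted embedding endpoint.
# Route and payload are assumptions; adjust to the server you deploy.
import requests

resp = requests.post(
    "http://localhost:8080/embed",               # assumed local deployment
    json={"inputs": "What is Deep Learning?"},   # assumed request shape
)
resp.raise_for_status()
embedding = resp.json()
print(type(embedding), "returned by the endpoint")
```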

No explanation? Could it be a lack of GPU memory? Or is it because I increased the `--max-model-len` to 32768?
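A back-of-the-envelope KV-cache estimate suggests why raising `--max-model-len` to 32768 can exhaust GPU memory on its own; the layer/head counts below are typical Llama-7B-style values and are assumptions, so substitute the real model config:

```python
# Rough KV-cache footprint for a single sequence at the configured max length.
# bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes * seq_len
def kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                   seq_len=32768, dtype_bytes=2):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

gib = kv_cache_bytes() / 1024**3
print(f"~{gib:.1f} GiB of KV cache for one 32768-token sequence")  # ~16 GiB
```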

Ah ok, thanks @Narsil, and the authentication is required only for the download, right? And is it required for every model? How can I know in advance if I...
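One way to know in advance whether a model needs authentication is to check its gated flag on the Hub; this is a sketch using `huggingface_hub`, with an example model id:

```python
# Hedged sketch: check whether a repo is gated before trying to deploy it.
# The model id is only an example.
from huggingface_hub import model_info

info = model_info("meta-llama/Llama-2-7b-hf")
print("gated:", info.gated)  # False, or "auto"/"manual" for license-gated repos
# If it is gated, the server needs a token at download time, e.g. by passing
# the HUGGING_FACE_HUB_TOKEN environment variable to the container.
```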

Will the number and quality of the GPUs used influence the value of `--max-concurrent-requests`? If I'm using 8 A100 GPUs, can I have a bigger `--max-concurrent-requests` than if...
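A rough sketch of why more (or larger) GPUs allow more concurrency: once the weights are sharded, the remaining VRAM bounds how much KV cache, and therefore how many in-flight sequences, fit at once. All numbers below are illustrative assumptions (8x A100 80GB, a ~70B fp16 model, 4096 tokens per request, ~0.5 MiB of KV cache per token):

```python
# Illustrative capacity estimate, not how the server computes its limit.
num_gpus, gpu_gib = 8, 80
weights_gib = 140                  # assumed: ~70B parameters in fp16
kv_mib_per_token = 0.5             # assumed per-token KV-cache size
tokens_per_request = 4096          # assumed average sequence length

free_gib = num_gpus * gpu_gib - weights_gib
requests_in_vram = free_gib * 1024 / (kv_mib_per_token * tokens_per_request)
print(f"roughly {requests_in_vram:.0f} concurrent requests fit in KV cache")
```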

Thanks @Narsil for your answer, but I have 3 questions: 1. How can I downgrade protobuf? Where do I have to put those lines of code in the repo...
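On the protobuf question, the usual approach is to pin it below 4.x in the environment (not in the repo's source files) and then verify which version the interpreter actually picks up; a minimal check, assuming pip is used:

```python
# Assumed fix: run `pip install "protobuf==3.20.*"` in the environment,
# then confirm the downgraded version is the one being imported.
import google.protobuf

print(google.protobuf.__version__)
assert google.protobuf.__version__.startswith("3."), "protobuf was not downgraded"
```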

Is it true that it's 2 times faster than TGI?

> Not in latency (Depends on the benchmark/hardware, but it is basically on par).
>
> PagedAttention seems to be nicer with respect to VRAM usage meaning it's better when...
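A tiny illustration of the VRAM point: a server that reserves max-length KV cache for every request wastes most of it on short sequences, while paged allocation only commits the blocks actually used. The per-token size and block size below are assumptions:

```python
# Naive reservation vs block-paged allocation for one short request.
mib_per_token, block_tokens = 0.5, 16     # assumed sizes
max_len, actual_len = 32768, 900          # configured max vs tokens actually used

naive = max_len * mib_per_token
paged = -(-actual_len // block_tokens) * block_tokens * mib_per_token  # ceil to blocks
print(f"naive reservation: {naive / 1024:.1f} GiB, paged: {paged:.0f} MiB")
```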

> This plus QLoRA (esp if it can be combined with higher-context attention fixes e.g. Landmark, FlashAttention, ALiBi) would be huge for all sorts of things, but esp. CoT/ToT reasoning,...

In my case, the answer was totally random, with messages like "był AbramsPlayEvent磨}$,ocempreferred LaceKUZOOOoodlesWCHawaiiVEsecured cardvue ...". I didn't have this behavior on other implementations, so I think that the...
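A hedged sanity check for this kind of gibberish: load the same checkpoint directly with transformers and greedy decoding; if that output is sane, the problem is more likely in the serving setup (dtype, sharding, revision) than in the weights. The model id below is only an example:

```python
# Sanity check: generate with the raw checkpoint and no sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example id, replace with the served model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```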