Hugo Larcher

6 comments by Hugo Larcher

Hello! There is a mix-up of URLs here. If you want to access GRA object storage via the S3 protocol, you should follow the doc here: https://docs.ovh.com/gb/en/storage/getting_started_with_the_swift_S3_API/ If you are using...

Hey @QLutz, I suspect it may be related to #2099. Can you try to run TGI with `--cuda-graphs 0` to see if you still see the hang?
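For reference, a minimal sketch of launching TGI with CUDA graphs disabled; the model id and port are placeholders, only the `--cuda-graphs 0` flag is the point here:

```shell
# Disable CUDA graph capture entirely to rule out a graph-related hang.
# Model id, port, and volume are illustrative placeholders.
docker run --gpus all -p 8080:80 \
    -v "$PWD/data:/data" \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --cuda-graphs 0
```

If the hang disappears with this flag, that points at the CUDA-graph path rather than the model or hardware.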

> One small comment is that this is still quite gnarly for others to use and run on their own machines. And to be honest, that's also because some things (like k6-sse)...

Hi @snps-ravinu, thanks for your feedback! For CPU and memory utilization, it is probably better to use the metrics from the container runtime (if using k8s, that would be `metrics-server`...
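As a quick sketch of what reading those runtime metrics looks like (pod and namespace names are placeholders; assumes `metrics-server` is installed in the cluster):

```shell
# Per-container CPU/memory as reported by metrics-server.
# "tgi-0" and "inference" are hypothetical pod/namespace names.
kubectl top pod tgi-0 --namespace inference --containers
```

These numbers come from the kubelet's resource accounting, so they reflect what the container actually consumes rather than what the process reports about itself.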

Hello @rooooc! Your issue probably relates to not setting `max-batch-total-tokens` (https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#maxbatchtotaltokens). By setting different values for `max-total-tokens` and `max-batch-prefill-tokens`, you are not controlling the max tokens that can be batched...

@rooooc, you should be able to reduce `max-batch-total-tokens` until you reach an acceptable value for your GPU memory. As stated in the doc: > Overall this number should be...
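A minimal sketch of setting the cap explicitly; the model id and the token values are illustrative, not recommendations — the idea is to lower `--max-batch-total-tokens` step by step until the server starts without running out of GPU memory:

```shell
# Cap the total tokens across the whole batch explicitly.
# Values here are placeholders; tune them down until the model fits.
text-generation-launcher \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4096 \
    --max-batch-total-tokens 16384
```

Note that `max-total-tokens` bounds a single request while `max-batch-total-tokens` bounds the sum over all concurrent requests, which is why setting only the former does not control batch-level memory use.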