OlivierDehaene

Results: 119 comments of OlivierDehaene

Ok then I'm not sure there is a lot that can be done here besides adding some documentation to explain this issue in the README/docs.

Setting these values correctly would be really hard since they are MKL/runtime specific. Plus, they should be set before execution, so this implies creating a launching script above the TEI...
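
For illustration, a minimal launcher sketch in Rust that sets thread-related variables before spawning the server. The binary name, flags, and variable names/values here are assumptions for the example, not recommendations; the right values depend on the MKL/OpenMP runtime actually linked into the binary.

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Hypothetical launcher: set runtime env vars, then exec the real server.
    // Variable names and values below are placeholders only.
    let status = Command::new("text-embeddings-router") // assumed binary name
        .env("OMP_NUM_THREADS", "1")
        .env("MKL_NUM_THREADS", "1")
        .env("MKL_DYNAMIC", "FALSE")
        .args(["--model-id", "some-org/some-embedding-model"]) // placeholder
        .status()?;
    std::process::exit(status.code().unwrap_or(1));
}
```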

Just to chime in on the discussion: at Hugging Face we developed a dedicated backend to handle dynamic batching for the LLMs hosted on our platform: [text-generation-inference](https://github.com/huggingface/text-generation-inference). This backend works by...
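
As a rough sketch of the dynamic batching idea (the actual text-generation-inference scheduler is far more involved; `Request`, `batching_loop`, and `max_batch_size` below are made-up names for this example): requests are pushed onto a queue, and a loop drains whatever is waiting and runs it as a single batched forward pass instead of serving requests one by one.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::time::Duration;

// Hypothetical request type: a prompt plus a channel to send the result back.
struct Request {
    prompt: String,
    response: Sender<String>,
}

// Simplified batching loop: block for the first request, then opportunistically
// grab any requests already queued, up to `max_batch_size`.
fn batching_loop(queue: Receiver<Request>, max_batch_size: usize) {
    loop {
        let first = match queue.recv() {
            Ok(r) => r,
            Err(_) => return, // all senders dropped, shut down
        };
        let mut batch = vec![first];
        while batch.len() < max_batch_size {
            match queue.recv_timeout(Duration::from_millis(1)) {
                Ok(r) => batch.push(r),
                Err(_) => break,
            }
        }
        // Placeholder for the actual batched model call.
        for req in batch {
            let _ = req.response.send(format!("generated for: {}", req.prompt));
        }
    }
}
```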

Yes, this is very close to what I had in mind! If you are ok with this, I will continue your work in a fork of your fork tomorrow and...

[This is roughly what I had in mind](https://github.com/huggingface/text-generation-inference/pull/36). There are still some things to iron out. One question I have is regarding the API. I think the only events we...

@yk, I'm done with my implementation [here](https://github.com/huggingface/text-generation-inference/pull/36). Does the following SSE event signature cover your use case?

```rust
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

struct StreamResponse {...
```
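
For concreteness, here is a small, self-contained sketch of how such a `Details` payload could end up on an SSE `data:` line. It assumes `serde` and `serde_json` as dependencies, and the field values are made up for the example; the exact struct in the PR may differ.

```rust
use serde::Serialize;

// Simplified copy of the struct above, made serializable for the example.
#[derive(Serialize)]
struct Details {
    finish_reason: String,
    generated_tokens: u32,
    seed: Option<u64>,
}

fn main() {
    let details = Details {
        finish_reason: "length".to_string(),
        generated_tokens: 20,
        seed: Some(42),
    };
    // An SSE event is a `data:` line followed by a blank line.
    println!("data: {}\n", serde_json::to_string(&details).unwrap());
}
```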

Contrastive search is not supported by text-generation-inference.

Nice! I think that makes a lot more sense than the current naive algorithm and is easier to represent mentally. I need to think about your implementation and maybe play...

> In my experience, 8bit is around 8x slower compared to fp16

Yes, bitsandbytes adds a lot of CPU overhead and its kernels are slower than the native ones. It...

In what setting is this called twice? In my opinion, this should crash.