OlivierDehaene

Results: 119 comments by OlivierDehaene

That's very interesting, thanks for sharing. It seems to be a bug in the code that generates the traceID. We have never run into this issue on our side (since we...

Can you elaborate on the issues you faced while running text-generation-inference? We have been running it since October in EKS, AzureML (which uses k8s as a backend) and K3s, and...

@sam-h-bean, we can put you in contact with our experts in our [Expert Acceleration Program](https://huggingface.co/support) if you need any help setting up text-generation-inference in your production environment.

You can use the benchmarking tool to make sure that you don't OOM at a given setting, and then use those settings as the maximum values in the launcher.

All GPU operations are asynchronous, so this difference may be an artifact of that. See https://pytorch.org/tutorials/recipes/recipes/benchmark.html#pytorch-benchmark for example. It's expensive because in this case the tensor is of size `[N_TOKENS,...
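To illustrate the point about asynchronous GPU execution, here is a minimal sketch (not the text-generation-inference code itself) of timing a CUDA op correctly, in the spirit of the linked PyTorch benchmark recipe. The tensor shape and the matmul workload are illustrative assumptions only, and it assumes a CUDA device is available.

```python
import time

import torch
import torch.utils.benchmark as benchmark

x = torch.randn(4096, 4096, device="cuda")

# Naive wall-clock timing is misleading: CUDA kernels are launched
# asynchronously, so measuring around the call may only capture the
# launch overhead. torch.utils.benchmark.Timer synchronizes for you.
t = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
)
print(t.timeit(100))

# Equivalent manual approach: synchronize before reading the clock,
# so the measurement includes the actual kernel execution time.
torch.cuda.synchronize()
start = time.perf_counter()
y = x @ x
torch.cuda.synchronize()
print(f"elapsed: {time.perf_counter() - start:.4f}s")
```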

This should not have happened and is fixed in v1.2.2. The Dockerfile was incorrectly updated.

@jtourille, could you share your solution with the community here? Cheers!

> Sometimes it happens, sometimes it doesn't happen. It may be a problem with my operating environment.

On the same server?

What's your use case for these models? Their throughput is so low and the costs so prohibitive that I don't see any.

Super interesting! Can you try:

```yaml
version: '3.4'
services:
  multiminilml12v2:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
    environment:
      - MODEL_ID=sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
      - NVIDIA_DISABLE_REQUIRE=1
      - RUST_BACKTRACE=full
      - JSON_OUTPUT=true
      - PORT=18083
      - MAX_BATCH_TOKENS=65536
      - MAX_CLIENT_BATCH_SIZE=1024 # interesting variables...
```