OlivierDehaene

Results: 62 comments of OlivierDehaene

It would be pretty easy to support arrays like we do in TEI. Just push all requests into the internal queue and wait. But I feel that the client would...
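
For illustration, a minimal asyncio sketch of that queue-based fan-out (hypothetical names; the real router code is not shown here): every item of the array becomes its own queued request with a per-item future, and the handler waits for all of them.

```
import asyncio

async def handle_array_request(inputs, queue):
    # Fan each array item out as its own queued request, then wait for
    # every per-item future to resolve before answering the client.
    loop = asyncio.get_running_loop()
    futures = []
    for text in inputs:
        fut = loop.create_future()
        await queue.put((text, fut))
        futures.append(fut)
    return await asyncio.gather(*futures)

async def fake_worker(queue):
    # Stand-in for the model loop that drains the internal queue.
    while True:
        text, fut = await queue.get()
        fut.set_result(f"embedding({text})")

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(fake_worker(queue))
    print(await handle_array_request(["a", "b", "c"], queue))
    worker.cancel()

asyncio.run(main())
```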

I think the problem is your CUDA driver version:
```
2024-03-04 10:54:22 /opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version...
```
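
As a quick check (my suggestion, not part of the original reply), you can compare the CUDA version the installed PyTorch wheel was built for against the driver that `nvidia-smi` reports:

```
import subprocess
import torch

# CUDA version this PyTorch build expects, and whether the driver can serve it.
print("torch built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())  # False when the driver is too old

# Driver version as reported by the NVIDIA driver itself.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip())
```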

The model loaded on CPU for some reason: `"model_device_type": "cpu"` in the info. Can you run `docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard...
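
For reference, `model_device_type` comes from the server's `/info` endpoint. A quick way to verify where the model actually landed (assuming the default port mapping from the command above):

```
import requests

# Query the running TGI server; "cpu" here means the GPUs were not picked up.
info = requests.get("http://localhost:8080/info").json()
print(info["model_device_type"])  # expected: "cuda"
```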

I agree that this would be nice, and a tokenization refactor is long overdue. We will think about it in Q4.

Increasing the timeout will only make you crash later; it will not fix the issue. When do you hit this issue?

Usually NCCL times out because one of the shards OOMs and the other shards end up waiting indefinitely for the OOMed shard. Can you check if this is what is happening...
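
One way to check (a sketch of my own, not from the thread): poll per-GPU memory with `nvidia-smi` while the server runs. The failing shard's GPU climbs to its limit while the others plateau, stuck waiting on the collective.

```
import subprocess
import time

def memory_used_mib():
    # One value per GPU, i.e. per shard in a sharded deployment.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(v) for v in out.strip().splitlines()]

while True:
    print(memory_used_mib())
    time.sleep(1)
```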

What type of server is it? How are the H100s interconnected?

Sorry, I thought this was fixed on the RunPod side. Re-opening.

Yes, that's something that we want to explore.