OlivierDehaene

Results: 62 comments of OlivierDehaene

It would be pretty easy to support arrays like we do in TEI. Just push all requests into the internal queue and wait. But I feel that the client would...
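
For illustration, a minimal asyncio sketch of that queue-based fan-out (hypothetical names; the real router code is not shown here): every item of the array becomes its own queued request with a per-item future, and the handler waits for all of them.

```
import asyncio

async def handle_array_request(inputs, queue):
    # Fan each array item out as its own queued request, then wait for
    # every per-item future to resolve before answering the client.
    loop = asyncio.get_running_loop()
    futures = []
    for text in inputs:
        fut = loop.create_future()
        await queue.put((text, fut))
        futures.append(fut)
    return await asyncio.gather(*futures)

async def fake_worker(queue):
    # Stand-in for the model loop that drains the internal queue.
    while True:
        text, fut = await queue.get()
        fut.set_result(f"embedding({text})")

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(fake_worker(queue))
    print(await handle_array_request(["a", "b", "c"], queue))
    worker.cancel()

asyncio.run(main())
```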

I think the problem is your CUDA driver version:
```
2024-03-04 10:54:22 /opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version...
```
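
As a quick check (my suggestion, not part of the original reply), you can compare the CUDA version the installed PyTorch wheel was built for against the driver that `nvidia-smi` reports:

```
import subprocess
import torch

# CUDA version this PyTorch build expects, and whether the driver can serve it.
print("torch built for CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())  # False when the driver is too old

# Driver version as reported by the NVIDIA driver itself.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    text=True,
).strip())
```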

The model loaded on CPU for some reason: `"model_device_type": "cpu"` in the info. Can you run `docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard...
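
For reference, `model_device_type` comes from the server's `/info` endpoint. A quick way to verify where the model actually landed (assuming the default port mapping from the command above):

```
import requests

# Query the running TGI server; "cpu" here means the GPUs were not picked up.
info = requests.get("http://localhost:8080/info").json()
print(info["model_device_type"])  # expected: "cuda"
```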

I agree that this would be nice, and a tokenization refactor is long overdue. We will think about it in Q4.

Increasing the timeout will only make you crash later; it will not fix the issue. When do you hit this issue?

Usually NCCL times out because one of the shards OOMs and the other shards end up waiting indefinitely for the OOMed shard. Can you check if this is what is happening...
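
One way to check (a sketch of my own, not from the thread): poll per-GPU memory with `nvidia-smi` while the server runs. The failing shard's GPU climbs to its limit while the others plateau, stuck waiting on the collective.

```
import subprocess
import time

def memory_used_mib():
    # One value per GPU, i.e. per shard in a sharded deployment.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(v) for v in out.strip().splitlines()]

while True:
    print(memory_used_mib())
    time.sleep(1)
```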

What type of server is it? How are the H100s interconnected?

Sorry, I thought this was fixed on the RunPod side. Re-opening.

Yes, that's something that we want to explore.