Nicolas Patry
Hi @amtam0, it runs on the API, which explains the first load time. The code to actually run the inference is available, and you can try doing: ``` cd huggingface_hub/api-inference-community ./manage.py...
Sorry, I don't know; I can't really help.
Yes, llama3 has 2 eos tokens: `<|eot_id|>` for the turn token, and a "real" eos_token (not sure when it's used). Currently the config defines `<|end_of_text|>` as the eos token, which is what you're...
Yes it is. And hf-chat sends that stop token currently.
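To make the two-token setup concrete, here is a minimal sketch (assuming the standard `transformers` generate API and that the checkpoint's tokenizer exposes both tokens) of stopping on either of Llama 3's end tokens:

```python
# Minimal sketch, assuming the usual transformers API: pass both Llama 3
# end tokens as eos_token_id so generation stops on whichever comes first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

terminators = [
    tokenizer.eos_token_id,                         # <|end_of_text|>, the config's eos
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),  # the end-of-turn token
]

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, eos_token_id=terminators)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```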
Slow? What do you mean? What hardware, what TP? What is slow in this case?
The frequency penalty is being solved soon: https://github.com/huggingface/text-generation-inference/pull/1765. For the stop token, yes, it's an unfortunate setup; we're fixing it by changing the default in many places (basically there are 2 stop...
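For illustration, a hedged sketch of what a request could look like once that PR lands (assuming a TGI server on localhost:8080; `frequency_penalty` is the field added by the linked PR, and `stop` is TGI's existing stop-sequence parameter):

```python
# Hedged sketch, assuming a local TGI server: frequency_penalty alongside
# an explicit stop sequence in the /generate parameters.
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Tell me a short story.",
        "parameters": {
            "max_new_tokens": 64,
            "frequency_penalty": 0.5,   # added by the PR linked above
            "stop": ["<|eot_id|>"],     # explicit stop sequence
        },
    },
)
print(response.json())
```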
> We are expecting to get the updated docker image by Monday next week. Do you think a TGI release on next week Tuesday/Wednesday with this PR in is feasible?...
Hi @anttttti, it's really hard to create a `handler.py` within TGI itself. The reason is that the current code is a tight loop highly tailored for performance. And as a...
Can you show the command you're using? Also, please share all the logs here; we can't help without concrete information and a way to reproduce.
Hi @LeoDog896, sorry for the long wait on this. I don't understand the need for it. Why do you need a JSON representation of `TensorView`? I'm hesitant...