Hello, when I use MP-mode segmentation, English words that are not in the dictionary get split into single letters. How should I modify things so that out-of-dictionary English words are not broken up into individual letters? Example input sentence: 你好,are you ok? MP-mode segmentation result: 你好 , a r e y o u o k ? Thanks in advance for your help :)
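A common workaround, if the segmenter itself cannot be patched, is to shield runs of ASCII letters from the MP pass. A minimal Python sketch; `mp_cut` here is a hypothetical stand-in for whatever MP-mode cut function the library actually exposes:

```python
import re

# Workaround sketch: protect runs of ASCII letters/digits so the MP segmenter
# only ever sees the non-English spans. `mp_cut` is a hypothetical stand-in
# for the library's MP-mode cut function.
ENGLISH_RUN = re.compile(r"[A-Za-z0-9]+")

def cut_keep_english(text, mp_cut):
    tokens, pos = [], 0
    for m in ENGLISH_RUN.finditer(text):
        if m.start() > pos:                      # segment the non-English span normally
            tokens.extend(mp_cut(text[pos:m.start()]))
        tokens.append(m.group())                 # keep the English word whole
        pos = m.end()
    if pos < len(text):
        tokens.extend(mp_cut(text[pos:]))
    return tokens

# With a trivial character-level cutter standing in for MP mode:
print(cut_keep_english("你好,are you ok?", mp_cut=list))
# ['你', '好', ',', 'are', ' ', 'you', ' ', 'ok', '?']
```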
@mrwyattii Using the latest main branch; the test model is llamav2-7b. When I test single-sentence inference with tp=4 it takes 267.98 s, but with tp=1 it takes 7 s to...
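For reference, a timing comparison like the one described would look roughly like the sketch below; the checkpoint path is a placeholder, and the `init_inference` arguments follow DeepSpeed's inference examples (launch with `deepspeed --num_gpus 4 repro.py`, or `--num_gpus 1` for the tp=1 case):

```python
import time
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# mp_size sets the tensor-parallel degree; compare mp_size=1 vs mp_size=4.
engine = deepspeed.init_inference(
    model, mp_size=4, dtype=torch.float16, replace_with_kernel_inject=True
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
torch.cuda.synchronize()
start = time.time()
outputs = engine.module.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
print(f"latency: {time.time() - start:.2f}s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```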
Great work, thanks for sharing!!! I used the FastChat code together with the apibench/huggingface_train.json data to retrain the llamav2-7b model and got a new model, but the inference result...
As you know, FlashAttention-3 promises ~1.5x improvements. Is there any plan to support it? Thanks! https://github.com/Dao-AILab/flash-attention/commit/7ef24848cf2f855077cef88fe122775b727dcd74
### System Info GPU: NVIDIA A100 Driver Version: 545.23.08 CUDA: 12.3 versions: https://github.com/NVIDIA/TensorRT-LLM.git (5fa9436) (latest version) https://github.com/triton-inference-server/tensorrtllm_backend ([a6aa8eb](https://github.com/triton-inference-server/tensorrtllm_backend/commit/a6aa8eb6ce9371521df166c480e10262cd9c0cf4)) ### Who can help? _No response_ ### Information...
### Your current environment ```text nvidia A100 GPU vllm 0.6.0 ``` ### How would you like to use vllm I want to run inference with an AutoModelForSequenceClassification model. I don't know...
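vLLM's coverage of classification heads has varied by version, so as a baseline it may help to first confirm the model runs under plain Transformers; a minimal sketch, where the checkpoint name is just a placeholder:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to("cuda")

texts = ["this movie was great", "this movie was terrible"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = model(**batch).logits  # shape: [batch_size, num_labels]

probs = logits.softmax(dim=-1)
labels = [model.config.id2label[i] for i in probs.argmax(-1).tolist()]
print(probs, labels)
```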
### Your current environment The output of `python collect_env.py` ```text PyTorch version: 2.5.1+cu124 Is debug build: False CUDA used to build PyTorch: 12.4 ROCM used to build PyTorch: N/A OS:...
## 🐛 Bug A service started from Meta-Llama-3.1-70B-Instruct fp8 crashes under high concurrency. ## To Reproduce ### convert model see this issue: #2982 ### start service...
When I use the **ultra_chat 200k** data (without regenerating the assistant responses from the target model) to train the llama3.1-8b-instruct model, the **training accuracy is only around 35%** and the **loss...
### Checklist - [x] 1. I have searched related issues but cannot get the expected help. - [x] 2. The bug has not been fixed in the latest version. -...