Starrick Liu
**What is your question?** I am a newcomer just starting to learn CuTe and am hoping to implement a gemv in kernel functions in a style similar to CuTe, akin...
Issue 1:
```
python3 uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1
```
After running the above command, training runs out of memory (OOM):...
As the title indicates, on the main branch I sent inference requests from 40 concurrent threads to the in-flight Triton Server, and the server got stuck....
## KV Cache Reuse and Int8 KV Cache Compatibility with Paged Context FMHA

In TensorRT-LLM v0.11, it appears that KV cache reuse and Int8 KV cache cannot be used together....