Starrick Liu issues

Results: 4 issues of Starrick Liu

[QST] How to Use Gemv in a Cuda Kernel

**What is your question?** I am a newcomer just starting to learn CuTe and am hoping to implement a gemv in kernel functions in a style similar to CuTe, akin...

question

inactive-30d

The recommended finetune command for ChatGLM runs out of memory (OOM) on a 3090 24G; the code's default 8-bit quantization also causes OOM

Issue 1: ``` python3 uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \ --data alpaca-belle-cot --lora_target_modules query_key_value \ --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \ --learning_rate 2e-5 --epochs 1 ``` Running the above command causes an OOM during the training stage: ...

Under the main branch, stress testing the in-flight Triton Server with multiple threads can result in the Triton Server getting stuck.

As indicated by the title, on the main branch, I used 40 threads to simultaneously send inference requests to the in-flight Triton Server, resulting in the Triton Server getting stuck....

[Feature Request] Support for kv_reuse with int8_kv_cache in FMHA

## KV Cache Reuse and Int8 KV Cache Compatibility with Paged Context FMHA In TensorRT-LLM v0.11, it appears that KV cache reuse and Int8 KV cache cannot be used together....

feature request

quantization

stale