rtx-8000

Results: 6 comments by rtx-8000

Model: llama-3.3-70b-instruct-awq

LoRA config:

```json
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/u01/app/mlo/models/Llama-3.3-70B-Instruct",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 512,
  ...
```
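
For reference, a minimal sketch (not from the original comment) of how to check the rank and alpha stored with an adapter; it assumes the standard PEFT layout with an `adapter_config.json` file and uses the rank-256 adapter directory quoted in the commands below:

```python
import json
import os

# Path taken from the vllm_infer.py commands in this thread.
adapter_dir = "saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/"

with open(os.path.join(adapter_dir, "adapter_config.json")) as f:
    cfg = json.load(f)

# "r" is the LoRA rank; lora_alpha / r is the effective scaling applied to the adapter output.
print("r:", cfg["r"], "| lora_alpha:", cfg["lora_alpha"], "| scaling:", cfg["lora_alpha"] / cfg["r"])
```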

Hello, I ran more tests setting max_num_seqs to 1. I am now getting worse results for the rank-16 adapter than for the rank-256 one. They both have the same configuration,...
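
A minimal timing sketch (my own, not the author's test setup) that loads the AWQ base model once with vLLM's offline API and times generation with each adapter at max_num_seqs = 1. The rank-16 adapter path, the prompts, and the tensor-parallel size are placeholders; the other settings follow the values quoted in these comments:

```python
import time
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the AWQ base model once with the same engine settings as in the thread.
llm = LLM(
    model="/u01/data/analytics/models/llama-3.3-70b-instruct-awq/",
    enable_lora=True,
    max_lora_rank=256,           # must cover the largest adapter rank being tested
    max_num_seqs=1,              # same request rate as in the tests above
    gpu_memory_utilization=0.6,
    max_model_len=700,
    tensor_parallel_size=2,      # placeholder: adjust to your GPU count
)

prompts = ["..."]                # placeholder: a fixed set of test prompts
params = SamplingParams(max_tokens=256, temperature=0)

adapters = {
    "r16": "saves/llama3.3-70b/fsdp_qlora_aug_tag_r16/sft/",    # hypothetical rank-16 path
    "r256": "saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/",  # path from the commands below
}

# Time generation with each adapter separately to compare end-to-end latency.
for i, (name, path) in enumerate(adapters.items(), start=1):
    start = time.perf_counter()
    llm.generate(prompts, params, lora_request=LoRARequest(name, i, path))
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```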

Thank you for your comment. When running the following command:

```
VLLM_USE_V1="1" python scripts/vllm_infer.py \
    --model_name_or_path /u01/data/analytics/models/llama-3.3-70b-instruct-awq/ \
    --adapter_name_or_path saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/ \
    --dataset sft_dataset_aug_tag \
    --vllm_config "{gpu_memory_utilization: 0.6, max_model_len: 700, max_seq_len_to_capture: 700, max_lora_rank: 256, max_num_seqs: 1}"
```
...

Thank you @jeejeelee. I also noticed something weird: I am testing on 2 datasets, one of which has longer sequences. When I run the model with LoRA on the shorter-sequence...

So what could be wrong, such that I get slower performance for the lower rank, especially at max_num_seqs (request rate) = 1, while there is no such big difference at a higher rate?
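
For what it's worth, a back-of-the-envelope sketch (my own, not from the thread) of the extra multiply-accumulates the LoRA path adds per token for a single 8192→8192 projection, assuming a Llama-3.3-70B-sized hidden dimension. It suggests rank 256 should cost roughly 16× more LoRA compute than rank 16, which is why a rank-16 slowdown looks surprising:

```python
# Rough arithmetic only; d_in = d_out = 8192 is an assumption for a
# q_proj-style projection in a Llama-3.3-70B-sized model.
d_in, d_out = 8192, 8192
for r in (16, 256):
    # LoRA adds x @ A^T (d_in * r MACs) followed by @ B^T (r * d_out MACs) per token.
    extra_macs = r * (d_in + d_out)
    print(f"rank {r:>3}: ~{extra_macs / 1e6:.1f}M extra MACs per token per layer")
```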

```
python scripts/vllm_infer.py \
    --model_name_or_path /u01/data/analytics/models/llama-3.3-70b-instruct-awq/ \
    --adapter_name_or_path saves/llama3.3-70b/fsdp_qlora_aug_tag_r256/sft/ \
    --dataset sft_dataset_aug_tag \
    --vllm_config "{gpu_memory_utilization: 0.6, max_model_len: 1024, max_lora_rank: 256, max_num_seqs: 1}"
```

and this is the [vllm_infer.py](https://github.com/hiyouga/LLaMA-Factory/blob/main/scripts/vllm_infer.py) script, and these are the GPUs I have...