
AWQ support

anslin-raj opened this issue 9 months ago · 16 comments

I ran into an error with the vLLM framework when I tried to run inference with an Unsloth fine-tuned Llama-3-8B model...

Error:

(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --dtype=half
INFO 05-14 09:46:09 api_server.py:151] vLLM API server version 0.4.1
INFO 05-14 09:46:09 api_server.py:152] args: Namespace(host='127.0.0.1', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', tokenizer='/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit', skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 341, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 464, in create_engine_config
    model_config = ModelConfig(
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 115, in __init__
    self._verify_quantization()
  File "/home/ubuntu/ans/vllm-server/venv/lib/python3.10/site-packages/vllm/config.py", line 160, in _verify_quantization
    raise ValueError(
ValueError: Unknown quantization method: bitsandbytes. Must be one of ['aqlm', 'awq', 'fp8', 'gptq', 'squeezellm', 'marlin'].

Code:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    callbacks = [RichProgressCallback],
    args = TrainingArguments(
        # num_train_epochs = 1,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 2048,
        max_steps = 5,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        # logging_dir = f"/home/ubuntu/ans/llama3_pipeline/fine_tuning/logs",
    ),
)

trainer_stats = trainer.train()

if True:
    model.save_pretrained_merged(
        "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit",
        tokenizer,
        save_method = "merged_4bit_forced",
    )

VLLM cli:

python -O -u -m vllm.entrypoints.openai.api_server --host=127.0.0.1 --port=8000 --model=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit --tokenizer=/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_4bit

Package Versions:

unsloth 2024.4
vllm 0.4.1
NVIDIA-SMI 550.67
Driver Version 550.67
CUDA Version 12.4
Python 3.10.12
torch 2.2.1

Hardware used:

Tesla T4 GPU
Memory 32 GB
8-core CPU

anslin-raj avatar May 14 '24 19:05 anslin-raj

I think you can refer to the answer in https://github.com/unslothai/unsloth/issues/253; it seems that vLLM currently only supports AWQ 4-bit or 8-bit.

Karry11 avatar May 15 '24 14:05 Karry11

You need to change merged_4bit_forced to merged_16bit
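For reference, a minimal sketch of the changed save call (the output directory name here is just illustrative):

model.save_pretrained_merged(
    "/home/ubuntu/ans/llama3_pipeline/fine_tuning/llama3_8b_13_05_2024/vllm_merged_16bit",  # illustrative path
    tokenizer,
    save_method = "merged_16bit",  # instead of "merged_4bit_forced"
)

This writes fully merged 16-bit weights, which vLLM can load without hitting the bitsandbytes quantization check.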

danielhanchen avatar May 15 '24 19:05 danielhanchen

Thanks for the response @Karry11 @danielhanchen,

I tried merged_16bit, but it requires more VRAM than the 16 GB I have. Is there any other way to run the model in vLLM with a 4-bit quantization method?

anslin-raj avatar May 18 '24 16:05 anslin-raj

Convert it to AWQ if you want to use vLLM; otherwise use Unsloth inference for 4-bit models.
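For example, one way to do the AWQ conversion is with the AutoAWQ package. This is only a sketch, assuming you first export the model with save_method = "merged_16bit"; the paths and quant config are illustrative:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = "llama3_8b_merged_16bit"  # output of save_pretrained_merged(..., "merged_16bit")
awq_path = "llama3_8b_awq"

# Load the merged 16-bit export and quantize it to 4-bit AWQ
model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

# Save the AWQ checkpoint; vLLM can then load awq_path with quantization="awq"
model.save_quantized(awq_path)
tokenizer.save_pretrained(awq_path)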

sparsh35 avatar May 24 '24 03:05 sparsh35

Ye AWQ is nice :) We might be adding an AWQ option for exporting!

danielhanchen avatar May 24 '24 10:05 danielhanchen

What's the current best option if I have to use this 4-bit finetuned model with vLLM inference? Is it to convert it to 16-bit and then perform inference?

subhamiitk avatar May 24 '24 18:05 subhamiitk

@subhamiitk Use model.save_pretrained_merged("location", tokenizer, save_method = "merged_16bit",) then use vLLM
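As a usage sketch, the merged 16-bit folder can then be loaded either through the OpenAI-compatible server (as in the command earlier in the thread) or through vLLM's offline Python API; the path is illustrative:

from vllm import LLM, SamplingParams

# dtype="half" is needed on T4-class GPUs, which lack bfloat16 support
llm = LLM(model="/path/to/vllm_merged_16bit", dtype="half")
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)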

danielhanchen avatar May 25 '24 09:05 danielhanchen

Thanks for the consideration @danielhanchen

anslin-raj avatar May 30 '24 06:05 anslin-raj

vLLM's MultiLoRA deployment option, combined with PEFT's recent feature release (training adapters on top of already AWQ-quantized models), opens up some really useful possibilities for inference. Mainly, budget GPUs could serve multiple adapters under one AWQ base model, minimizing the memory footprint and pushing throughput higher.

Exporting an AWQ model is great, but I also see value in training adapters on already AWQ-quantized models. Is there any desire to support this? It would be killer to have Unsloth's performance boosts for this type of fine-tuning.
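To illustrate the deployment pattern being described, here is a rough sketch of serving several LoRA adapters on one AWQ base through vLLM's offline API. It assumes your vLLM version supports LoRA on top of quantized bases; model names, adapter paths, and IDs are illustrative:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One AWQ-quantized base model kept in memory ...
llm = LLM(model="llama3_8b_awq", quantization="awq", enable_lora=True, max_loras=2)
params = SamplingParams(max_tokens=64)

# ... with different LoRA adapters applied per request
out_a = llm.generate(["Summarize this ticket: ..."], params,
                     lora_request=LoRARequest("adapter_a", 1, "/adapters/a"))
out_b = llm.generate(["Draft a reply: ..."], params,
                     lora_request=LoRARequest("adapter_b", 2, "/adapters/b"))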

wrisigo avatar Jun 25 '24 15:06 wrisigo

So sorry for the delay - just relocated to SF. Exporting to AWQ is on the roadmap for now; directly finetuning AWQ could work as well, but it will require changing fast_dequantize.

danielhanchen avatar Jul 01 '24 00:07 danielhanchen

@danielhanchen no issues, thanks for the update... ✨

anslin-raj avatar Jul 02 '24 05:07 anslin-raj

Finetuning an AWQ model would be amazing. I see it has support for PEFT in transformers (https://github.com/huggingface/transformers/pull/28987). This would be amazing to have; it would mean everyone could just work with AWQ models. @danielhanchen
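For context, a sketch of what that transformers + PEFT combination looks like: loading an already AWQ-quantized checkpoint and attaching a LoRA adapter to it. The model ID and LoRA hyperparameters are illustrative, and this is plain transformers + peft, not Unsloth's optimized path:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "some-org/llama-3-8b-awq"  # any AWQ-quantized checkpoint (illustrative name)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0.0,
    bias = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type = "CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable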

vladrad avatar Jul 03 '24 21:07 vladrad

I'll see what I can do!

danielhanchen avatar Jul 04 '24 05:07 danielhanchen

Thank you! Let me know if there is anything I can do to help test. I can write code as well, though this stuff is not my specialty, but I'd love to learn! Feel free to point me somewhere. Being able to fine-tune an AWQ model on low-end hardware and then not having to wait an hour to convert it is going to be huge!

vladrad avatar Jul 05 '24 18:07 vladrad

Oh ye converting it to AWQ takes a lot of time!!

danielhanchen avatar Jul 06 '24 03:07 danielhanchen

Waiting for automagic support of awq models as well. Anything I can do to help/speed things along?

StrangeTcy avatar Aug 05 '24 02:08 StrangeTcy