
issue with qlora fine-tuning on Flex GPU

Open tsantra opened this issue 1 year ago • 10 comments

Hi,

I am trying to run the QLoRA example code from the repo on a Sapphire Rapids machine with a Flex GPU.

I was able to run qlora_finetuning.py without any errors.

But export_merged_model.py gives me this error:

[screenshot: error traceback]

The command I used to merge the model:

`python ./export_merged_model.py --repo-id-or-model-path <path to llama-2-7b-chat-hf> --adapter_path ./outputs/checkpoint-200 --output_path ./outputs/checkpoint-200-merged`

OS: Ubuntu 22

This is my training info:

[screenshot: training configuration]

tsantra avatar Oct 30 '23 23:10 tsantra

Hi @tsantra, would you mind trying again after `pip install accelerate==0.23.0`?

rnwang04 avatar Oct 31 '23 02:10 rnwang04

@rnwang04 Thank you. It worked after installing `accelerate==0.23.0`.

I have two questions:

  1. Is QLoRA fine-tuning supported on CPU?
  2. The code at https://github.com/intel-analytics/BigDL/blob/main/python/llm/example/GPU/QLoRA-FineTuning/export_merged_model.py shows `device_map={"": cpu}`, so which part of the code is running on the Flex GPU?

tsantra avatar Oct 31 '23 22:10 tsantra

Hi @tsantra ,

  1. Yes, it's supported on CPU; we will provide an official CPU example later.
  2. Once you have the merged model (for example, checkpoint-200-merged), you can use it like a normal Hugging Face Transformers model to run inference on the Flex GPU; see https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/llama2
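For intuition on what the merge step produces: folding a trained LoRA adapter into the base weights is, conceptually, `W' = W + (alpha/r) * B @ A`, after which inference needs only the single merged matrix. A minimal numpy sketch of that arithmetic (toy shapes and values; this is not the BigDL/peft implementation):

```python
import numpy as np

# Hypothetical tiny shapes for illustration; real LLaMA-2-7B layers are far larger.
d_out, d_in, r = 8, 8, 2
alpha = 4  # LoRA scaling numerator (lora_alpha)

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in))       # LoRA down-projection
B = np.zeros((d_out, r))                 # LoRA up-projection (initialized to 0)
B[0, 0] = 1.0                            # pretend training updated B

# Merging folds the low-rank update into the base weight once:
W_merged = W + (alpha / r) * (B @ A)

# A forward pass through the merged model needs only one matmul:
x = rng.standard_normal(d_in)
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))  # base path + adapter path
y_merged = W_merged @ x                          # merged path
assert np.allclose(y_adapter, y_merged)
```

This is why the merged checkpoint can be loaded as an ordinary model with no adapter machinery at inference time.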

rnwang04 avatar Nov 01 '23 01:11 rnwang04

Hi @tsantra, the QLoRA CPU example is now available here: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/QLoRA-FineTuning
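For intuition only: QLoRA keeps the base weights quantized to 4 bits and dequantizes them on the fly for compute. A simplified per-block uniform-quantization sketch in numpy (QLoRA actually uses an NF4 codebook, which this does not reproduce):

```python
import numpy as np

# One 64-element weight block with a shared scale, a simplified stand-in
# for QLoRA's block-wise 4-bit quantization.
rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)

scale = np.abs(w).max() / 7.0                      # map into signed 4-bit range
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit codes (stored)
w_hat = q.astype(np.float32) * scale               # dequantized for compute

# Rounding error is bounded by half a quantization step.
err = np.abs(w - w_hat).max()
assert err <= scale / 2 + 1e-6
```

The storage win is that `q` needs 4 bits per weight plus one scale per block, while training updates flow only through the small LoRA matrices in higher precision.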

rnwang04 avatar Nov 02 '23 01:11 rnwang04

Hi @rnwang04 , thank you for your reply!

Are you using any metric to check model accuracy after QLoRA fine-tuning? I used my own custom dataset for fine-tuning, and my inference results are not good; the model hallucinates a lot. Do you have any BKM (best known methods) for fine-tuning?

Are you also using a profiler to check GPU memory usage? Do you have any suggestions?

tsantra avatar Nov 03 '23 22:11 tsantra

Had closed this by mistake.

tsantra avatar Nov 03 '23 22:11 tsantra

@rnwang04 GPU fine-tuning suddenly stopped working and gave a segmentation fault.

[screenshot: segfault error]

tsantra avatar Nov 06 '23 00:11 tsantra

> @rnwang04 GPU finetuning suddenly stopped working and gave Seg Fault.

Hi @tsantra, have you ever run GPU fine-tuning successfully, or do you always hit this error? If it worked before, did you change anything in your script or environment settings?

rnwang04 avatar Nov 06 '23 01:11 rnwang04

> Are you using any metric to check for model accuracy after QLora finetuning. I had used my custom dataset for finetuning and my inference results are not good. Model is hallucinating a lot. Do you have any BKM for fine-tuning?

Have you checked your fine-tuning loss curve? Does the loss decrease normally during fine-tuning and ultimately stabilize at a fixed value? What are the approximate final train loss and eval loss?
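Raw step-by-step losses are noisy, so a smoothed curve makes the trend easier to judge. A minimal sketch (assuming the per-step losses were collected into a Python list; the values below are made up):

```python
def ema(values, beta=0.9):
    """Bias-corrected exponential moving average, as common training loggers use."""
    out, avg = [], 0.0
    for t, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        out.append(avg / (1 - beta ** t))  # bias correction for early steps
    return out

# A healthy run trends downward and flattens; noise in the raw log can hide that.
raw = [2.1, 1.9, 2.0, 1.6, 1.7, 1.3, 1.2, 1.25, 1.1, 1.05]
smooth = ema(raw)
assert smooth[-1] < smooth[0]  # overall downward trend
```

If the smoothed curve plateaus at a high loss, or eval loss rises while train loss keeps falling, that points to underfitting or overfitting respectively rather than an inference-side problem.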

> Are you also using any profiler to check for GPU memory usage? Do you have any suggestion?

I just use the "GPU Memory Used" column of `sudo xpu-smi stats -d 0` to check GPU memory usage.

rnwang04 avatar Nov 06 '23 06:11 rnwang04

> @rnwang04 GPU finetuning suddenly stopped working and gave Seg Fault.
>
> [screenshot: segfault error]

Are you running it inside VS Code?

shane-huang avatar Jan 22 '24 12:01 shane-huang