Support Qwen2
We add support for Qwen2, which is important for the open-source community. Our repo Firefly already supports training Qwen2 with Unsloth; more experiment details can be found in our model card.
We evaluated the training gains on Qwen1.5-7B, using QLoRA with and without Unsloth to train for 20 steps on a single V100. The results are listed below; for example, in the 2048-token, batch-size-4 setting, Unsloth reduces GPU memory by 39.13% and training time by 32.12%, which corresponds to a 47.32% increase in training speed. A minimal reproduction sketch follows the table.
| max_seq_length | per_device_train_batch_size | gradient_accumulation_steps | use_unsloth | rank | GPU memory | Training time |
|---|---|---|---|---|---|---|
| 1024 | 1 | 16 | false | 8 | 13.72GB | 448s |
| 1024 | 1 | 16 | true | 8 | 8.43GB(-38.56%) | 308s(-31.25%) |
| 1024 | 1 | 16 | false | 64 | 16.01GB | 452s |
| 1024 | 1 | 16 | true | 64 | 11.07GB(-30.86%) | 311s(-31.19%) |
| 2048 | 1 | 16 | false | 64 | 18.55GB | 840s |
| 2048 | 1 | 16 | true | 64 | 12.99GB(-29.97%) | 596s(-29.05%) |
| 1024 | 4 | 4 | false | 64 | 24.70GB | 357s |
| 1024 | 4 | 4 | true | 64 | 14.36GB(-41.86%) | 253s(-29.13%) |
| 2048 | 4 | 4 | false | 64 | 32.51GB | 741s |
| 2048 | 4 | 4 | true | 64 | 19.79GB(-39.13%) | 503s(-32.12%) |
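For reference, here is a minimal sketch of the kind of run benchmarked above, following Unsloth's usual `FastLanguageModel` + TRL `SFTTrainer` pattern. The dataset and prompt template are placeholders (not the actual Firefly training setup); the hyperparameters mirror the first rows of the table.

```python
# Minimal sketch of the benchmarked setup (QLoRA + Unsloth, 20 steps).
# The dataset and prompt template below are placeholders, not Firefly's.
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen1.5-7B",
    max_seq_length = 1024,
    load_in_4bit = True,   # QLoRA: 4-bit quantized base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64,                # LoRA rank, as in the table above
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
)

dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
def to_text(example):
    # Flatten instruction/input/output into one string (placeholder template).
    return {"text": f"{example['instruction']}\n{example['input']}\n{example['output']}"}
dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 1024,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 16,
        max_steps = 20,    # matches the 20-step benchmark above
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),  # V100 has no bf16
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
    ),
)
trainer.train()
```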
We also evaluated our SFT and DPO models trained with Unsloth on the Open LLM Leaderboard; they achieve good performance and outperform the official Qwen1.5-7B-Chat. A reproduction sketch follows the table.
| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| firefly-gemma-7b | 62.93 | 62.12 | 79.77 | 61.57 | 49.41 | 75.45 | 49.28 |
| firefly-qwen1.5-en-7b-dpo-v0.1-unsloth | 62.65 | 56.14 | 75.5 | 60.87 | 58.09 | 70.72 | 54.59 |
| zephyr-7b-beta | 61.95 | 62.03 | 84.36 | 61.07 | 57.45 | 77.74 | 29.04 |
| firefly-qwen1.5-en-7b-unsloth | 61.81 | 54.27 | 76.22 | 61.55 | 50.62 | 70.48 | 57.7 |
| vicuna-13b-v1.5 | 55.41 | 57.08 | 81.24 | 56.67 | 51.51 | 74.66 | 11.3 |
| Xwin-LM-13B-V0.1 | 55.29 | 62.54 | 82.8 | 56.53 | 45.96 | 74.27 | 9.63 |
| Qwen1.5-7B-Chat | 55.15 | 55.89 | 78.56 | 61.65 | 53.54 | 67.72 | 13.57 |
| gemma-7b-it | 53.56 | 51.45 | 71.96 | 53.52 | 47.29 | 67.96 | 29.19 |
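These leaderboard numbers can be roughly approximated locally with EleutherAI's lm-evaluation-harness (v0.4+), using the leaderboard's per-task few-shot settings. A sketch follows; the model repo id is an assumption, and local numbers may differ slightly from the hosted runs.

```python
# Rough local approximation of the Open LLM Leaderboard evaluation with
# EleutherAI's lm-evaluation-harness (pip install lm-eval). Results may
# differ slightly from the hosted leaderboard; the model id is assumed.
import lm_eval

TASKS = {                 # task -> few-shot count, per the leaderboard setup
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}

for task, shots in TASKS.items():
    results = lm_eval.simple_evaluate(
        model = "hf",
        model_args = "pretrained=YeungNLP/firefly-qwen1.5-en-7b-unsloth",  # assumed repo id
        tasks = [task],
        num_fewshot = shots,
        batch_size = 8,
    )
    print(task, results["results"][task])
```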
@yangjianxin1 Oh wait does Qwen2 not have that weird alternating sliding window & normal attention thingo?
Yes, there is no weird alternating sliding window & normal attention in Qwen2, and its `use_sliding_window` is false in the config.json.
I have also compared the Llama and Qwen2 code almost line by line; they are very similar.
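This is easy to check directly from the Hub config with the standard transformers API, e.g.:

```python
# Verify that Qwen1.5 (Qwen2 architecture) disables sliding-window attention.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen1.5-7B")
print(config.model_type)          # "qwen2"
print(config.use_sliding_window)  # False
```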
Thanks for the PR again! I streamlined Qwen2 to call FastMistralModel (since I think it's an exact replica, right?)
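In other words, the change amounts to a thin alias rather than a separate implementation. A sketch of the shape (the import path and wrapper class name here are illustrative, not necessarily Unsloth's actual internals):

```python
# Illustrative sketch only: Qwen2 matches Mistral's architecture
# (no alternating sliding-window attention), so the Qwen2 fast path
# can simply reuse FastMistralModel. Names other than FastMistralModel
# are assumptions, not Unsloth's actual internals.
from unsloth.models.mistral import FastMistralModel

class FastQwen2Model(FastMistralModel):
    """Qwen2 shares Mistral's layer layout and sets use_sliding_window=False,
    so Mistral's fused kernels and patches apply unchanged."""
    pass
```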
Could you please provide a detailed explanation of the specific process of fine-tuning Qwen1.5-7B-Chat using Unsloth? I want to fine-tune Qwen1.5-7B myself.