What should I do if I want to improve performance on HellaSwag?
I want to find a suitable dataset (for example OpenO1), distill 14B to 3B (KD), or use LoRA, but I got a bad result:
the KD result only reaches 96.8% of the original 3B Qwen2.5 model.
What should I do? Thanks.
Did you fine-tune the 14B model on your desired dataset first? That's an important pre-step to knowledge distillation.
Sorry I didn't, I mistakenly thought it was not important.
All good - give that a go and LMK how it works after re-evaluating
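For reference, the overall order of operations might be sketched as below. The recipe and config names are assumptions (run `tune ls` to see the recipes your torchtune version actually ships), and `<teacher_ckpt>` is a placeholder for the fine-tuned teacher checkpoint directory:

```python
import shlex

# A rough sketch of the intended pipeline with torchtune's `tune` CLI and
# lm-eval. Recipe/config names below are illustrative assumptions; check
# `tune ls` for the exact recipes available in your torchtune version.
pipeline = [
    # 1. LoRA-fine-tune the 14B teacher on the SFT data first.
    "tune run lora_finetune_single_device --config lora_14B.yaml",
    # 2. Re-evaluate the tuned teacher; only distill if it beats the base 14B.
    "lm_eval --model hf --model_args pretrained=<teacher_ckpt> --tasks hellaswag",
    # 3. Distill the tuned 14B teacher into the 3B student.
    "tune run knowledge_distillation_single_device --config kd_14B_to_3B.yaml",
]

for cmd in pipeline:
    # Print the tool name and the full command for each stage.
    print(shlex.split(cmd)[0], "->", cmd)
```

The key point is stage 2: if the LoRA-tuned teacher does not beat the base 14B on HellaSwag, distilling from it cannot be expected to help the 3B student.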
My lora_14B.yaml is:
model:
  _component_: torchtune.models.qwen2_5.lora_qwen2_5_14b_instruct
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 64  # higher increases accuracy and memory
  lora_alpha: 128  # usually alpha=2*rank
  lora_dropout: 0.0

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: /generate_data/sft_data/sft-hellaswag_data.jsonl
  split: train
  column_map:
    input: prompt
    output: response
seed: null
shuffle: True
batch_size: 8

epochs: 2
max_steps_per_epoch: null
gradient_accumulation_steps: 8
compile: False
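Two numbers in that config are worth double-checking, since both follow common conventions rather than anything HellaSwag-specific. A minimal sanity-check sketch, using the values from the config above:

```python
# Sanity-check two conventions from the LoRA config: alpha = 2 * rank, and
# the effective batch size the optimizer actually sees per update.
lora_rank = 64
lora_alpha = 128
batch_size = 8
gradient_accumulation_steps = 8

# The usual alpha = 2 * rank heuristic the config comment mentions.
assert lora_alpha == 2 * lora_rank

# Each optimizer step accumulates gradients over 8 micro-batches of 8.
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 64
```

If you later change `batch_size` or `gradient_accumulation_steps` to fit memory, keeping their product constant keeps the effective batch size (and thus the learning-rate behavior) comparable.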
And the sft-hellaswag_data.jsonl format is shown below:
{"prompt": "With the text and choices provided, determine the sentence that best serves as the subsequent one. Return the answer as a number.\nText: Getting a haircut: He uses an electric clipper to groom the sideburns and the temples. He also trims the back and sides of his head with the clippers. He\nChoices: ['1. then picks up some lipstick on the table to apply it to his face.', '2. uses scissors to trim the hair and give it a finished look.', '3. decorates with a very light style lip liner to complete the look.', '4. then polishes his front teeth with a razor.']", "response": "1"}
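A quick way to sanity-check records in this format: the `column_map` in the config renames the dataset's `prompt`/`response` columns to the `input`/`output` fields that torchtune's `instruct_dataset` expects. A minimal sketch (the sample record is abbreviated):

```python
import json

# One abbreviated record in the sft-hellaswag_data.jsonl format.
line = '{"prompt": "Text: Getting a haircut ... Choices: [...]", "response": "1"}'
record = json.loads(line)

# Apply the same mapping as the config: input <- prompt, output <- response.
column_map = {"input": "prompt", "output": "response"}
example = {dst: record[src] for dst, src in column_map.items()}

print(sorted(example))      # ['input', 'output']
print(example["output"])    # "1", the gold choice index as a string
```

Running a loop like this over the whole file before training catches malformed lines and missing keys early, which is cheaper than discovering them mid-fine-tune.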
I evaluated with lm-eval, and the HellaSwag performance is lower than Qwen2.5-14B-Instruct.
In other words, I didn't get a better result from LoRA on 14B-Instruct, so I have no reason to do the next step, distilling the 3B model from the 14B LoRA model. Is my understanding correct?
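For comparing runs, a relative figure like the "96.8% of the original 3B" above is just the ratio of the two lm-eval scores. A tiny sketch with hypothetical accuracies (substitute your real `acc_norm` numbers from lm-eval):

```python
# Hypothetical accuracies, chosen only to illustrate how a "96.8% of the
# original model" figure is derived; plug in your real lm-eval acc_norm values.
baseline_acc = 0.750   # original Qwen2.5-3B (assumed value)
kd_acc = 0.726         # distilled 3B (assumed value)

relative = kd_acc / baseline_acc
print(f"{relative:.1%}")  # 96.8%
```

When comparing, make sure both runs use the same lm-eval task version, few-shot setting, and metric (`acc` vs `acc_norm`), otherwise the ratio is not meaningful.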