What should I do if I want to improve performance on HellaSwag?
I want to find a suitable dataset (for example OpenO1), distill 14B to 3B (KD), or use LoRA, but I got a bad result:
the KD result only reaches 96.8% of the original 3B Qwen2.5 model.
What should I do? Thanks.
Did you fine-tune the 14B model on your desired dataset first? That's an important pre-step to knowledge distillation.
Sorry I didn't, I mistakenly thought it was not important.
All good - give that a go and LMK how it works after re-evaluating
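For reference, the overall order of operations might be sketched as below. The recipe and config names are assumptions (run `tune ls` to see the recipes your torchtune version actually ships), and `<teacher_ckpt>` is a placeholder for the fine-tuned teacher checkpoint directory:

```python
import shlex

# A rough sketch of the intended pipeline with torchtune's `tune` CLI and
# lm-eval. Recipe/config names below are illustrative assumptions; check
# `tune ls` for the exact recipes available in your torchtune version.
pipeline = [
    # 1. LoRA-fine-tune the 14B teacher on the SFT data first.
    "tune run lora_finetune_single_device --config lora_14B.yaml",
    # 2. Re-evaluate the tuned teacher; only distill if it beats the base 14B.
    "lm_eval --model hf --model_args pretrained=<teacher_ckpt> --tasks hellaswag",
    # 3. Distill the tuned 14B teacher into the 3B student.
    "tune run knowledge_distillation_single_device --config kd_14B_to_3B.yaml",
]

for cmd in pipeline:
    # Print the tool name and the full command for each stage.
    print(shlex.split(cmd)[0], "->", cmd)
```

The key point is stage 2: if the LoRA-tuned teacher does not beat the base 14B on HellaSwag, distilling from it cannot be expected to help the 3B student.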
My lora_14B.yaml is:
model:
  _component_: torchtune.models.qwen2_5.lora_qwen2_5_14b_instruct
  lora_attn_modules: ['q_proj', 'k_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 64  # higher increases accuracy and memory
  lora_alpha: 128  # usually alpha=2*rank
  lora_dropout: 0.0

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: /generate_data/sft_data/sft-hellaswag_data.jsonl
  split: train
  column_map:
    input: prompt
    output: response
seed: null
shuffle: True
batch_size: 8

epochs: 2
max_steps_per_epoch: null
gradient_accumulation_steps: 8
compile: False
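Two numbers in that config are worth double-checking, since both follow common conventions rather than anything HellaSwag-specific. A minimal sanity-check sketch, using the values from the config above:

```python
# Sanity-check two conventions from the LoRA config: alpha = 2 * rank, and
# the effective batch size the optimizer actually sees per update.
lora_rank = 64
lora_alpha = 128
batch_size = 8
gradient_accumulation_steps = 8

# The usual alpha = 2 * rank heuristic the config comment mentions.
assert lora_alpha == 2 * lora_rank

# Each optimizer step accumulates gradients over 8 micro-batches of 8.
effective_batch_size = batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 64
```

If you later change `batch_size` or `gradient_accumulation_steps` to fit memory, keeping their product constant keeps the effective batch size (and thus the learning-rate behavior) comparable.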
And the sft-hellaswag_data.jsonl format is shown below:
{"prompt": "With the text and choices provided, determine the sentence that best serves as the subsequent one. Return the answer as a number.\nText: Getting a haircut: He uses an electric clipper to groom the sideburns and the temples. He also trims the back and sides of his head with the clippers. He\nChoices: ['1. then picks up some lipstick on the table to apply it to his face.', '2. uses scissors to trim the hair and give it a finished look.', '3. decorates with a very light style lip liner to complete the look.', '4. then polishes his front teeth with a razor.']", "response": "1"}
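A quick way to sanity-check records in this format: the `column_map` in the config renames the dataset's `prompt`/`response` columns to the `input`/`output` fields that torchtune's `instruct_dataset` expects. A minimal sketch (the sample record is abbreviated):

```python
import json

# One abbreviated record in the sft-hellaswag_data.jsonl format.
line = '{"prompt": "Text: Getting a haircut ... Choices: [...]", "response": "1"}'
record = json.loads(line)

# Apply the same mapping as the config: input <- prompt, output <- response.
column_map = {"input": "prompt", "output": "response"}
example = {dst: record[src] for dst, src in column_map.items()}

print(sorted(example))      # ['input', 'output']
print(example["output"])    # "1", the gold choice index as a string
```

Running a loop like this over the whole file before training catches malformed lines and missing keys early, which is cheaper than discovering them mid-fine-tune.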
I evaluated with lm-eval, and the HellaSwag performance is lower than Qwen2.5-14B-Instruct.
In other words, I didn't get a better result from LoRA on 14B-Instruct, so I have no reason to do the next step, distilling the 3B model from the 14B LoRA model. Is my understanding correct?
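For comparing runs, a relative figure like the "96.8% of the original 3B" above is just the ratio of the two lm-eval scores. A tiny sketch with hypothetical accuracies (substitute your real `acc_norm` numbers from lm-eval):

```python
# Hypothetical accuracies, chosen only to illustrate how a "96.8% of the
# original model" figure is derived; plug in your real lm-eval acc_norm values.
baseline_acc = 0.750   # original Qwen2.5-3B (assumed value)
kd_acc = 0.726         # distilled 3B (assumed value)

relative = kd_acc / baseline_acc
print(f"{relative:.1%}")  # 96.8%
```

When comparing, make sure both runs use the same lm-eval task version, few-shot setting, and metric (`acc` vs `acc_norm`), otherwise the ratio is not meaningful.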