
[MiniLLM] teacher generated responses `gen_answer` not used in seqKD

Open hieuchi911 opened this issue 6 months ago • 2 comments

I'm running sequence-level KD (seqKD) on LLaMA. The first step generates responses with the teacher:

  • Generate responses with the teacher:
    bash scripts/llama/tools/generate_data_seqkd.sh /PATH/TO/MiniLLM
    bash scripts/llama/tools/process_pseudo_data_seqkd.sh /PATH/TO/MiniLLM
    

I observed a problem. scripts/llama/tools/generate_data_seqkd.sh creates an augmented dataset in a jsonl file, where every JSON object has this format:

{
  "instruction": "...",    # the instruction
  "prompt": "...",    # the instruction prompt including the input data
  "input": "...",    # the input data
  "output": "...",    # the ground truth
  "gen_answer": "...",    # the teacher generated response
}

Later, when creating the binary files that store the tokenized version of this new dataset, scripts/llama/tools/process_pseudo_data_seqkd.sh only tokenizes instruction, input, and output. gen_answer is not used at all, although I believe it should be used in place of output, since seqKD is supposed to train the student on the teacher-generated responses.
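To make the expected behavior concrete, here is a minimal sketch of the substitution I have in mind, applied to the jsonl records before tokenization. The function name `use_teacher_responses` is hypothetical and not part of the MiniLLM code; only the field names come from the format above.

```python
import json


def use_teacher_responses(jsonl_lines):
    """Yield records whose "output" is replaced by the teacher's "gen_answer".

    Hypothetical helper for illustration only; not taken from the repository.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the jsonl file
        record = json.loads(line)
        # Fall back to the ground truth when no teacher response is present.
        if "gen_answer" in record:
            record["output"] = record["gen_answer"]
        yield record


# Small in-memory example using the field names from the format above.
sample = [
    json.dumps({
        "instruction": "Translate to French.",
        "prompt": "Translate to French.\nInput: Hello",
        "input": "Hello",
        "output": "ground truth",
        "gen_answer": "teacher response",
    })
]
converted = list(use_teacher_responses(sample))
```

With this substitution, a processing step that reads only instruction, input, and output would tokenize the teacher response as the target.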

Is this a bug?

hieuchi911, Aug 01 '24 22:08