LMOps
[MiniLLM] teacher generated responses `gen_answer` not used in seqKD
I'm running sequence-level KD (SeqKD) of LLaMA, and I have a question about the first step, generating responses with the teacher:
- Generate responses with the teacher:
bash scripts/llama/tools/generate_data_seqkd.sh /PATH/TO/MiniLLM
bash scripts/llama/tools/process_pseudo_data_seqkd.sh /PATH/TO/MiniLLM
I noticed a problem. scripts/llama/tools/generate_data_seqkd.sh creates an augmented dataset in a JSONL file, where every JSON object has this format:
{
"instruction": "...", # the instruction
"prompt": "...", # the instruction prompt including the input data
"input": "...", # the input data
"output": "...", # the ground truth
"gen_answer": "...", # the teacher generated response
}
Later on, when creating the binary files that store the tokenized version of this new dataset, scripts/llama/tools/process_pseudo_data_seqkd.sh only uses `instruction`, `input`, and `output` for tokenization. `gen_answer` is not used at all, while I believe `gen_answer` should be used instead of `output`.
Is this a bug?
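For reference, here is a minimal sketch of the workaround I have in mind, assuming the processing step really does read only `instruction`, `input`, and `output`. It rewrites the JSONL so that `output` holds the teacher response before tokenization. The function names and file paths are hypothetical and not part of the MiniLLM code; only the field names come from the format above.

```python
import json


def use_teacher_response(record: dict) -> dict:
    """Return a copy of the record whose "output" is the teacher response.

    Hypothetical helper: if "gen_answer" is missing or empty, the original
    ground-truth "output" is left unchanged.
    """
    patched = dict(record)
    gen_answer = patched.get("gen_answer")
    if gen_answer:
        patched["output"] = gen_answer
    return patched


def rewrite_jsonl(src_path: str, dst_path: str) -> int:
    """Copy a JSONL file, patching every record; return the record count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = use_teacher_response(json.loads(line))
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Running something like `rewrite_jsonl("raw.jsonl", "raw_seqkd.jsonl")` (placeholder paths) before the processing script would make the student train on the teacher responses, if my reading of the script is right.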