proxy-tuning seems ineffective in some settings

Open NuoJohnChen opened this issue 1 year ago • 0 comments

I use Qwen-2-0.5b as the anti_expert_model, Qwen-2-0.5b tuned on codex_humaneval as the expert_model, and Qwen2-7B as the base_model. The EM score of the proxy-tuned Qwen2-7B is 0.4167682926829268.

# Evaluating DExperts with codex_humaneval expert
size=13
echo "Results dir: results/codex_humaneval/dexperts-7B"
python -m eval.codex_humaneval.run_eval_new \
    --data_file data/eval/codex_humaneval/HumanEval.jsonl \
    --save_dir results/codex_humaneval/dexperts-7B \
    --base_model_name_or_path Qwen2-7B \
    --expert_model_name_or_path qwen-2-codealpaca-0.5b \
    --eval_batch_size 20

However, the EM score of Qwen2-7B evaluated on its own as the base_model is 0.463109756097561, which is higher than that of the proxy-tuned Qwen2-7B. (This was run in https://github.com/allenai/open-instruct/)

size=7
echo "Results dir: results/codex_humaneval/Qwen2-${size}B"
python -m eval.codex_humaneval.run_eval \
   --data_file data/eval/codex_humaneval/HumanEval.jsonl \
   --save_dir results/codex_humaneval/Qwen2-${size}B \
   --model_name_or_path Qwen2-${size}B \
   --eval_batch_size 20

I am not sure what is going wrong here.
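
For reference, here is a minimal sketch of the logit arithmetic that proxy-tuning applies at each decoding step, as I understand it: the base model's logits are shifted by the difference between the expert's and the anti-expert's logits. This is an illustration with made-up toy numbers, not the repository's implementation; the function names `proxy_tuned_logits` and `softmax` are hypothetical, and the check that all three logit vectors have the same length reflects my assumption that the three models must share one vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Shift base logits by (expert - anti_expert), token by token.

    Hypothetical helper for illustration only. All three vectors are
    assumed to be indexed by the same vocabulary.
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all three logit vectors must have the same length")
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy next-token logits over a 3-token vocabulary (made-up values).
base = [2.0, 1.0, 0.0]
expert = [0.0, 2.0, 0.0]
anti_expert = [0.0, 0.0, 0.0]

combined = proxy_tuned_logits(base, expert, anti_expert)
probs = softmax(combined)
print(combined)
print(probs)
```

If the expert and anti-expert produce identical logits, the offset is zero and the result equals the base model's logits, so any difference in score comes from how the tuned small model differs from the untuned one.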

Aug 08 '24 14:08 NuoJohnChen