Inquiry Regarding the Public Release of LLaDA-8B-Instruct Evaluation Code and Result Validation
I hope this message finds you well. I'm reaching out about the public release of the full evaluation code for LLaDA-8B-Instruct. When I test on datasets such as HumanEval and MBPP, my results show some discrepancies from those reported in your paper. Additionally, when configuring lm-eval for GPQA or MATH, parameters such as block_length have produced unexpected outcomes, which makes me wonder whether there are specific configurations or settings I'm missing (my understanding of how these parameters interact is sketched after the commands below). Thank you for your time and consideration.
For the GPQA task, the command used was:
accelerate launch eval_llada.py --tasks gpqa_main_generative_n_shot --model llada_dist --confirm_run_unsafe_code --num_fewshot 5 --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=128,steps=128,block_length=64
For Minerva Math, the setup was:
accelerate launch eval_llada.py --tasks minerva_math --model llada_dist --num_fewshot 4 --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=256
I ran these on eight A800 40GB GPUs, with a Python environment configured to match the requirements specified in requirements.txt.
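For reference, here is my understanding of how gen_length, steps, and block_length interact, based on the generate.py published in the LLaDA repo. This is only a sketch of the bookkeeping, not the authors' evaluation code, so the exact checks may differ:

```python
# Sketch of the block/step schedule in LLaDA's semi-autoregressive sampler,
# as I understand it from the repo's generate.py (not verified against
# eval_llada.py itself).
def describe_schedule(gen_length: int, steps: int, block_length: int) -> None:
    # The response is generated block by block, left to right.
    assert gen_length % block_length == 0, "gen_length must be divisible by block_length"
    num_blocks = gen_length // block_length
    # The total step budget is split evenly across the blocks.
    assert steps % num_blocks == 0, "steps must be divisible by the number of blocks"
    steps_per_block = steps // num_blocks
    print(f"{num_blocks} block(s) x {steps_per_block} denoising step(s) each")

describe_schedule(gen_length=128, steps=128, block_length=64)   # GPQA command: 2 blocks x 64 steps
describe_schedule(gen_length=256, steps=256, block_length=256)  # Minerva Math: 1 block x 256 steps
```

If this reading is correct, both of my configurations satisfy the divisibility constraints, so I don't think the unexpected results come from an invalid schedule.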
Thanks for your interest!
I sincerely apologize for the trouble the evaluation of the Instruct model has caused you. Please refer to EVAL.md for more details. We will try to resolve the issues with evaluating LLaDA-Instruct via lm-eval as soon as possible. I apologize again: I'm currently the sole maintainer of the LLaDA codebase, and I've been swamped with numerous new tasks.
By the way, I've received feedback from multiple users indicating that the OpenCompass library can reproduce the results of LLaDA-Instruct relatively easily.
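I haven't verified this myself, but an OpenCompass run is typically driven by a Python config file. A rough sketch of what a model entry might look like is below; note that the stock HuggingFaceCausalLM wrapper assumes autoregressive decoding, so a custom wrapper around LLaDA's diffusion sampler would still be needed:

```python
# Hedged sketch of an OpenCompass model config for LLaDA-8B-Instruct.
# HuggingFaceCausalLM decodes autoregressively; for LLaDA you would subclass
# it (or write a custom model class) that calls the diffusion-based
# generate() instead.
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,        # placeholder: swap in a LLaDA wrapper
        abbr='llada-8b-instruct',
        path='GSAI-ML/LLaDA-8B-Instruct',
        tokenizer_path='GSAI-ML/LLaDA-8B-Instruct',
        max_out_len=256,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```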
+1 on this issue. For reproducing the Llama 3 benchmarks, I'm not sure whether you have tried the following evaluation suites in lm-eval (see lm_eval/tasks/llama3/README.md for more details):
- mmlu_llama
- mmlu_pro_llama
- mmlu_cot_llama
- arc_challenge_llama
- gsm8k_llama
Also, there is mbpp_instruct, which is for instruction-tuned models.
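For instance, reusing the entry point from above, something along these lines might work (untested on my end; --apply_chat_template and --fewshot_as_multiturn are standard lm-eval flags that, as far as I know, the llama3 task README recommends for instruct models, but please double-check):

accelerate launch eval_llada.py --tasks gsm8k_llama --model llada_dist --apply_chat_template --fewshot_as_multiturn --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=256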
Looking forward to the code release. Any guidance on running the LLaDA-Instruct evaluation in other frameworks such as OpenCompass would also be much appreciated.
+1 on the issue.