
Inquiry Regarding the Public Release of LLaDA-8B-Instruct Evaluation Code and Result Validation

Open · Haohao378 opened this issue 6 months ago · 4 comments

I hope this message finds you well. I'm reaching out about the public release of the full evaluation code for LLaDA-8B-Instruct. When testing on datasets like HumanEval and MBPP, my results show some discrepancies from those reported in your paper. Additionally, when configuring lm-eval for GPQA or MATH, parameters such as block_length have produced unexpected results, which makes me wonder whether there are specific configurations or settings I'm missing. Thank you for your time and consideration.

For the GPQA task, the command used was:

```bash
accelerate launch eval_llada.py --tasks gpqa_main_generative_n_shot --model llada_dist --confirm_run_unsafe_code --num_fewshot 5 --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=128,steps=128,block_length=64
```

For Minerva Math, the setup was:

```bash
accelerate launch eval_llada.py --tasks minerva_math --model llada_dist --num_fewshot 4 --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=256
```

The evaluations were run on eight A800 40GB GPUs, with a Python environment configured to match the requirements specified in requirements.txt.
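For the code-generation tasks, the commands followed the same pattern. Here is a rough sketch of the HumanEval invocation; the generation parameters are placeholders I tried, not settings confirmed by the paper:

```bash
# Sketch of a HumanEval run in the same style as the commands above.
# gen_length/steps/block_length are illustrative guesses, not the paper's settings.
accelerate launch eval_llada.py --tasks humaneval --model llada_dist \
  --confirm_run_unsafe_code \
  --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=512,steps=512,block_length=32
```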

Haohao378 · Jun 19, 2025

Thanks for your interest!

I sincerely apologize for the trouble the evaluation of the Instruct model has caused you. Please refer to EVAL.md for more details. We will attempt to resolve the issues with evaluating LLaDA-Instruct using lm-eval as soon as possible. I apologize again—currently, I'm the sole maintainer of the LLaDA codebase, and I've been swamped with numerous new tasks.

By the way, I've received feedback from multiple users indicating that the OpenCompass library can reproduce the LLaDA-Instruct results relatively easily.

nieshenx · Jun 30, 2025

+1 on this issue. For reproducing the Llama 3 benchmark setup, I'm not sure if you have tried the following evaluation suites in lm-eval (see lm_eval/tasks/llama3/README.md for more details):

  • mmlu_llama
  • mmlu_pro_llama
  • mmlu_cot_llama
  • arc_challenge_llama
  • gsm8k_llama

There is also mbpp_instruct, which is intended for instruction-tuned models; a rough example invocation is sketched below.
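For example, a minimal sketch of invoking one of these recipes through the same eval_llada.py entry point; the few-shot count and generation parameters are assumptions, not verified settings for LLaDA-Instruct:

```bash
# Hedged sketch: gsm8k_llama via the same eval_llada.py harness used above.
# --apply_chat_template / --fewshot_as_multiturn are standard lm-eval flags for
# instruct-model evals and are assumed to pass through eval_llada.py;
# the model_args values are guesses, not known-good settings.
accelerate launch eval_llada.py --tasks gsm8k_llama --model llada_dist \
  --num_fewshot 8 --apply_chat_template --fewshot_as_multiturn \
  --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32
```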

Looking forward to the code release. Any guidance on running the LLaDA-Instruct evaluation in other frameworks such as OpenCompass would also be much appreciated.

ZhaozhiQIAN · Jul 1, 2025

+1 on the issue.

Rachum-thu · Jul 2, 2025

+1 on the issue.

Kairong-Han · Jul 11, 2025