[Badcase]: Cannot Reproduce results on benchmark datasets (e.g. humaneval)
Model Series
Qwen2.5
What are the models used?
Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-7B
What is the scenario where the problem happened?
Cannot reproduce Qwen1.5-MoE 's performance on benchmark datasets.
Is this badcase known and can it be solved using available techniques?
- [x] I have followed the GitHub README.
- [x] I have checked the Qwen documentation and cannot find a solution there.
- [x] I have checked the documentation of the related framework and cannot find useful information.
- [x] I have searched the issues and there is not a similar one.
Information about environment
- OS: Ubuntu 24.04
- Python: Python 3.11
- GPUs: 8 x NVIDIA A100
- NVIDIA driver: 550
- CUDA compiler: 12.4
- PyTorch: 2.5.1+cu124
Package Version Editable project location
absl-py 2.1.0 accelerate 1.4.0 aenum 3.1.15 aiohappyeyeballs 2.5.0 aiohttp 3.11.13 aiohttp-cors 0.7.0 aiosignal 1.3.2 airportsdata 20250224 annotated-types 0.7.0 antlr4-python3-runtime 4.13.2 anyio 4.8.0 astor 0.8.1 attrs 25.1.0 bitsandbytes 0.45.3 blake3 1.0.4 bleach 6.2.0 blis 0.7.11 braceexpand 0.1.7 cachetools 5.5.2 catalogue 2.0.10 certifi 2025.1.31 chardet 5.2.0 charset-normalizer 3.4.1 click 8.1.8 clip-benchmark 1.6.1 cloudpathlib 0.16.0 cloudpickle 3.1.1 colorama 0.4.6 colorful 0.5.6 colorlog 6.9.0 compressed-tensors 0.9.1 confection 0.1.5 contourpy 1.3.1 cycler 0.12.1 cymem 2.0.11 DataProperty 1.1.0 datasets 3.3.2 deepspeed 0.15.4 depyf 0.18.0 dill 0.3.8 diskcache 5.6.3 distlib 0.3.9 distro 1.9.0 docker-pycreds 0.4.0 einops 0.8.1 fastapi 0.115.11 filelock 3.17.0 fire 0.7.0 flake8 7.1.2 flash-attn 2.7.3 fonttools 4.56.0 frozenlist 1.5.0 fsspec 2024.12.0 ftfy 6.3.1 gguf 0.10.0 git-lfs 1.6 gitdb 4.0.12 GitPython 3.1.44 google-api-core 2.24.2 google-auth 2.38.0 googleapis-common-protos 1.69.1 grpcio 1.70.0 h11 0.14.0 hf_transfer 0.1.9 hf-xet 1.1.2 hjson 3.1.0 httpcore 1.0.7 httptools 0.6.4 httpx 0.28.1 huggingface 0.0.1 huggingface-hub 0.32.0 human-eval 1.0 /home/ubuntu/tiansheng/Adaptive_MOE/data/human-eval idna 3.10 imagenetv2_pytorch 0.1 importlib_metadata 8.6.1 iniconfig 2.0.0 inquirerpy 0.3.4 interegular 0.3.3 isort 6.0.1 Jinja2 3.1.6 jiter 0.8.2 joblib 1.4.2 jsonlines 4.0.0 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kaggle 1.7.4.2 kiwisolver 1.4.8 langcodes 3.5.0 langdetect 1.0.9 language_data 1.3.0 lark 1.2.2 latex2sympy2_extended 1.0.6 liger_kernel 0.5.3 lighteval 0.6.0.dev0 lightning-utilities 0.14.1 lm-format-enforcer 0.10.11 lxml 5.3.1 marisa-trie 1.2.1 Markdown 3.7 markdown-it-py 3.0.0 MarkupSafe 3.0.2 math-verify 0.5.2 matplotlib 3.10.1 mbstrdecoder 1.1.4 mccabe 0.7.0 mdurl 0.1.2 mistral_common 1.5.3 mpmath 1.3.0 msgpack 1.1.0 msgspec 0.19.0 multidict 6.1.0 multiprocess 0.70.16 murmurhash 1.0.12 nest-asyncio 1.6.0 networkx 3.4.2 ninja 1.11.1.3 nltk 3.9.1 numpy 1.26.4 nvidia-cublas-cu12 12.4.5.8 nvidia-cuda-cupti-cu12 12.4.127 nvidia-cuda-nvrtc-cu12 12.4.127 nvidia-cuda-runtime-cu12 12.4.127 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.2.1.3 nvidia-curand-cu12 10.3.5.147 nvidia-cusolver-cu12 11.6.1.9 nvidia-cusparse-cu12 12.3.1.170 nvidia-ml-py 12.570.86 nvidia-nccl-cu12 2.21.5 nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.4.127 open_clip_torch 2.31.0 open-r1 0.1.0.dev0 /home/ubuntu/src/open-r1/src openai 1.65.5 opencensus 0.11.4 opencensus-context 0.1.3 opencv-python-headless 4.11.0.86 outlines 0.1.11 outlines_core 0.1.26 packaging 24.2 pandas 2.2.3 parameterized 0.9.0 partial-json-parser 0.2.1.1.post5 pathvalidate 3.2.3 pfzy 0.3.4 pillow 11.1.0 pip 25.0.1 platformdirs 4.3.6 pluggy 1.5.0 portalocker 3.1.1 preshed 3.0.9 prometheus_client 0.21.1 prometheus-fastapi-instrumentator 7.0.2 prompt_toolkit 3.0.50 propcache 0.3.0 proto-plus 1.26.1 protobuf 3.20.3 psutil 7.0.0 py-cpuinfo 9.0.0 py-spy 0.4.0 pyarrow 19.0.1 pyasn1 0.6.1 pyasn1_modules 0.4.1 pycocoevalcap 1.2 pycocotools 2.0.8 pycodestyle 2.12.1 pycountry 24.6.1 pydantic 2.10.6 pydantic_core 2.27.2 pyflakes 3.2.0 Pygments 2.19.1 pyparsing 3.2.3 pytablewriter 1.2.1 pytest 8.3.5 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-slugify 8.0.4 pytz 2025.1 PyYAML 6.0.2 pyzmq 26.2.1 RapidFuzz 3.13.0 ray 2.43.0 referencing 0.36.2 regex 2024.11.6 requests 2.32.3 rich 13.9.4 rouge_score 0.1.2 rpds-py 0.23.1 rsa 4.9 ruff 0.9.10 sacrebleu 2.5.1 safetensors 0.5.3 scikit-learn 1.6.1 scipy 1.15.2 
sentence-transformers 4.1.0 sentencepiece 0.2.0 sentry-sdk 2.22.0 setproctitle 1.3.5 setuptools 75.8.2 shapeguard 0.1.1 six 1.17.0 smart-open 6.4.0 smmap 5.0.2 sniffio 1.3.1 spacy 3.7.2 spacy-legacy 3.0.12 spacy-loggers 1.0.5 srsly 2.5.1 starlette 0.46.1 sympy 1.13.1 tabledata 1.3.4 tabulate 0.9.0 tcolorpy 0.1.7 tensorboard 2.19.0 tensorboard-data-server 0.7.2 termcolor 2.3.0 text-unidecode 1.3 thefuzz 0.22.1 thinc 8.2.5 threadpoolctl 3.5.0 tiktoken 0.9.0 timm 1.0.15 tokenizers 0.21.0 torch 2.5.1 torch_fidelity 0.4.0b0 /home/ubuntu/Projects/src/torch-fidelity torchaudio 2.5.1 torchmetrics 1.6.3 torchvision 0.20.1 tqdm 4.67.1 transformers 4.52.1 transformers-stream-generator 0.0.5 triton 3.1.0 trl 0.16.0.dev0 typepy 1.3.4 typer 0.9.4 typing_extensions 4.12.2 tzdata 2025.1 urllib3 2.3.0 uvicorn 0.34.0 uvloop 0.21.0 virtualenv 20.29.3 vllm 0.7.2 wandb 0.19.8 wasabi 1.1.3 watchfiles 1.0.4 wcwidth 0.2.13 weasel 0.3.4 webdataset 0.2.86 webencodings 0.5.1 websockets 15.0.1 Werkzeug 3.1.3 wheel 0.45.1 wrapt 1.17.2 xformers 0.0.28.post3 xgrammar 0.1.15 xxhash 3.5.0 yarl 1.18.3 zipp 3.21.0
Description
Steps to reproduce
I forked the evaluation scripts from https://github.com/QwenLM/Qwen/tree/main/eval and made several adjustments for the Qwen2 model family. Taking https://github.com/QwenLM/Qwen/blob/main/eval/evaluate_humaneval.py as an example, I rewrote the generate_sample function as:
```python
def generate_sample(model, tokenizer, input_txt):
    # input_ids = tokenizer.tokenizer.encode(input_txt)
    # raw_text_len = len(input_ids)
    # context_enc = torch.tensor([input_ids]).to(model.device)
    model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
    print(f"Input text: {input_txt}\n")
    # outputs = model.generate(context_enc)
    # output_text = decode(outputs, tokenizer, raw_text_len)[0]
    generated_ids = model.generate(
        model_inputs.input_ids,
        attention_mask=model_inputs.attention_mask,
        do_sample=True,
        # temperature=0.7,
        # top_p=0.9,
        # num_return_sequences=1
    )
    input_ids_length = model_inputs.input_ids.shape[1]
    generated_ids_without_prompt = [output_ids[input_ids_length:] for output_ids in generated_ids]
    output_text = tokenizer.batch_decode(generated_ids_without_prompt, skip_special_tokens=True)[0]
    print(f"\nOutput text: \n{output_text}\n")
    return output_text
```
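(For context, a minimal sketch of how completions from generate_sample are then scored with the human-eval package listed in the environment above; the sample loop and the file name are illustrative, not taken from the forked script, and `model`/`tokenizer` are assumed to be already loaded.)

```python
# Minimal sketch (illustrative, not the forked script): write one completion
# per HumanEval task and score the resulting file with the human-eval package.
from human_eval.data import read_problems, write_jsonl
from human_eval.evaluation import evaluate_functional_correctness

problems = read_problems()  # {task_id: {"prompt": ..., "test": ..., "entry_point": ...}}
samples = [
    dict(task_id=task_id, completion=generate_sample(model, tokenizer, problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# With a single sample per task, only pass@1 is meaningful.
print(evaluate_functional_correctness("samples.jsonl", k=[1]))
```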
Expected results
The result is expected to be 34.2 according to the original blog post (https://qwenlm.github.io/blog/qwen-moe/), but I got 3.04, which is substantially lower.
Attempts to fix
Anything else helpful for investigation
I also tested Qwen1.5-MoE-A2.7B on the MMLU dataset using [evaluate_mmlu.py](evaluate_mmlu.py) without any modifications. The average result was 61.38 versus the expected 62.5, which seems like a normal outcome for this dataset. So I am wondering whether there is any specific setting for Qwen1.5-MoE-A2.7B on generation tasks like HumanEval or GSM8K.
same problem
I am also curious: where can the evaluation settings for Qwen3 be found? With enable_thinking=False, evaluating Qwen3-32B on AIME24/25 with greedy decoding and zero-shot prompting, I only get scores in the twenties.
Please release official eval code; with my own settings the output either repeats itself or never stops.
> same problem

What problem? Did you also test Qwen1.5?
> I am also curious: where can the evaluation settings for Qwen3 be found?
Open a new issue.
Have the maintainers of this project lost their minds? My question about the evaluation setup details clearly helps resolve this issue: publishing the evaluation details would let developers check for themselves what might cause the gap between measured and advertised performance. The example I gave also fits this issue's title exactly. Rushing in to mark a pile of comments as off-topic without solving the problem is... what, exactly?
Gentle reminder: please use English so that the public can follow what we are discussing. :)
Thank you for your very insightful reminder. However, I'd prefer we didn't bring any more off-topic discussions into this. I hope you're able to resolve your issue soon. : )
For Qwen1.5 models, we indeed use the evaluation code for Qwen1 models.
- For HumanEval, I noticed that you changed `do_sample` from `False` to `True`, compared to the original code. Try setting `do_sample` to `False` and see if you could get a better result. Otherwise, you may need to conduct pass@k evaluation, which may require more time.
- For GSM8K, what's the result you were getting?
In addition, since you were evaluating the base models, I think the Qwen1 evaluation code should run fine. For chat models, the evaluation code needs modifying.
P.S.: Qwen1.5 was released over a year ago, and a lot has changed since then. For the evaluation of later Qwen models, we mostly use open-source toolkits from the community to make reproduction easier. For example, we use OpenCompass for common tasks and evalplus for HumanEval(+)/MBPP(+). Also check out https://github.com/QwenLM/Qwen2.5-Coder/blob/main/qwencoder-eval and https://github.com/QwenLM/QwQ/tree/main/eval for the evaluation of coding and reasoning models if needed.
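(For reference, the pass@k evaluation mentioned above is usually computed with the unbiased estimator from the HumanEval paper; the human-eval package ships it as estimate_pass_at_k. A minimal scalar sketch, where n is the number of samples per task and c the number that pass the tests:)

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), evaluated as a product
    for numerical stability."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per task, 30 of them passing -> pass@1 = 0.15
print(pass_at_k(n=200, c=30, k=1))
```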
Following your suggestion, I changed `do_sample` to `False` and got `'pass@1': np.float64(0.07926829268292683)`. Any thoughts on what may lead to this result? Here is the code I modified from the original scripts.
I suspect the problem may occur in the generation part (both encoding and decoding) for Qwen1.5 and also Qwen3 (the generation implementation in transformers for these models differs from Qwen1). I tried Qwen1 with the original scripts and got good results. Could you provide guidance on how to implement generate_sample correctly for Qwen1.5?
```python
def generate_sample(model, tokenizer, input_txt):
    model_inputs = tokenizer(input_txt, return_tensors="pt").to(model.device)
    print(f"Input text: {input_txt}\n")
    attention_mask = model_inputs['attention_mask']
    generated_ids = model.generate(
        model_inputs.input_ids,
        attention_mask=attention_mask,
        do_sample=False,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(f"\nOutput text: \n{output_text}\n")
    return output_text
```
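(An observation rather than an official answer: with base models, a common cause of very low HumanEval scores is that greedy generation keeps writing past the target function, and the extra text makes the completion fail to execute. HumanEval harnesses commonly truncate the completion before scoring; a minimal sketch of that kind of post-processing is below, with illustrative stop sequences rather than the official Qwen settings.)

```python
# Minimal sketch (an assumption, not the official Qwen post-processing):
# cut a base-model completion at the first top-level stop sequence so that
# only the body written for the prompted function is kept and executed.
STOP_SEQUENCES = ["\nclass ", "\ndef ", "\n#", "\nif __name__", "\nprint("]

def truncate_completion(completion: str) -> str:
    cut = len(completion)
    for stop in STOP_SEQUENCES:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# e.g. output_text = truncate_completion(generate_sample(model, tokenizer, prompt))
```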
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.