mem0
Has anyone successfully reproduced the results in the paper using the evaluation code? My results are worse than those reported in the paper
🐛 Describe the bug
Has anyone successfully reproduced the results in the paper using the evaluation code? My results are worse than those reported in the paper. My test results:
I'm not sure what the cause is. Does anyone have any ideas?
In short: running the same repository code yields consistently lower evaluation scores (bleu_score, f1_score, llm_score) than those published in the paper.
Possible causes:
- Environment mismatch (package/model versions differ from those used for the paper)
- Data preprocessing, tokenization, or evaluation scripts have changed and no longer match the original experiment
- Hidden bugs in the data loader or scoring functions
- Metrics not computed with the same formulas/configuration as in the paper (a quick cross-check is sketched below)
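For the last point, one quick way to spot a formula mismatch is to cross-check the repo's scoring function against a plain reference implementation on a few samples. Below is a minimal sketch of a standard token-overlap F1; the lower-casing and whitespace tokenization are my assumptions, and the repo's actual f1_score may normalize differently, which is exactly what such a comparison would surface.

```python
# Minimal sketch of a token-overlap F1 for cross-checking the repo's f1_score.
# Tokenization/normalization here are assumptions, not the paper's exact setup.
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Run this and the repository's scorer on the same handful of samples;
# any divergence points to a tokenization or normalization mismatch.
print(token_f1("Paris is the capital", "The capital is Paris"))  # 1.0
```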
Potential solutions:
- Strictly verify environment, package, and model versions against those listed in the paper, including random seeds and system settings (see the sketch after this list).
- Compare the current evaluation/data scripts line by line with those used for the paper; revert or update the scripts as needed.
- Ask the maintainers to publish their exact reproducible setup (requirements.txt, dataset hashes, configs, and model checkpoints).
- Add checks and debug prints to trace the scoring logic step by step, validating against the published formulas and a few data samples.
- After fixing the environment and scripts, rerun the experiment and compare the outputs with the paper.
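For the environment item above, something like the following could be used to capture the exact run configuration alongside the results, so different runs can be compared apples to apples. This is only a sketch: the package list, seed value, and dataset path (`data/locomo10.json`) are assumptions for illustration, not the repository's actual layout.

```python
# Sketch: record the environment and a dataset fingerprint next to the results.
# Package names, the seed, and the dataset path are assumptions for illustration.
import hashlib
import json
import platform
import random
import sys
from importlib import metadata


def snapshot_environment(packages=("mem0ai", "openai")):
    """Record Python/OS versions and the versions of key packages."""
    info = {"python": sys.version, "platform": platform.platform(), "packages": {}}
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["packages"][name] = "not installed"
    return info


def dataset_sha256(path):
    """Hash the dataset file so everyone can confirm they evaluate the same data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    random.seed(42)  # fix seeds for any stochastic steps (value is arbitrary)
    report = snapshot_environment()
    # report["dataset_sha256"] = dataset_sha256("data/locomo10.json")  # assumed path
    print(json.dumps(report, indent=2))
```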
Maintainers: would you be able to share your full reproducible setup so contributors can match the reported results? I'm happy to help test once the setup details are available.