sdft
The llama2-7b seed model results look wrong.
I have run the code on both the reproduce branch and the main branch.
The results are far below expectations:
reproduce branch:
Evaluation on gsm8k: Accuracy for math: 26 / 1319 = 1.97%
Evaluation on multiarith: Accuracy for math: 4 / 180 = 2.22%
main branch:
Evaluation on gsm8k: Accuracy for math: 18 / 1319 = 1.36%
Evaluation on multiarith: Accuracy for math: 3 / 180 = 1.67%
Evaluation on OpenFunctions: Accuracy for openfunction: 8 / 112 = 7.14%
What could be causing this? My environment is as follows:
```
Package Version Editable project location
absl-py 2.1.0 accelerate 0.23.0 alpaca_eval 0.6.2 altair 5.3.0 annotated-types 0.6.0 anyio 4.3.0 apex 0.1 archspec 0.2.1 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 asttokens 2.0.5 async-timeout 4.0.3 attrs 23.1.0 backcall 0.2.0 beautifulsoup4 4.12.2 bigcode-eval 0.0.0 sdft/bigcode-evaluation-harness bleach 6.1.0 blis 0.7.11 boltons 23.0.0 Brotli 1.0.9 cachetools 3.1.1 catalogue 2.0.10 certifi 2024.2.2 cffi 1.16.0 chardet 4.0.0 charset-normalizer 2.0.4 click 8.1.7 cloudpathlib 0.16.0 cloudpickle 3.0.0 colorama 0.4.6 comm 0.2.1 conda 24.1.2 conda-build 24.1.2 conda-content-trust 0.2.0 conda_index 0.4.0 conda-libmamba-solver 23.12.0 conda-package-handling 2.2.0 conda_package_streaming 0.9.0 confection 0.1.4 configparser 6.0.1 contourpy 1.2.0 crcmod 1.7 cryptography 42.0.2 cycler 0.12.1 cymem 2.0.8 DataProperty 1.0.1 datasets 2.15.0 debugpy 1.8.1 decorator 5.1.1 deepspeed 0.11.1 defusedxml 0.7.1 deprecation 2.1.0 diffusers 0.27.2 dill 0.3.7 distlib 0.3.8 distro 1.8.0 docopt 0.6.2 docstring_parser 0.16 easydict 1.13 einops 0.7.0 entrypoints 0.4 evaluate 0.4.1 exceptiongroup 1.2.0 executing 0.8.3 fastapi 0.95.1 fastjsonschema 2.19.1 fe 0.3.33 ffmpy 0.3.2 filelock 3.13.1 fire 0.6.0 flash-attention 1.0.0 flash-attn 2.5.6 fonttools 4.50.0 frozenlist 1.4.1 fsspec 2023.6.0 google-auth 2.29.0 gradio 3.38.0 gradio_client 0.7.1 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httpx 0.27.0 huggingface-hub 0.23.0 idna 3.4 importlib_metadata 7.1.0 importlib_resources 6.4.0 intel-openmp 2024.0.2 ipykernel 5.4.2 ipython 7.9.0 ipython-genutils 0.2.0 jedi 0.18.1 jieba 0.42.1 Jinja2 3.1.3 jinjasql 0.1.8 jmespath 0.10.0 joblib 1.3.2 jsonlines 4.0.0 jsonpatch 1.32 jsonpointer 2.1 jsonschema 4.19.2 jsonschema-specifications 2023.7.1 jupyter_client 8.6.0 jupyter_core 5.7.1 jupyterlab_pygments 0.3.0 kiwisolver 1.4.5 kubemaker 0.2.19 kubernetes 9.0.0 langcodes 3.3.0 libarchive-c 2.9 libmambapy 1.5.3 linkify-it-py 2.0.3 llmtuner 0.6.4.dev0 sdft/LLaMA-Factory lm_eval 0.4.1 lxml 5.2.1 markdown-it-py 2.2.0 MarkupSafe 2.1.5 matplotlib 3.8.0 matplotlib-inline 0.1.6 mbstrdecoder 1.1.3 mdit-py-plugins 0.3.3 mdurl 0.1.2 menuinst 2.0.2 mistune 0.8.4 mkl 2024.0.0 mkl-fft 1.3.8 mkl-random 1.2.4 mkl-service 2.4.0 more-itertools 10.1.0 mosestokenizer 1.0.0 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.15 murmurhash 1.0.10 nbclient 0.5.13 nbconvert 6.0.7 nbformat 5.10.2 nest-asyncio 1.6.0 networkx 3.2.1 ninja 1.11.1.1 nltk 3.8.1 notebook 6.4.6 numexpr 2.10.0 numpy 1.23.5 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.4.99 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 openai 1.23.6 openfile 0.0.7 orjson 3.10.1 osscmd 0.4.5 packaging 23.1 pandas 2.2.1 pandocfilters 1.5.1 parso 0.8.3 pathvalidate 3.2.0 patsy 0.5.6 peft 0.6.0 peppercorn 0.6 pexpect 4.8.0 pickleshare 0.7.5 pillow 10.2.0 pip 24.0 pkginfo 1.9.6 platformdirs 3.10.0 pluggy 1.0.0 portalocker 2.8.2 preshed 3.0.9 prometheus_client 0.20.0 prompt-toolkit 2.0.10 protobuf 4.24.4 psutil 5.9.0 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyaml 21.10.1 pyarrow 15.0.1 pyarrow-hotfix 0.6 pyasn1 0.6.0 pyasn1_modules 0.4.0 pybind11 2.12.0 pycosat 0.6.6 pycparser 2.21 pycryptodome 3.20.0 pydantic 1.10.11 pydantic_core 2.18.2 pydub 0.25.1 pyext 0.7 Pygments 2.15.1 pyhocon 0.3.60 pyinotify 0.9.6 pynvml 11.5.0 pyodps 0.11.4
pyOpenSSL 24.0.0 pyparsing 2.4.5 PySocks 1.7.1 pytablewriter 1.2.0 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2023.3.post1 PyYAML 6.0.1 pyzmq 25.1.2 referencing 0.30.2 regex 2023.12.25 requests 2.31.0 requests-oauthlib 2.0.0 responses 0.18.0 rich 13.7.1 rouge-chinese 1.0.3 rouge-score 0.1.2 rpds-py 0.10.6 rsa 4.9 ruamel.yaml 0.17.21 ruamel.yaml.clib 0.2.6 ruff 0.4.2 sacrebleu 2.4.2 safetensors 0.4.2 scikit-learn 1.4.2 scipy 1.11.3 seaborn 0.13.2 semantic-version 2.10.0 Send2Trash 1.8.2 sentence-transformers 2.2.2 sentencepiece 0.1.99 setuptools 68.2.2 shellingham 1.5.4 shtab 1.7.1 six 1.16.0 smart-open 6.4.0 sniffio 1.3.1 soupsieve 2.5 spacy 3.7.4 spacy-legacy 3.0.12 spacy-loggers 1.0.5 sqlitedict 2.1.0 sqlparse 0.4.4 srsly 2.4.8 sse-starlette 1.6.5 stack-data 0.2.0 starlette 0.26.1 sympy 1.12 tabledata 1.3.3 tabulate 0.9.0 tbb 2021.11.0 tcolorpy 0.1.6 tensorboardX 2.6.2.2 termcolor 2.4.0 terminado 0.18.1 testpath 0.6.0 thinc 8.2.3 threadpoolctl 3.4.0 tiktoken 0.5.1 timm 0.9.16 tokenizers 0.14.1 tomli 2.0.1 tomlkit 0.12.0 toolwrapper 2.1.0 toolz 0.12.1 torch 2.1.0 torchaudio 2.1.0 torchvision 0.16.0 tornado 6.4 tqdm 4.65.0 tqdm-multiprocess 0.0.11 traitlets 5.7.1 transformer_engine 1.4.0+0fbc76a transformers 4.34.1 transformers-stream-generator 0.0.5 triton 2.1.0 trl 0.7.4 truststore 0.8.0 typeguard 4.1.5 typepy 1.3.2 typer 0.12.3 typing_extensions 4.10.0 tyro 0.8.3 tzdata 2024.1 uc-micro-py 1.0.3 urllib3 1.26.4 uvicorn 0.23.2 virtualenv 20.25.1 wasabi 1.1.2 wcwidth 0.2.5 weasel 0.3.4 webencodings 0.5.1 websocket-client 1.7.0 websockets 11.0.3 wfbuilder 1.0.56.43 wget 3.2 wheel 0.41.3 word2number 1.1 xformers 0.0.22.post7 xxhash 3.4.1 yarl 1.9.4 zipp 3.18.1 zstandard 0.19.0
```
@447428054 Thanks for your interest! Since the seed LM should be an instruction-tuned LLM, may I ask whether you used the correct version of the model? The chat model must be used instead of the base model for these experiments. If this adjustment does not resolve the issue, please share some of the outputs generated during evaluation on the GSM8K dataset, which should be located at predictions/seed/gsm8k/generated_predictions.jsonl, for further analysis.
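For reference, a minimal sketch for peeking at the first few records of that file (assuming it is standard JSONL; the exact field names depend on the evaluation script and may differ):

```python
import json

# Hypothetical quick look at the first few evaluation outputs.
# Adjust the path if your output layout differs.
path = "predictions/seed/gsm8k/generated_predictions.jsonl"

with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:
            break
        record = json.loads(line)
        # Print whatever keys are present (e.g. prompt / predict / label).
        for key, value in record.items():
            print(f"{key}: {str(value)[:200]}")
        print("-" * 40)
```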
@rickyang1114 Thanks for the reply. I re-ran with llama2-7b-chat, downloaded from https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary
The other results now look normal, but the OpenFunctions score still differs noticeably:
Evaluation on seed LM.
Evaluation on gsm8k: Accuracy for math: 383 / 1319 = 29.04%
Evaluation on multiarith: Accuracy for math: 135 / 180 = 75.00%
Evaluation on OpenFunctions: Accuracy for openfunction: 11 / 112 = 9.82%
Evaluation on HumanEval: Accuracy for HumanEval: 14.63%
Evaluation on raw safety: file: predictions/seed/Llama-2-chat-7b/advbench-raw/generated_predictions.jsonl, safe_rate: 99.42%
Evaluation on jailbreak safety: file: predictions/seed/Llama-2-chat-7b/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 93.85%
Evaluation on MMLU: Average: 46.43 STEM: 35.83 Social Sciences: 53.05 Humanities: 43.37 Other: 54.43
Evaluation on OpenLLM Leaderboard: Accuracy for truthfulqa: 35.18% Accuracy for ai2_arc: 64.06% Accuracy for hellaswag: 57.78% Accuracy for winogrande: 66.46%
The OpenFunctions test set contains only 112 instances, which could contribute to the observed variability. Our experimental results were obtained with PyTorch 2.1.0+cu118 on Ubuntu 18.04.
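As a back-of-the-envelope illustration (a quick sketch assuming independent test cases, not part of the benchmark code), the standard error of an accuracy estimated from only 112 samples is around 2-3 percentage points, so a difference of a few correct answers is within noise:

```python
import math

# 1-sigma uncertainty of an accuracy estimate from n binary outcomes.
n = 112
for correct in (8, 11):
    p = correct / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    print(f"{correct}/{n} = {p:.2%}  (± {se:.2%} standard error)")
```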