unilm
ReSA Eval crashed
Describe the bug
Model: ReSA
The problem arises when running eval_math_local.sh: it crashes with an import error.
Console Output:
/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/path.py:2: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
Traceback (most recent call last):
File "/workspace/unilm/ReSA/llm/eval.py", line 11, in <module>
from eval_math import evaluate as evaluate_math
File "/workspace/unilm/ReSA/llm/eval_math.py", line 5, in <module>
from arch.model import create_kv_cache
File "/workspace/unilm/ReSA/llm/arch/model.py", line 8, in <module>
from apex.normalization.fused_layer_norm import fused_rms_norm_affine
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/apex/__init__.py", line 3, in <module>
from apex.i18n import MessageFactory
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/apex/i18n.py", line 1, in <module>
from pyramid.i18n import TranslationStringFactory
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/i18n.py", line 20, in <module>
from pyramid.threadlocal import get_current_registry
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/threadlocal.py", line 3, in <module>
from pyramid.registry import global_registry
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/registry.py", line 12, in <module>
from pyramid.path import CALLER_PACKAGE, caller_package
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/path.py", line 4, in <module>
import imp
ModuleNotFoundError: No module named 'imp'
E0708 07:17:39.972000 23464 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 23546) of binary: /home/gitpod/.pyenv/versions/3.12.11/bin/python3
Traceback (most recent call last):
File "/workspace/.pyenv_mirror/user/3.12.11/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
eval.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-07-08_07:17:39
host : localhost
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 23546)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Pip List:
Package Version Editable project location
---------------------- ----------- ----------------------------------------
absl-py 2.3.1
accelerate 1.8.1
aiohappyeyeballs 2.6.1
aiohttp 3.12.13
aiosignal 1.4.0
antlr4-python3-runtime 4.11.1
anykeystore 0.2
apex 0.9.10.dev0
chardet 5.2.0
click 8.2.1
colorama 0.4.6
cryptacular 1.6.2
DataProperty 1.1.0
datasets 3.6.0
dill 0.3.8
einops 0.8.1
evaluate 0.4.4
frozenlist 1.7.0
fsspec 2025.3.0
greenlet 3.2.3
hf-xet 1.1.5
huggingface-hub 0.33.2
hupper 1.12.1
joblib 1.5.1
jsonlines 4.0.0
latex2sympy2 1.9.0 /tmp/Qwen2.5-Math/evaluation/latex2sympy
lm_eval 0.4.9
lxml 6.0.0
mbstrdecoder 1.1.4
mpmath 1.3.0
multidict 6.6.3
multiprocess 0.70.16
networkx 3.3
nltk 3.9.1
numexpr 2.11.0
numpy 2.3.1
oauthlib 3.3.1
pandas 2.3.1
PasteDeploy 3.1.0
pathvalidate 3.3.1
pbkdf2 1.3
Pebble 5.1.1
peft 0.16.0
pillow 11.0.0
plaster 1.1.2
plaster-pastedeploy 1.0.1
portalocker 3.2.0
propcache 0.3.2
pyarrow 20.0.0
pybind11 2.13.6
pyramid 1.10.7
pyramid-mailer 0.15.1
pytablewriter 1.2.1
python3-openid 3.2.0
pytz 2025.2
regex 2024.11.6
repoze.sendmail 4.4.1
requests-oauthlib 2.0.0
rouge_score 0.1.2
sacrebleu 2.5.1
safetensors 0.5.3
scikit-learn 1.7.0
scipy 1.16.0
setuptools 80.9.0
SQLAlchemy 2.0.41
sqlitedict 2.1.0
sympy 1.13.3
tabledata 1.3.4
tabulate 0.9.0
tcolorpy 0.1.7
threadpoolctl 3.6.0
timeout-decorator 0.5.0
tokenizers 0.21.2
torch 2.7.1+cpu
torchaudio 2.7.1+cpu
torchvision 0.22.1+cpu
tqdm 4.67.1
tqdm-multiprocess 0.0.11
transaction 5.0
transformers 4.53.1
translationstring 1.4
typepy 1.3.4
tzdata 2025.2
velruse 1.1.1
venusian 3.1.1
WebOb 1.8.9
word2number 1.1
WTForms 3.2.1
wtforms-recaptcha 0.3.2
xxhash 3.5.0
yarl 1.20.1
zope.deprecation 5.1
zope.interface 7.2
zope.sqlalchemy 3.1
zstandard 0.23.0
Expected behavior: The script should run the evaluation and print the results.
- Platform:
- Python version: 3.12
- PyTorch version (GPU?): CPU only
- OS: Linux mengaidev-unilm-vkk0hwuiiok 6.1.139-0601139-generic #202505202314 SMP PREEMPT_DYNAMIC Tue May 20 23:54:01 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
The imp module was removed in Python 3.12, yet it is still being imported here?
This looks like an apex package conflict. You can use the official NVIDIA Docker image. For simplicity, you can also replace the apex RMSNorm with the PyTorch version.
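The PyTorch replacement could look roughly like the sketch below. It is not the repository's code: `rms_norm_affine` is a hypothetical name, and the argument order `(input, weight, normalized_shape, eps)` is assumed to mirror how `fused_rms_norm_affine` is called in `arch/model.py`, so check the call sites before swapping it in.

```python
import torch


def rms_norm_affine(x, weight, normalized_shape, eps=1e-6):
    """Plain-PyTorch RMSNorm with a learned scale over the trailing dimensions."""
    if isinstance(normalized_shape, int):
        normalized_shape = (normalized_shape,)
    dims = tuple(range(-len(normalized_shape), 0))
    # Compute the statistics in float32 for stability, then cast back.
    x32 = x.float()
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=dims, keepdim=True) + eps)
    return (x32 * inv_rms).to(x.dtype) * weight


try:
    # Assumption: a working NVIDIA Apex install exposes this function.
    from apex.normalization.fused_layer_norm import fused_rms_norm_affine
except Exception:
    # Any import failure (including the `imp` error above) falls back to PyTorch.
    fused_rms_norm_affine = rms_norm_affine
```

With this in place of the direct import, the evaluation no longer depends on apex being importable, at the cost of losing the fused kernel. This should not matter for a CPU-only setup like the one reported here.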