ReSA Eval crashed

Open MengAiDev opened this issue 5 months ago • 2 comments

Describe the bug

Model: ReSA

When I run eval_math_local.sh, it crashes with an import error.

Console Output:

/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/path.py:2: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
Traceback (most recent call last):
  File "/workspace/unilm/ReSA/llm/eval.py", line 11, in <module>
    from eval_math import evaluate as evaluate_math
  File "/workspace/unilm/ReSA/llm/eval_math.py", line 5, in <module>
    from arch.model import create_kv_cache
  File "/workspace/unilm/ReSA/llm/arch/model.py", line 8, in <module>
    from apex.normalization.fused_layer_norm import fused_rms_norm_affine
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/apex/__init__.py", line 3, in <module>
    from apex.i18n import MessageFactory
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/apex/i18n.py", line 1, in <module>
    from pyramid.i18n import TranslationStringFactory
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/i18n.py", line 20, in <module>
    from pyramid.threadlocal import get_current_registry
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/threadlocal.py", line 3, in <module>
    from pyramid.registry import global_registry
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/registry.py", line 12, in <module>
    from pyramid.path import CALLER_PACKAGE, caller_package
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/pyramid/path.py", line 4, in <module>
    import imp
ModuleNotFoundError: No module named 'imp'
E0708 07:17:39.972000 23464 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 23546) of binary: /home/gitpod/.pyenv/versions/3.12.11/bin/python3
Traceback (most recent call last):
  File "/workspace/.pyenv_mirror/user/3.12.11/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.pyenv_mirror/user/current/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
eval.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-07-08_07:17:39
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 23546)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Pip List:

Package                Version     Editable project location
---------------------- ----------- ----------------------------------------
absl-py                2.3.1
accelerate             1.8.1
aiohappyeyeballs       2.6.1
aiohttp                3.12.13
aiosignal              1.4.0
antlr4-python3-runtime 4.11.1
anykeystore            0.2
apex                   0.9.10.dev0
chardet                5.2.0
click                  8.2.1
colorama               0.4.6
cryptacular            1.6.2
DataProperty           1.1.0
datasets               3.6.0
dill                   0.3.8
einops                 0.8.1
evaluate               0.4.4
frozenlist             1.7.0
fsspec                 2025.3.0
greenlet               3.2.3
hf-xet                 1.1.5
huggingface-hub        0.33.2
hupper                 1.12.1
joblib                 1.5.1
jsonlines              4.0.0
latex2sympy2           1.9.0       /tmp/Qwen2.5-Math/evaluation/latex2sympy
lm_eval                0.4.9
lxml                   6.0.0
mbstrdecoder           1.1.4
mpmath                 1.3.0
multidict              6.6.3
multiprocess           0.70.16
networkx               3.3
nltk                   3.9.1
numexpr                2.11.0
numpy                  2.3.1
oauthlib               3.3.1
pandas                 2.3.1
PasteDeploy            3.1.0
pathvalidate           3.3.1
pbkdf2                 1.3
Pebble                 5.1.1
peft                   0.16.0
pillow                 11.0.0
plaster                1.1.2
plaster-pastedeploy    1.0.1
portalocker            3.2.0
propcache              0.3.2
pyarrow                20.0.0
pybind11               2.13.6
pyramid                1.10.7
pyramid-mailer         0.15.1
pytablewriter          1.2.1
python3-openid         3.2.0
pytz                   2025.2
regex                  2024.11.6
repoze.sendmail        4.4.1
requests-oauthlib      2.0.0
rouge_score            0.1.2
sacrebleu              2.5.1
safetensors            0.5.3
scikit-learn           1.7.0
scipy                  1.16.0
setuptools             80.9.0
SQLAlchemy             2.0.41
sqlitedict             2.1.0
sympy                  1.13.3
tabledata              1.3.4
tabulate               0.9.0
tcolorpy               0.1.7
threadpoolctl          3.6.0
timeout-decorator      0.5.0
tokenizers             0.21.2
torch                  2.7.1+cpu
torchaudio             2.7.1+cpu
torchvision            0.22.1+cpu
tqdm                   4.67.1
tqdm-multiprocess      0.0.11
transaction            5.0
transformers           4.53.1
translationstring      1.4
typepy                 1.3.4
tzdata                 2025.2
velruse                1.1.1
venusian               3.1.1
WebOb                  1.8.9
word2number            1.1
WTForms                3.2.1
wtforms-recaptcha      0.3.2
xxhash                 3.5.0
yarl                   1.20.1
zope.deprecation       5.1
zope.interface         7.2
zope.sqlalchemy        3.1
zstandard              0.23.0

Expected behavior: The evaluation should run and produce its output.

  • Platform:
  • Python version: 3.12
  • PyTorch version (GPU?): CPU only
  • OS: Linux mengaidev-unilm-vkk0hwuiiok 6.1.139-0601139-generic #202505202314 SMP PREEMPT_DYNAMIC Tue May 20 23:54:01 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

MengAiDev avatar Jul 08 '25 07:07 MengAiDev

The imp module was removed in Python 3.12, wasn't it?
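
For reference, a quick way to check this on any interpreter, sketched with only the standard library. `module_available` is a hypothetical helper name used for illustration, not something from the repo:

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# On Python 3.12 and later, the standard library no longer ships `imp`.
print("Python", sys.version_info[:2], "| imp available:", module_available("imp"))
```

If this reports that `imp` is unavailable, any installed package that still runs `import imp` at import time (pyramid 1.10.7 in the traceback above) will fail the same way.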

MengAiDev avatar Jul 08 '25 07:07 MengAiDev

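This looks like an apex package conflict: the traceback shows the installed apex importing pyramid, so it is not NVIDIA's Apex. You can use the official NVIDIA Docker image instead. For simplicity, you can also replace the apex RMSNorm with the PyTorch version.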
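
A minimal sketch of that replacement, assuming arch/model.py calls `fused_rms_norm_affine(input, weight, normalized_shape, eps)` in the same argument order as NVIDIA Apex. The call sites are not shown in this thread, so check them before swapping, and note that `rms_norm_affine` is a name made up for this example. It normalizes in float32 over the trailing dimensions and then applies the weight:

```python
import torch


def rms_norm_affine(x, weight, normalized_shape, eps=1e-6):
    """Plain PyTorch RMSNorm with a learned scale (no fused kernel)."""
    if isinstance(normalized_shape, int):
        normalized_shape = (normalized_shape,)
    # Reduce over the trailing dimensions named by normalized_shape.
    dims = tuple(range(-len(normalized_shape), 0))
    x32 = x.float()  # compute statistics in float32 for stability
    inv_rms = torch.rsqrt(x32.pow(2).mean(dim=dims, keepdim=True) + eps)
    return (x32 * inv_rms).to(x.dtype) * weight


try:
    # Works only when NVIDIA Apex is the package installed under the name "apex".
    from apex.normalization.fused_layer_norm import fused_rms_norm_affine
except ImportError:
    # Assumption: falling back to the unfused version is acceptable for eval.
    fused_rms_norm_affine = rms_norm_affine
```

Used in place of the import on line 8 of arch/model.py, this catches the `ModuleNotFoundError` raised by the unrelated apex package (it is a subclass of `ImportError`) and falls back to the pure PyTorch path. Results may differ slightly from the fused kernel because of floating-point rounding.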

sunyt32 avatar Jul 30 '25 09:07 sunyt32