
It's completely broken

Open willmil11 opened this issue 1 year ago • 14 comments

Hi, so let me describe the problem: I clone the repo, cd into it, and run python3 train.py, but I get dependency errors. I go to the repo's issues tab, see the requirements.txt issue, and find a list of the packages required to run this, posted by another user. I pip install those, rerun, and get this:

Traceback (most recent call last):
  File "/home/vscode/lamda/LaMDA-rlhf-pytorch/train.py", line 4, in <module>
    from colossalai.core import global_context as gpc
ModuleNotFoundError: No module named 'colossalai.core'

I'm on Debian using the latest version of Python, with packages installed system-wide via pip. I also tried with a venv; it didn't work any better.

pip list returns: Package Version


aiohttp 3.9.3 aiosignal 1.3.1 annotated-types 0.6.0 appdirs 1.4.4 attrs 23.2.0 bcrypt 4.1.2 beautifulsoup4 4.12.3 certifi 2022.9.24 cffi 1.16.0 cfgv 3.4.0 chardet 5.1.0 charset-normalizer 3.0.1 click 8.1.7 colossalai 0.3.5 contexttimer 0.3.3 cryptography 38.0.4 cupshelpers 1.0 datasets 2.17.0 dbus-python 1.3.2 debtcollector 2.5.0 decorator 5.1.1 Deprecated 1.2.13 dill 0.3.8 distlib 0.3.8 distro 1.8.0 docker-pycreds 0.4.0 docutils 0.19 einops 0.7.0 fabric 3.2.2 filelock 3.13.1 frozenlist 1.4.1 fsspec 2023.10.0 gitdb 4.0.11 GitPython 3.1.41 google 3.0.0 greenlet 2.0.2 huggingface-hub 0.20.3 identify 2.5.34 idna 3.3 importlib-metadata 4.12.0 invoke 2.2.0 iso8601 1.0.2 Jinja2 3.1.3 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 jwcrypto 1.1 markdown-it-py 3.0.0 MarkupSafe 2.1.5 mdurl 0.1.2 more-itertools 8.10.0 mpmath 1.3.0 msgpack 1.0.3 multidict 6.0.5 multiprocess 0.70.16 netaddr 0.8.0 netifaces 0.11.0 networkx 3.2.1 ninja 1.11.1.1 nodeenv 1.8.0 numpy 1.24.2 olefile 0.46 oslo.config 9.0.0 oslo.context 5.0.0 oslo.i18n 5.1.0 oslo.log 5.0.1 oslo.serialization 5.0.0 oslo.utils 6.0.1 packaging 23.0 pandas 2.2.0 paramiko 3.4.0 pbr 5.10.0 Pillow 9.4.0 pip 23.0.1 platformdirs 4.2.0 pre-commit 3.6.1 protobuf 4.25.2 psutil 5.9.4 pyarrow 15.0.0 pyarrow-hotfix 0.6 pycairo 1.20.1 pycparser 2.21 pycups 2.0.1 pydantic 2.6.1 pydantic_core 2.16.2 Pygments 2.14.0 PyGObject 3.42.2 pyinotify 0.9.6 PyNaCl 1.5.0 pynvim 0.4.2 pyparsing 3.0.9 pysmbc 1.0.23 python-dateutil 2.8.2 python-novnc 1.0.0 pytz 2022.7.1 PyYAML 6.0 ray 2.9.2 referencing 0.33.0 regex 2023.12.25 requests 2.28.1 rfc3986 1.5.0 rich 13.7.0 roman 3.3 rpds-py 0.17.1 safetensors 0.4.2 sentencepiece 0.1.99 sentry-sdk 1.40.3 setproctitle 1.3.3 setuptools 66.1.1 six 1.16.0 smmap 5.0.1 soupsieve 2.5 stevedore 4.0.2 sympy 1.12 tokenizers 0.15.1 torch 2.2.0 tqdm 4.66.2 transformers 4.37.2 typing_extensions 4.9.0 tzdata 2023.4 urllib3 1.26.12 virtualenv 20.25.0 wandb 0.16.3 websockify 0.10.0 wheel 0.38.4 wrapt 1.14.1 xxhash 3.4.1 yarl 1.9.4 zipp 1.0.0

willmil11 avatar Feb 11 '24 11:02 willmil11

Please help me :(

willmil11 avatar Feb 11 '24 11:02 willmil11

What do I do? I just want to chat with LaMDA :(

willmil11 avatar Feb 11 '24 11:02 willmil11

Hi @willmil11

Thanks for sharing the traceback and log.

Traceback (most recent call last):
  File "/home/vscode/lamda/LaMDA-rlhf-pytorch/train.py", line 4, in <module>
    from colossalai.core import global_context as gpc
ModuleNotFoundError: No module named 'colossalai.core'

I believe the problem is with colossalai itself, or with your Python version's compatibility with that package. You can try to reproduce this issue in ipython/python by running from colossalai.core import global_context as gpc directly. If the colossalai 0.3.5 you have is compatible with your Python, then I would recommend manually downgrading colossalai.

nicolay-r avatar Feb 11 '24 11:02 nicolay-r
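[Editor's note] The import check suggested above can also be done programmatically without crashing a script; a minimal sketch (the probe helper below is our own illustration, not part of colossalai):

```python
import importlib

def probe(name: str) -> bool:
    """Return True if the named module imports cleanly, False if it is missing."""
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False

# Probe the exact module path the traceback complains about; with
# colossalai 0.3.5 installed, 'colossalai.core' is expected to be absent.
for mod in ("colossalai", "colossalai.core"):
    print(mod, "importable:", probe(mod))
```

Running this pins down whether the whole package is missing or only the colossalai.core submodule was removed in the installed release.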

Hi @nicolay-r, thanks for your quick response, but I still have a question.

To what version do I downgrade colossalai, and how do I do that? And lastly, will I be able to directly chat with LaMDA after the training is complete? How?

willmil11 avatar Feb 11 '24 11:02 willmil11

@willmil11, I did not experiment with LaMDA through this project, since I was not able to launch training (on Google Colab). That was almost a year ago. If anything related to LoRA and parameter-efficient tuning has changed since then, it is worth a try. You may try to launch it and judge by the loss during the training process, as well as the overall training time it takes.

As for colossalai, you may have a look at the available versions: https://pypi.org/project/colossalai/#history The quickest way, I believe, is to manually downgrade to the most recent prior version, and so on.

nicolay-r avatar Feb 11 '24 12:02 nicolay-r

Another option is to check whether colossalai actually works, using its unit tests and related examples.

nicolay-r avatar Feb 11 '24 12:02 nicolay-r

@nicolay-r I see what you're saying, but I still don't understand what version of colossalai you suggest I downgrade to (do you want me to try them all? There are hundreds on the GitHub page but only about 10 on pypi.org).

willmil11 avatar Feb 11 '24 12:02 willmil11

@willmil11, I mean the releases from PyPI: https://pypi.org/project/colossalai/#history I can't recommend a specific one in your particular case, since there might be other dependency rules. So you can try 0.3.4 first and then work down towards 0.3.0.

nicolay-r avatar Feb 11 '24 14:02 nicolay-r
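[Editor's note] When walking down through releases like this, it helps to confirm which version pip actually left installed before each re-run of train.py; a small sketch using only the standard library (importlib.metadata, available since Python 3.8):

```python
from importlib import metadata

def installed_version(dist: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None

# After each `pip install colossalai==<version>` attempt, check what is
# actually present before re-running train.py:
print("colossalai:", installed_version("colossalai"))
```

This avoids chasing errors against a version other than the one you think you installed (easy to do when mixing system-wide pip and venvs).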

@willmil11, did the version switching sort out the issue?

nicolay-r avatar Feb 14 '24 12:02 nicolay-r

@nicolay-r 0.3.4 said that colossalai.core doesn't exist, and 0.3.0 said:

Traceback (most recent call last):
  File "/home/vscode/lamda/LaMDA-rlhf-pytorch/train.py", line 3, in <module>
    import colossalai
  File "/usr/local/lib/python3.11/dist-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/usr/local/lib/python3.11/dist-packages/colossalai/initialize.py", line 18, in <module>
    from colossalai.amp import AMP_TYPE, convert_to_amp
  File "/usr/local/lib/python3.11/dist-packages/colossalai/amp/__init__.py", line 8, in <module>
    from colossalai.context import Config
  File "/usr/local/lib/python3.11/dist-packages/colossalai/context/__init__.py", line 4, in <module>
    from .moe_context import MOE_CONTEXT
  File "/usr/local/lib/python3.11/dist-packages/colossalai/context/moe_context.py", line 8, in <module>
    from colossalai.tensor import ProcessGroup
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/__init__.py", line 2, in <module>
    from .colo_parameter import ColoParameter
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/colo_parameter.py", line 5, in <module>
    from colossalai.tensor.colo_tensor import ColoTensor
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/colo_tensor.py", line 11, in <module>
    from colossalai.tensor.tensor_spec import ColoTensorSpec
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/tensor_spec.py", line 10, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1220, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1210, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'colossalai.tensor.distspec._DistSpec'> for field dist_attr is not allowed: use default_factory

Let me try previous versions...

willmil11 avatar Feb 14 '24 19:02 willmil11

Tried everything from 0.3.0 down to 0.2.0, which is the oldest version available on pip; it didn't work, same error.

willmil11 avatar Feb 14 '24 19:02 willmil11

And here's what I get if I run it with the colossalai command:

root@rpi4-20220808:/home/vscode/lamda/LaMDA-rlhf-pytorch# colossalai run --nproc_per_node 1 train.py
Traceback (most recent call last):
  File "/usr/local/bin/colossalai", line 5, in <module>
    from colossalai.cli import cli
  File "/usr/local/lib/python3.11/dist-packages/colossalai/__init__.py", line 1, in <module>
    from .initialize import (
  File "/usr/local/lib/python3.11/dist-packages/colossalai/initialize.py", line 18, in <module>
    from colossalai.amp import AMP_TYPE, convert_to_amp
  File "/usr/local/lib/python3.11/dist-packages/colossalai/amp/__init__.py", line 8, in <module>
    from colossalai.context import Config
  File "/usr/local/lib/python3.11/dist-packages/colossalai/context/__init__.py", line 4, in <module>
    from .moe_context import MOE_CONTEXT
  File "/usr/local/lib/python3.11/dist-packages/colossalai/context/moe_context.py", line 8, in <module>
    from colossalai.tensor import ProcessGroup
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/__init__.py", line 2, in <module>
    from .colo_parameter import ColoParameter
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/colo_parameter.py", line 5, in <module>
    from colossalai.tensor.colo_tensor import ColoTensor
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/colo_tensor.py", line 11, in <module>
    from colossalai.tensor.tensor_spec import ColoTensorSpec
  File "/usr/local/lib/python3.11/dist-packages/colossalai/tensor/tensor_spec.py", line 10, in <module>
    @dataclass
     ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1220, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 1210, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'colossalai.tensor.distspec._DistSpec'> for field dist_attr is not allowed: use default_factory

willmil11 avatar Feb 14 '24 19:02 willmil11

Hi @willmil11! You need to try other versions down from 0.3.4 as well. For example, 0.3.3 does not have the issue with colossalai.core on Python 3.10.12, so you may switch to that. Notebook: https://colab.research.google.com/drive/1x2KrBCR1pAd5Huk6qCMmPlq8SyQDGdes?usp=sharing In terms of your most recent exception, I believe the most recent Python (3.11+) is not compatible: https://github.com/python/cpython/issues/99401

nicolay-r avatar Feb 15 '24 09:02 nicolay-r
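[Editor's note] The CPython change linked above explains the ValueError in the tracebacks: starting with Python 3.11, dataclasses reject any field default whose type is unhashable (for example, a class defining __eq__ without __hash__), not just list/dict/set. A minimal sketch of the failure and the default_factory fix, with a hypothetical Spec class standing in for colossalai's _DistSpec:

```python
import sys
from dataclasses import dataclass, field

class Spec:
    """Stand-in for colossalai's _DistSpec: defining __eq__ without
    __hash__ makes instances unhashable, which 3.11+ rejects as defaults."""
    def __eq__(self, other):
        return isinstance(other, Spec)

def make_broken():
    # On Python 3.11+ this raises at class-creation time:
    # ValueError: mutable default <class 'Spec'> for field dist_attr ...
    @dataclass
    class Broken:
        dist_attr: Spec = Spec()
    return Broken

try:
    make_broken()
    raised = False
except ValueError:
    raised = True

# The standard fix: build the default lazily with default_factory instead
# of sharing one mutable instance across all dataclass instances.
@dataclass
class Fixed:
    dist_attr: Spec = field(default_factory=Spec)

print("old-style default rejected on this Python:", raised)
print("default_factory works:", isinstance(Fixed().dist_attr, Spec))
```

This is why no colossalai release shipping the old-style default can import on Python 3.11, regardless of how far you downgrade; using Python 3.10 (as in the notebook) sidesteps it.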

@nicolay-r The latest version of Python is not compatible? You could have begun with that. Let me try with the one you specified...

willmil11 avatar Feb 15 '24 12:02 willmil11