vall-e

Fix missing DeepSpeedConfig for deepspeed v0.9.1

Open kgasenzer opened this issue 1 year ago • 12 comments

Following the comment in https://github.com/microsoft/DeepSpeed/issues/3309 and the discussions in #87 and #81, I added a fix for the missing config_class in trainer.Engine by adding snippets from deepspeed/__init__.py.

  • Distributed training is initialized by default, in the same way deepspeed.initialize does it.
  • config_class is provided as a deepspeed.runtime.config.DeepSpeedConfig instance.

This fixes the problem in #87 as well as in #81.

This is only necessary if you want to use deepspeed>=0.9.1.
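As a rough illustration of that version condition, the check below decides whether an installed DeepSpeed falls in the range that needs the fix. It is only a sketch: the helper names (`parse_version`, `needs_config_class`, `installed_deepspeed_version`) are hypothetical and not part of vall-e or DeepSpeed; only the 0.9.1 threshold comes from this thread.

```python
import re
from importlib import metadata

# Threshold taken from the description above; everything else is illustrative.
FIX_NEEDED_FROM = (0, 9, 1)


def parse_version(version: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def needs_config_class(version: str) -> bool:
    """Return True if this DeepSpeed version is in the range the fix targets."""
    return parse_version(version) >= FIX_NEEDED_FROM


def installed_deepspeed_version():
    """Return the installed deepspeed version string, or None if not installed."""
    try:
        return metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_deepspeed_version()
    if found is None:
        print("deepspeed is not installed")
    else:
        print(f"deepspeed {found}: fix needed = {needs_config_class(found)}")
```

Local build suffixes such as `+cu117` are ignored because only the leading three numbers are compared.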

kgasenzer avatar Apr 25 '23 13:04 kgasenzer

I am still getting this issue: AttributeError: 'NoneType' object has no attribute 'optimizer_name'. Can you please let me know how to fix it?

Onkarsus13 avatar Apr 25 '23 14:04 Onkarsus13

I am still getting this issue: AttributeError: 'NoneType' object has no attribute 'optimizer_name'. Can you please let me know how to fix it?

Could you give more information about what you tried to do and at what point you encountered this issue? For me this works fine with config/test/ar.yml.

kgasenzer avatar Apr 25 '23 14:04 kgasenzer

I just cloned the repo with git and tried to run it, and I got this issue.

Onkarsus13 avatar Apr 25 '23 14:04 Onkarsus13

This is what is in my test/ar.yml:

data_dirs: [data/test]

model: ar-quarter
batch_size: 1
eval_batch_size: 1
save_ckpt_every: 500
eval_every: 500
max_iter: 1000

Onkarsus13 avatar Apr 25 '23 14:04 Onkarsus13

I just cloned the repo with git and tried to run it, and I got this issue.

  • Did you check out my pull request with gh pr checkout 92? In vall_e/train.py there should be a change in load_engines().
  • Did you encounter this when running python -m vall_e.train yaml=config/test/ar.yml?

kgasenzer avatar Apr 25 '23 14:04 kgasenzer

Yes, when I ran "python -m vall_e.train yaml=config/test/ar.yml" I encountered the issue.

Onkarsus13 avatar Apr 25 '23 14:04 Onkarsus13

I saw your PR 92, but I don't understand where to put the code snippet.

Onkarsus13 avatar Apr 25 '23 14:04 Onkarsus13

I saw your PR 92, but I don't understand where to put the code snippet.

  1. Open your terminal and navigate to the folder of the repository.
  2. If you cloned the repository with git, you should be able to get my version by running gh pr checkout 92.
  3. If it is still not working, try it with my fork of this repository: kgasenzer/vall-e

kgasenzer avatar Apr 25 '23 14:04 kgasenzer

This is the new error I am getting: AttributeError: 'PosixPath' object has no attribute 'log_dir'

Onkarsus13 avatar Apr 25 '23 15:04 Onkarsus13

I think an easier solution would be to pin deepspeed==0.8.3.

JonathanColetti avatar May 29 '23 19:05 JonathanColetti

Please reopen this PR. It's only a matter of time until we need to update the code to support a version of DeepSpeed > 0.8.3.

aleb avatar Jun 21 '23 07:06 aleb

With DeepSpeed 0.9.4 I get this error:

(venv) $ pip uninstall deepspeed && pip install deepspeed==0.9.4

(venv) $ python -m vall_e.train yaml=ar.yml
  File "/vall-e/vall_e/train.py", line 32, in load_engines
    dist.init_distributed(dist_backend=dist_backend)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
    init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
    rank = int(os.environ["RANK"])
               ~~~~~~~~~~^^^^^^^^
  File "<frozen os>", line 679, in __getitem__
KeyError: 'RANK'

It gets past these errors when specifying the following env vars:

(venv) $ RANK=0 WORLD_SIZE=1 python -m vall_e.train yaml=ar.yml
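The same workaround could be applied from Python before the distributed backend is initialized. This is a minimal sketch, assuming a single-process run; the function name `apply_single_process_defaults` is hypothetical, and the only values used are the RANK=0 and WORLD_SIZE=1 from the command above.

```python
import os

# Values copied from the command line above; they describe a one-process run.
SINGLE_PROCESS_DEFAULTS = {"RANK": "0", "WORLD_SIZE": "1"}


def apply_single_process_defaults(env=None):
    """Fill in RANK and WORLD_SIZE if a launcher has not already set them.

    Existing values are left untouched so a real multi-process launch
    keeps its own settings. Returns the entries that were added.
    """
    env = os.environ if env is None else env
    added = {}
    for key, value in SINGLE_PROCESS_DEFAULTS.items():
        if key not in env:
            env[key] = value
            added[key] = value
    return added


if __name__ == "__main__":
    # Call this before any code that reads os.environ["RANK"].
    print("added:", apply_single_process_defaults())
```

Only missing keys are written, so running under a launcher that already exports these variables changes nothing.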

DeepSpeed needs to be hacked to be able to get past https://github.com/microsoft/DeepSpeed/issues/826:

  File "/vall-e/vall_e/train.py", line 146, in <module>
    main()
  File "/vall-e/vall_e/train.py", line 137, in main
    trainer.train(
  File "/vall-e/vall_e/utils/trainer.py", line 125, in train
    engines = engines_loader()
              ^^^^^^^^^^^^^^^^
  File "/vall-e/vall_e/train.py", line 35, in load_engines
    dist.init_distributed(dist_backend=dist_backend)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 615, in init_distributed
    mpi_discovery(distributed_port=distributed_port, verbose=verbose)
  File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 643, in mpi_discovery
    result = subprocess.check_output(hostname_cmd, shell=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.
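To check whether the local hostname binary accepts the -I flag at all, independent of DeepSpeed, a small probe like the following could help. It is a diagnostic sketch only: the `probe` helper is hypothetical, and it runs the command as an argument list rather than through a shell as the traceback shows DeepSpeed doing.

```python
import subprocess
import sys


def probe(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout).

    Never raises on a non-zero exit status; returns (None, "") when the
    executable cannot be found.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return None, ""
    return done.returncode, done.stdout.strip()


if __name__ == "__main__":
    code, out = probe(["hostname", "-I"])
    if code == 0:
        print("hostname -I works:", out)
    else:
        # A non-zero status here matches the CalledProcessError above.
        print(f"hostname -I failed with exit status {code}", file=sys.stderr)
```

If the probe reports a non-zero status, the failure comes from the system's hostname implementation rather than from vall-e itself.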

aleb avatar Jun 21 '23 08:06 aleb