vall-e
Fix missing DeepSpeedConfig for deepspeed v0.9.1
Following the comment in https://github.com/microsoft/DeepSpeed/issues/3309 and the discussions in #87 and #81, I added a fix for the missing config_class in trainer.Engine by adding snippets from deepspeed/__init__.py.
- Distributed training is initialized by default, in the same way as deepspeed.initialize does.
- config_class is provided as a deepspeed.runtime.config.DeepSpeedConfig instance.
This fixes the problem in #87 as well as in #81.
This is only necessary if you want to use deepspeed>=0.9.1.
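Since the workaround only matters for deepspeed>=0.9.1, one way to leave older installs untouched is to gate it on the installed version. The following is a minimal sketch and not code from this PR: the helper names release_tuple and needs_config_class are hypothetical, and it assumes the version string begins with dotted integers.

```python
import re


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components of a version string.

    Parsing stops at the first component that does not start with a digit,
    so local suffixes such as '+cu118' are ignored.
    """
    numbers = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def needs_config_class(version: str) -> bool:
    """True when the version is at or above 0.9.1, the threshold named in this PR."""
    return release_tuple(version) >= (0, 9, 1)


# Hypothetical usage (assumes deepspeed is installed and exposes __version__):
#   import deepspeed
#   if needs_config_class(deepspeed.__version__):
#       ...build the DeepSpeedConfig and pass it as config_class...
```

Comparing integer tuples rather than raw strings keeps a version such as 0.10.0 ordered after 0.9.1.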
I am still getting this issue: AttributeError: 'NoneType' object has no attribute 'optimizer_name'. Can you please let me know what to do?
Could you give more information about what you tried to do and at what point you encountered this issue? For me this works fine with config/test/ar.yml.
I just cloned the repo with git and tried to run it, and I got this issue.
data_dirs: [data/test]
model: ar-quarter
batch_size: 1
eval_batch_size: 1
save_ckpt_every: 500
eval_every: 500
max_iter: 1000
This is what is in my test/ar.yml.
- Did you check out my pull request with gh pr checkout 92? In vall_e/train.py there should be a change in load_engines().
- Did you encounter this when running python -m vall_e.train yaml=config/test/ar.yml?
Yes, when I ran "python -m vall_e.train yaml=config/test/ar.yml" I encountered the issue.
I saw your PR 92, but I do not understand where to put the code snippet.
- Open your terminal and navigate to the folder of the repository.
- If you cloned it correctly from git, you should be able to get my version by typing the command gh pr checkout 92.
- If it is still not working, try it with my fork of this repository: kgasenzer/vall-e
This is the new error I am getting: AttributeError: 'PosixPath' object has no attribute 'log_dir'
I think an easier solution would be to install deepspeed==0.8.3.
Please reopen this PR. It's only a matter of time until we need to update the code to support a version of DeepSpeed > 0.8.3.
With DeepSpeed 0.9.4 I get this error:
(venv) $ pip uninstall deepspeed && pip install deepspeed==0.9.4
(venv) $ python -m vall_e.train yaml=ar.yml
File "/vall-e/vall_e/train.py", line 32, in load_engines
dist.init_distributed(dist_backend=dist_backend)
File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 592, in init_distributed
init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 148, in init_deepspeed_backend
rank = int(os.environ["RANK"])
~~~~~~~~~~^^^^^^^^
File "<frozen os>", line 679, in __getitem__
KeyError: 'RANK'
It gets past these errors when specifying the following env vars:
(venv) $ RANK=0 WORLD_SIZE=1 python -m vall_e.train yaml=ar.yml
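The same two variables could also be filled in from Python before the distributed backend is initialized, so that the command works without the prefix. This is only a sketch under that assumption: the function name apply_single_process_defaults is hypothetical, the values are the ones from the command above, and nothing here is taken from the DeepSpeed or vall-e code.

```python
from collections.abc import MutableMapping

# Values copied from the command line above; they describe a single-process run.
SINGLE_PROCESS_DEFAULTS = {"RANK": "0", "WORLD_SIZE": "1"}


def apply_single_process_defaults(env: MutableMapping) -> list:
    """Set RANK and WORLD_SIZE only where they are missing.

    Variables already exported by a launcher are left unchanged.
    Returns the names that were added, in insertion order.
    """
    added = []
    for name, value in SINGLE_PROCESS_DEFAULTS.items():
        if name not in env:
            env[name] = value
            added.append(name)
    return added


# Hypothetical usage, placed before the dist.init_distributed(...) call:
#   import os
#   apply_single_process_defaults(os.environ)
```

Only filling in missing values means a real multi-process launch, where a launcher has already exported these variables, is not overridden.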
DeepSpeed needs to be hacked to be able to get past https://github.com/microsoft/DeepSpeed/issues/826:
File "/vall-e/vall_e/train.py", line 146, in <module>
main()
File "/vall-e/vall_e/train.py", line 137, in main
trainer.train(
File "/vall-e/vall_e/utils/trainer.py", line 125, in train
engines = engines_loader()
^^^^^^^^^^^^^^^^
File "/vall-e/vall_e/train.py", line 35, in load_engines
dist.init_distributed(dist_backend=dist_backend)
File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 615, in init_distributed
mpi_discovery(distributed_port=distributed_port, verbose=verbose)
File "/venv/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 643, in mpi_discovery
result = subprocess.check_output(hostname_cmd, shell=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 466, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['hostname -I']' returned non-zero exit status 64.