llm-foundry
Fused Cross Entropy is not installed. Either (1) have a CUDA-compatible GPU and `pip install .[gpu]`, or (2) set your config model.loss_fn=torch_crossentropy.
Initializing model...
Traceback (most recent call last):
File "/content/llm-foundry/llmfoundry/models/mpt/modeling_mpt.py", line 619, in init
from flash_attn.losses.cross_entropy import CrossEntropyLoss as FusedCrossEntropyLoss # type: ignore # isort: skip
File "/usr/local/lib/python3.10/dist-packages/flash_attn/losses/cross_entropy.py", line 9, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/llm-foundry/scripts/train/train.py", line 254, in pip install .[gpu], or (2) set your config model.loss_fn=torch_crossentropy.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 13871) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 13871) exited with code 1
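For reference, option (2) from the error message can be passed as a command-line override when launching training, so you can keep going without the fused kernel. A minimal sketch, assuming the stock MPT-125M pretraining YAML that ships with the repo (the YAML path is illustrative; substitute your own config):

# Fall back to the plain torch cross-entropy loss instead of the fused kernel
composer scripts/train/train.py scripts/train/yamls/pretrain/mpt-125m.yaml \
  model.loss_fn=torch_crossentropy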
It looks like you didn't install all the requirements (pip install .[gpu] from the top-level dir).
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.6.0 requires torch==2.0.0, but you have torch 1.13.1 which is incompatible.
torchaudio 2.0.1+cu118 requires torch==2.0.0, but you have torch 1.13.1 which is incompatible.
It installs the old version of torch.
pip install .[gpu] is the command I ran.
Yes, the current setup depends on torch==1.13.1. torch 2 has some issues, but we will upgrade once those are resolved. (Note: people have run our repo with torch 2, but it does disable some of the features.)
I'd recommend creating a new venv (python -m venv venv) and installing the requirements from scratch (pip install .[gpu])
(you might need to install torch first: pip install torch==1.13.1)
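Putting those steps together, a minimal sketch of the clean-environment install (run from the top-level llm-foundry directory; the torch pin follows the note above):

# Create and activate a fresh venv
python -m venv venv
source venv/bin/activate
# Install the pinned torch first, then the repo with the GPU extras
pip install torch==1.13.1
pip install ".[gpu]"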
I am seeing the exact same issue, and I did install with pip install .[gpu]. My torch version is also 1.13.1, and I have a venv specifically for this project.
I guess the problem is the CUDA version, not the PyTorch version.
I'm having the same issue. I'm using the docker image [2.0.1_cu117-python3.10-ubuntu20.04-aws] and don't use a venv. I can run hf_generate.py but I can't run train.py.
Use the recommended docker image.
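For anyone preferring Docker over a venv, a minimal sketch of that workflow. The image tag below is an assumption based on the torch==1.13.1 pin mentioned above; check the llm-foundry README for the image that is actually recommended:

# Tag is illustrative -- see the README for the recommended image
docker run -it --gpus all mosaicml/pytorch:1.13.1_cu117-python3.10-ubuntu20.04 bash
# Inside the container:
git clone https://github.com/mosaicml/llm-foundry.git
cd llm-foundry
pip install ".[gpu]"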
Thanks, my mistake.