axolotl
ModuleNotFoundError: No module named 'mpi4py' using single GPU with deepspeed
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Training mixtral with axolotl
Current behaviour
Training fails with the following error:
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 677, in mpi_discovery
    from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.9/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/root/miniconda3/envs/py3.9/lib/python3.9/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/py3.9/bin/python3', '-m', 'axolotl.cli.train', 'axolotl/examples/mistral/mixtral.yml']' returned non-zero exit status 1.
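For context on why a single-GPU run reaches mpi4py at all: the `mpi_discovery` frame suggests DeepSpeed was trying to obtain rank information through MPI. The sketch below assumes, from my reading of DeepSpeed rather than from anything confirmed in this report, that this fallback is taken when the usual torch.distributed environment variables are not all set. The helper name `missing_dist_env` and the exact variable list are illustrative assumptions.

```python
import os

# Assumption: DeepSpeed skips MPI discovery only when all of these are set.
REQUIRED_ENV = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_dist_env(env=None):
    """Return the names in REQUIRED_ENV that are absent from env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if name not in env]


if __name__ == "__main__":
    missing = missing_dist_env()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All distributed environment variables are set.")
```

Running this in the same shell that calls accelerate launch shows which variables the parent shell lacks. Note that the launcher may export some of them to the child process itself, so an empty parent environment does not by itself prove this is the cause.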
If I try pip install mpi4py, the build fails with this linker error:
/root/miniconda3/envs/py3.9/compiler_compat/ld: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so: undefined reference to `opal_list_t_class'
collect2: error: ld returned 1 exit status
failure.
removing: _configtest.c _configtest.o
error: Cannot link MPI programs. Check your configuration!!!
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for mpi4py
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects
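The log above shows pip building mpi4py from source and the link step failing against the system OpenMPI library through the conda environment's linker. A minimal diagnostic sketch follows, assuming the source build picks up the `mpicc` wrapper found on PATH. The function name `mpi_build_report` is hypothetical, and the sketch only reports state without changing anything.

```python
import importlib.util
import shutil


def mpi_build_report(which=shutil.which, find_spec=importlib.util.find_spec):
    """Report whether mpi4py is importable and which mpicc is on PATH."""
    return {
        # True only if an mpi4py package is visible to this interpreter.
        "mpi4py_importable": find_spec("mpi4py") is not None,
        # Full path of the MPI compiler wrapper, or None if not found.
        "mpicc": which("mpicc"),
    }


if __name__ == "__main__":
    for key, value in mpi_build_report().items():
        print(f"{key}: {value}")
```

Run it with the interpreter from the failing environment (here /root/miniconda3/envs/py3.9/bin/python3) so the report reflects the same PATH and site-packages as the failing run.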
Steps to reproduce
- start a machine with the Dockerfile below
- run axolotl mixtral
accelerate launch -m axolotl.cli.train axolotl/examples/mistral/mixtral.yml
- Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
WORKDIR /
RUN mkdir /workspace
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
ENV DEBIAN_FRONTEND=noninteractive \
    SHELL=/bin/bash
RUN apt-get update --yes && \
    # - apt-get upgrade is run to patch known vulnerabilities in apt-get packages as
    # the ubuntu base image is rebuilt too seldom sometimes (less than once a month)
    apt-get upgrade --yes && \
    apt install --yes --no-install-recommends \
    git \
    wget \
    curl \
    bash \
    software-properties-common \
    openssh-server
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt install python3.10 -y --no-install-recommends && \
    ln -s /usr/bin/python3.10 /usr/bin/python && \
    rm /usr/bin/python3 && \
    ln -s /usr/bin/python3.10 /usr/bin/python3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/* && \
    echo "en_US.UTF-8 UTF-8" > /etc/locale.gen
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python get-pip.py
RUN pip install --no-cache-dir --pre torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --extra-index-url https://download.pytorch.org/whl/nightly/cu118
RUN pip install --no-cache-dir -U jupyterlab ipywidgets jupyter-archive
# RUN jupyter nbextension enable --py widgetsnbextension
RUN jupyter labextension disable "@jupyterlab/apputils-extension:announcements"
ADD start.sh /
RUN chmod +x /start.sh
CMD [ "/start.sh" ]
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
- [ ] Linux
- [ ] macOS
- [ ] Windows
Python Version
Python 3.9.16
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.