
Is MPI required even when multi-device is disabled?

Open · jlewi opened this issue 1 year ago · 5 comments

System Info

  • CPU x86_64

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

I'm trying to build the wheel as follows:

python3 ../tensorrt_llm/scripts/build_wheel.py --trt_root ${TRT_ROOT} -D "CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.3/" -D "ENABLE_MULTI_DEVICE=0"

I end up with a linking error because MPI is missing.

[100%] Building CXX object tensorrt_llm/executor_worker/CMakeFiles/executorWorker.dir/executorWorker.cpp.o
[100%] Linking CXX executable executorWorker
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `ompi_mpi_char'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Wait'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Mrecv'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `ompi_mpi_uint64_t'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Comm_spawn'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Get_count'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `tensorrt_llm::mpi::MpiComm::MpiComm(ompi_communicator_t*, bool)'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `ompi_mpi_comm_self'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Info_set'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `ompi_mpi_comm_world'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `tensorrt_llm::mpi::MpiComm::mprobe(int, int, ompi_message_t**, ompi_status_public_t*) const'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Info_create'
/usr/lib/gcc/x86_64-pc-linux-gnu/12.4.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../libtensorrt_llm.so: undefined reference to `MPI_Barrier'
collect2: error: ld returned 1 exit status
make[3]: *** [tensorrt_llm/executor_worker/CMakeFiles/executorWorker.dir/build.make:112: tensorrt_llm/executor_worker/executorWorker] Error 1
make[2]: *** [CMakeFiles/Makefile2:1192: tensorrt_llm/executor_worker/CMakeFiles/executorWorker.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:1199: tensorrt_llm/executor_worker/CMakeFiles/executorWorker.dir/rule] Error 2
make: *** [Makefile:335: executorWorker] Error 2
Traceback (most recent call last):
  File "/home/build/backend/build/../tensorrt_llm/scripts/build_wheel.py", line 352, in <module>
    main(**vars(args))
  File "/home/build/backend/build/../tensorrt_llm/scripts/build_wheel.py", line 166, in main
    build_run(
  File "/usr/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,

I don't have MPI installed, which is why I was trying to disable multi-device support.

Expected behavior

I expect this to compile without MPI. My assumption was that MPI is only required for multi-device support, but that assumption could be incorrect. I was hoping to compile for a single device without needing MPI. Is MPI needed even for single-device builds?

Actual behavior

I got the same linker error shown in the Reproduction section above: undefined references to MPI symbols (ompi_mpi_*, MPI_*) when linking executorWorker against libtensorrt_llm.so, followed by the build_wheel.py traceback.

Additional notes

I also had to remove mpi4py from requirements.txt to try to get it to build without multi-device support:

  # N.B. Hack: remove mpi4py from the requirements because we don't have MPI libraries.
  # Hopefully it is only needed for multi-device support.
  sed -i '/mpi4py/d' ../tensorrt_llm/requirements.txt

jlewi avatar Jul 16 '24 21:07 jlewi

@Funatiq Could you please have a look? Thanks

QiJune avatar Jul 17 '24 03:07 QiJune

Could you try the following option to build_wheel.py?

--extra-cmake-vars ENABLE_MULTI_DEVICE=0

achartier avatar Jul 18 '24 20:07 achartier
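
For reference, a rough sketch of the full command from the reproduction with that option substituted in (assuming the same TRT_ROOT and CUDA 12.3 paths as above; the exact flag spelling can be confirmed with build_wheel.py --help):

python3 ../tensorrt_llm/scripts/build_wheel.py --trt_root ${TRT_ROOT} -D "CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-12.3/" --extra-cmake-vars ENABLE_MULTI_DEVICE=0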

I'm trying to build it now with OpenMPI. It takes such a long time to build that, if the OpenMPI build succeeds, I may not bother rerunning the experiment with multi-device disabled.

jlewi avatar Jul 19 '24 17:07 jlewi

Fair enough. If building for a specific target architecture, -a native can provide a significant build time reduction.

achartier avatar Jul 19 '24 18:07 achartier
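
A sketch of how that flag might be combined with the earlier suggestion (hypothetical invocation; "native" restricts compilation to the GPU architecture of the build machine, so fewer kernels are built):

python3 ../tensorrt_llm/scripts/build_wheel.py --trt_root ${TRT_ROOT} -a native --extra-cmake-vars ENABLE_MULTI_DEVICE=0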

It's a bug; I also have this issue.

liuweijie19980216 avatar Sep 11 '24 14:09 liuweijie19980216

The issue has disappeared in the latest main branch.

Shixiaowei02 avatar Mar 06 '25 03:03 Shixiaowei02