Stuck during import tensorrt_llm
System Info
- CPU x86_64
- GPU: L40
- tensorrt_llm: 0.11.0
- CUDA: 12.4
- driver: 535.129.03
- OS: CentOS 7
Who can help?
When I try to import tensorrt_llm, the import gets stuck. Through debugging, I found that it hangs at MpiComm.local_init(), which emits the warning below. Installing in a conda virtual environment on other L40 machines raises the same warning, yet the import completes normally there. I also tested my system's MPI library directly (a standalone check is sketched after the warning output below) and found no problem, so I am not sure why this warning is reported or why the import hangs:
Python 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorrt_llm
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: set-hldy-autovision-fanet-triton08
Device name: mlx5_4
Device vendor ID: 0x02c9
Device vendor part ID: 4125
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: set-hldy-autovision-fanet-triton08
Local device: mlx5_4
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
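For reference, the standalone MPI check I used is roughly the following (a minimal sketch; it assumes mpi4py is installed and is the binding behind MpiComm.local_init()):

$ python -c "from mpi4py import MPI; print('rank', MPI.COMM_WORLD.Get_rank(), 'of', MPI.COMM_WORLD.Get_size())"

This completes without any problem on the affected machine, so MPI itself seems to initialize fine; only the tensorrt_llm import hangs.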
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
import tensorrt_llm
Expected behavior
tensorrt_llm imports correctly
Actual behavior
stuck during import tensorrt_llm
Additional notes
none
Same issue, how do you solve it?
Through logging, I found that the system's librt.so, libm.so, etc. could not be found, even though a find search showed they were under /lib64. Therefore, I added /lib64 to LD_LIBRARY_PATH and recompiled and reinstalled mpi4py (this assumes the openmpi library is already installed).
Suggestion: It is best to install through conda to avoid many errors.
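A rough sketch of the steps described above (the /lib64 path comes from the find search; the pip flags are my assumption about how to force a source rebuild, so adjust for your system):

$ export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH
$ pip install --force-reinstall --no-binary=mpi4py mpi4py

With openmpi already installed and its mpicc on PATH, this rebuilds mpi4py from source against the system MPI instead of reusing a prebuilt wheel.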
I ran into the hang as well. I discovered that:
official docker:
$ python -c "import mpi4py; print(mpi4py.__version__)"
4.0.0
my setup:
$ python -c "import mpi4py; print(mpi4py.__version__)"
3.1.4
so clearly they aren't binary compatible
this fixed the hanging:
conda install -c conda-forge mpi4py openmpi
from: https://mpi4py.readthedocs.io/en/latest/install.html#using-conda
now I had the same 4.0.0 version as the docker image.
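If you want to confirm that the runtime MPI library matches as well (not just the mpi4py package version), something along these lines can help (a sketch; Get_library_version is part of mpi4py's standard MPI API):

$ mpirun --version
$ python -c "from mpi4py import MPI; print(MPI.Get_library_version())"

Running both inside the official docker image and in your own environment and comparing the output makes a mismatch like the one above easy to spot.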
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Given the extended period of inactivity, I'm closing this issue with the expectation that the above comments may have addressed the issue. If you're still experiencing this issue, please feel free to open a new issue referencing this one. Thank you.