
stack during import tensorrt_llm

Open Howe-Young opened this issue 1 year ago • 3 comments

System Info

  • CPU x86_64
  • GPU: L40
  • tensorrt_llm: 0.11.0
  • CUDA: 12.4
  • driver: 535.129.03
  • OS: CentOS 7

Who can help?

When I tried to import tensorrt_llm, it got stuck. Through debugging, I found that it hangs at MpiComm.local_init(), which prints the warning below. (Installing in a conda virtual environment on other L40 machines raises the same warning, yet the import completes normally there.) I tested my system's MPI library and found no problem, so I am not sure why this warning is reported:

Python 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorrt_llm
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            set-hldy-autovision-fanet-triton08
  Device name:           mlx5_4
  Device vendor ID:      0x02c9
  Device vendor part ID: 4125

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           set-hldy-autovision-fanet-triton08
  Local device:         mlx5_4
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
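To narrow down whether the import itself is hanging (rather than just printing the warning slowly), one option is to run the import in a fresh interpreter with a timeout. This is a hedged sketch, not something from the thread; `import_hangs` and the timeout value are hypothetical:

```python
import subprocess
import sys

def import_hangs(module: str, timeout: float = 10.0) -> bool:
    """Return True if importing `module` in a fresh interpreter
    does not finish within `timeout` seconds."""
    try:
        subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            timeout=timeout,
            check=False,
            capture_output=True,
        )
        # Import finished (possibly with an ImportError, but no hang).
        return False
    except subprocess.TimeoutExpired:
        # Still running after `timeout` seconds: likely stuck, e.g. in MPI init.
        return True

if __name__ == "__main__":
    # A stdlib module imports instantly; tensorrt_llm would be the suspect here.
    print(import_hangs("json"))
```

Running this with `tensorrt_llm` in place of `json` separates a genuine hang from a slow-but-successful import.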

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

import tensorrt_llm

Expected behavior

import correctly

actual behavior

stuck during import tensorrt_llm

additional notes

none

Howe-Young avatar Aug 30 '24 12:08 Howe-Young

Same issue, how do you solve it?

huweim avatar Sep 18 '24 01:09 huweim

Same issue, how do you solve it?

Through logging, I found that the system's librt.so, libm.so, etc. could not be found. However, a find search showed they were under the /lib64 path, so I added /lib64 to LD_LIBRARY_PATH and recompiled and reinstalled mpi4py (provided that the openmpi library has already been installed).

Suggestion: It is best to install through conda to avoid many errors.
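The missing-library diagnosis above can be checked from Python with `ctypes.util.find_library`, which performs a loader-style lookup (the exact search order is platform-dependent, so treat this as a rough sanity check rather than an exact mirror of LD_LIBRARY_PATH). `resolve_libs` is a hypothetical helper:

```python
import ctypes.util

def resolve_libs(names=("rt", "m", "c")):
    """Map short library names to what the lookup finds,
    e.g. 'rt' -> 'librt.so.1', or None if unresolved."""
    return {name: ctypes.util.find_library(name) for name in names}

if __name__ == "__main__":
    for name, path in resolve_libs().items():
        status = path if path else "NOT FOUND -- check LD_LIBRARY_PATH / ldconfig"
        print(f"lib{name}: {status}")
```

A `None` result for librt or libm would point at the same loader misconfiguration described in this comment.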

Howe-Young avatar Sep 19 '24 03:09 Howe-Young

I run into hanging as well. Discovered that:

official docker:

$ python -c "import mpi4py; print(mpi4py.__version__)"
4.0.0

my setup:

$ python -c "import mpi4py; print(mpi4py.__version__)"
3.1.4

so clearly they aren't binary compatible

this fixed the hanging:

conda install -c conda-forge mpi4py openmpi

from: https://mpi4py.readthedocs.io/en/latest/install.html#using-conda

now I had the same 4.0.0 version as the docker image.
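A version mismatch like the one above can be caught with a quick check before anything hangs. This is a sketch with hypothetical helper names; the 4.0.0 baseline is simply the version the official Docker image shipped in this thread, and the check uses major versions since mpi4py 3.x and 4.x builds are not interchangeable:

```python
def major(ver: str) -> int:
    """Extract the major component from a dotted version string."""
    return int(ver.split(".")[0])

def compatible_with_reference(local_ver: str, reference_ver: str = "4.0.0") -> bool:
    """True if the local mpi4py major version matches the reference build."""
    return major(local_ver) == major(reference_ver)

if __name__ == "__main__":
    # Values from the thread: 3.1.4 locally vs 4.0.0 in the docker image.
    print(compatible_with_reference("3.1.4"))  # major versions differ
    print(compatible_with_reference("4.0.0"))
```

In practice the local version would come from `mpi4py.__version__`, as in the `python -c` commands above.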

stas00 avatar Oct 04 '24 02:10 stas00

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar Nov 14 '24 02:11 github-actions[bot]

Given the extended period of inactivity, I'm closing this issue with the expectation that the above comments may have addressed it. If you're still experiencing this issue, please feel free to open a new issue referencing this one. Thank you.

karljang avatar Aug 26 '25 20:08 karljang