
failed to use TensorRT-LLM/examples/apps/fastapi_server.py

AGI-player opened this issue 1 year ago • 6 comments

Running inference with TensorRT-LLM/examples/run.py works fine:

mpirun -n 4 -allow-run-as-root python3 /load/trt_llm/TensorRT-LLM/examples/run.py \
    --input_text "hello,who are you?" \
    --max_output_len=50 \
    --tokenizer_dir /load/Qwen1.5-32B-Chat/ \
    --engine_dir=/load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu/

(screenshot omitted)

But TensorRT-LLM/examples/apps/fastapi_server.py fails:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m apps.fastapi_server \
    /load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu/ \
    /load/Qwen1.5-32B-Chat/ \
    --port 9001 --tp_size 4

I modified the code to pass the tokenizer_dir. (screenshot omitted)

After sending curl http://0.0.0.0:9001/generate -d '{"prompt": "hello,who are you?"}', there is no response and GPU memory usage is nearly 0. (screenshot omitted)
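For reference, a minimal client-side sketch (an editorial addition, not part of the original report; it only assumes the server address used above) that sends the same request with an explicit timeout, so a hung backend fails fast instead of blocking curl indefinitely:

```python
# Probe the /generate endpoint with a timeout; a hang here while GPU memory
# stays near zero points at the executor workers rather than the HTTP layer.
import requests

try:
    resp = requests.post(
        "http://0.0.0.0:9001/generate",
        json={"prompt": "hello, who are you?"},
        timeout=60,  # seconds
    )
    resp.raise_for_status()
    print(resp.text)
except requests.exceptions.Timeout:
    print("No response within 60 s; the backend likely never started its workers.")
```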

AGI-player, Jul 24 '24

Hi @AGI-player, what's the tensorrt_llm version? Could you please try the main branch?

QiJune, Jul 25 '24

> Hi @AGI-player, what's the tensorrt_llm version? Could you please try the main branch?

apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs
pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com/
python3 -c "import tensorrt_llm"
# prints: [TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024071600

pip list:

Package Version Editable project location


absl-py 2.1.0 accelerate 0.33.0 aiofiles 23.2.1 aiohappyeyeballs 2.3.2 aiohttp 3.10.0b1 aiohttp-sse-client 0.2.1 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.7.0 anyio 4.4.0 async-timeout 4.0.3 attrs 23.2.0 build 1.2.1 certifi 2024.7.4 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 colorama 0.4.6 colored 2.2.4 coloredlogs 15.0.1 contourpy 1.2.1 cuda-python 12.5.0 cycler 0.12.1 datasets 2.16.1 diffusers 0.29.2 dill 0.3.7 distro 1.9.0 dnspython 2.6.1 einops 0.8.0 email_validator 2.2.0 evaluate 0.4.2 exceptiongroup 1.2.2 fastapi 0.111.1 fastapi-cli 0.0.4 ffmpy 0.3.2 filelock 3.15.4 fonttools 4.53.1 frozenlist 1.4.1 fschat 0.2.36 /load/FastChat fsspec 2023.10.0 gradio 4.36.0 gradio_client 1.0.1 h11 0.14.0 h5py 3.10.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.24.1 humanfriendly 10.0 idna 3.7 importlib_metadata 8.1.0 importlib_resources 6.4.0 janus 1.0.0 Jinja2 3.1.4 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 lark 1.1.9 latex2mathml 3.77.0 Markdown 3.6 markdown-it-py 3.0.0 markdown2 2.5.0 MarkupSafe 2.1.5 matplotlib 3.9.1 mdtex2html 1.3.0 mdurl 0.1.2 mpi4py 3.1.6 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.15 networkx 3.3 nh3 0.2.18 ninja 1.11.1.1 nltk 3.8.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-modelopt 0.13.1 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.5.82 nvidia-nvtx-cu12 12.1.105 onnx 1.16.1 openai 1.37.0 optimum 1.21.2 orjson 3.10.6 packaging 24.1 pandas 2.2.2 pillow 10.3.0 pip 22.0.2 polygraphy 0.49.9 prompt_toolkit 3.0.47 protobuf 5.28.0rc1 psutil 6.0.0 PuLP 2.9.0 pyarrow 17.0.0 pyarrow-hotfix 0.6 pydantic 2.8.2 pydantic_core 2.20.1 pydub 0.25.1 Pygments 2.18.0 pynvml 11.5.3 pyparsing 3.1.2 pyproject_hooks 1.1.0 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.2rc1 referencing 0.35.1 regex 2024.5.15 requests 2.32.3 rich 13.7.1 rouge_score 0.1.2 rpds-py 0.19.0 ruff 0.5.4 safetensors 0.4.3 scipy 1.14.0 semantic-version 2.10.0 sentencepiece 0.1.99 setuptools 59.6.0 shellingham 1.5.4 shortuuid 1.0.13 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.2 starlette 0.37.2 StrEnum 0.4.15 svgwrite 1.4.3 sympy 1.13.1 tensorrt-cu12 10.1.0 tensorrt-cu12-bindings 10.1.0 tensorrt-cu12-libs 10.1.0 tensorrt-llm 0.12.0.dev2024071600 tiktoken 0.7.0 tokenizers 0.19.1 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.3.1 tqdm 4.66.4 transformers 4.42.4 transformers-stream-generator 0.0.5 triton 2.3.1 typer 0.12.3 typing_extensions 4.12.2 tzdata 2024.1 urllib3 2.2.2 uvicorn 0.30.3 uvloop 0.19.0 watchfiles 0.22.0 wavedrom 2.0.3.post3 wcwidth 0.2.13 websockets 11.0.3 wheel 0.43.0 xxhash 3.4.1 yarl 1.9.4 zipp 3.19.2

AGI-player, Jul 25 '24

> Hi @AGI-player, what's the tensorrt_llm version? Could you please try the main branch?

Hello @QiJune, following https://github.com/mpi4py/mpi4py/discussions/491, I tried:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m mpi4py.futures /load/trt_llm/TensorRT-LLM/examples/apps/fastapi_server.py \
    /load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu/ \
    /load/Qwen1.5-32B-Chat/ \
    --port 9001 --tp_size 4

After sending curl http://0.0.0.0:9001/generate -d '{"prompt": "hello,who are you?"}', I get the errors below. It seems the MPI world size is wrong; any solutions?

INFO:datasets:PyTorch version 2.3.1 available. DEBUG:h5py._conv:Creating converter from 7 to 5 DEBUG:h5py._conv:Creating converter from 5 to 7 DEBUG:h5py._conv:Creating converter from 7 to 5 DEBUG:h5py._conv:Creating converter from 5 to 7 [TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024071600 INFO:root:Starting server at 0.0.0.0:9001 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. DEBUG:asyncio:Using selector: EpollSelector INFO: Started server process [1437657] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:9001 (Press CTRL+C to quit) [TensorRT-LLM][INFO] Engine version 0.12.0.dev2024071600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.12.0.dev2024071600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.12.0.dev2024071600 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][INFO] Engine version 0.12.0.dev2024071600 found in the config file, assuming engine(s) built by new builder API. INFO: 127.0.0.1:43504 - "POST /generate HTTP/1.1" 500 Internal Server Error Traceback (most recent call last): File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/hlapi/utils.py", line 28, in wrapper return func(args, **kwargs) File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/executor.py", line 686, in workers_main raise e File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/executor.py", line 683, in workers_main executor = ExecutorBindingsWorker(engine_dir, executor_config) File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/executor.py", line 425, in init self.engine = tllm.Executor(engine_dir, RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tppp when participantIds are not specified (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/llm/cpp/tensorrt_llm/executor/executorImpl.cpp:461) 1 0x7f5098d0fdc1 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82 2 0x7f5098d3fa8d /env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x75ca8d) [0x7f5098d3fa8d] 3 0x7f509aa4b408 tensorrt_llm::executor::Executor::Impl::initializeCommAndWorkers(int, int, tensorrt_llm::executor::ExecutorConfig const&, std::optional<tensorrt_llm::executor::ModelType>, std::optionalstd::filesystem::path const&) + 1048 4 0x7f509aa4b9d3 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optionalstd::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1299 5 0x7f509aa40680 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64 6 0x7f5113e66e22 /env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb6e22) [0x7f5113e66e22] 7 0x7f5113e0949c /env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x5949c) [0x7f5113e0949c] 8 0x55d225efec9e python3(+0x15ac9e) [0x55d225efec9e] 9 0x55d225ef53cb _PyObject_MakeTpCall + 603 10 0x55d225f0d3eb python3(+0x1693eb) [0x55d225f0d3eb] 11 0x55d225f0df58 _PyObject_Call + 280 12 0x55d225f09c87 python3(+0x165c87) [0x55d225f09c87] 13 0x55d225ef577b python3(+0x15177b) [0x55d225ef577b] 
14 0x7f5113e08abb /env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x58abb) [0x7f5113e08abb] 15 0x55d225ef53cb _PyObject_MakeTpCall + 603 16 0x55d225eee63b _PyEval_EvalFrameDefault + 29931 17 0x55d225ef4564 _PyObject_FastCallDictTstate + 196 18 0x55d225f09664 python3(+0x165664) [0x55d225f09664] 19 0x55d225ef536c _PyObject_MakeTpCall + 508 20 0x55d225eed96b _PyEval_EvalFrameDefault + 26651 21 0x55d225eff59c _PyFunction_Vectorcall + 124 22 0x55d225f0ddb2 PyObject_Call + 290 23 0x55d225f0dd4b PyObject_Call + 187 24 0x55d225ee9a9d _PyEval_EvalFrameDefault + 10573 25 0x55d225eff59c _PyFunction_Vectorcall + 124 26 0x55d225f0ddb2 PyObject_Call + 290 27 0x55d225ee9a9d _PyEval_EvalFrameDefault + 10573 28 0x55d225eff59c _PyFunction_Vectorcall + 124 29 0x55d225ee9a9d _PyEval_EvalFrameDefault + 10573 30 0x55d225eff59c _PyFunction_Vectorcall + 124 31 0x55d225ee796e _PyEval_EvalFrameDefault + 2078 32 0x55d225eff59c _PyFunction_Vectorcall + 124 33 0x55d225ee796e _PyEval_EvalFrameDefault + 2078 34 0x55d225f0d371 python3(+0x169371) [0x55d225f0d371] 35 0x55d22603657a python3(+0x29257a) [0x55d22603657a] 36 0x55d22602b978 python3(+0x287978) [0x55d22602b978] 37 0x7f52db3f0ac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f52db3f0ac3] 38 0x7f52db482850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f52db482850] ERROR: Exception in ASGI application Traceback (most recent call last): File "/env/trt_llm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi result = await app( # type: ignore[func-returns-value] File "/env/trt_llm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in call return await self.app(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in call await super().call(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/applications.py", line 123, in call await self.middleware_stack(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in call raise exc File "/env/trt_llm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in call await self.app(scope, receive, _send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in call await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/env/trt_llm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File "/env/trt_llm/lib/python3.10/site-packages/starlette/routing.py", line 756, in call await self.middleware_stack(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/routing.py", line 776, in app await route.handle(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle await self.app(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/routing.py", line 77, in app await wrap_app_handling_exceptions(app, request)(scope, receive, send) File "/env/trt_llm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app raise exc File "/env/trt_llm/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app await app(scope, receive, sender) File 
"/env/trt_llm/lib/python3.10/site-packages/starlette/routing.py", line 72, in app response = await func(request) File "/env/trt_llm/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app raw_response = await run_endpoint_function( File "/env/trt_llm/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function return await dependant.call(**values) File "/load/trt_llm/TensorRT-LLM/examples/apps/fastapi_server.py", line 57, in generate promise = self.llm.generate_async(prompt, File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/hlapi/llm.py", line 195, in generate_async result = self._executor.generate_async( File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/executor.py", line 311, in generate_async result = self.submit( File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/executor.py", line 755, in submit self.start() File "/env/trt_llm/lib/python3.10/site-packages/tensorrt_llm/executor.py", line 729, in start raise RuntimeError("worker initialization failed") RuntimeError: worker initialization failed

AGI-player, Jul 25 '24

Hi @byshiue, can this problem be solved? Any solutions?

AGI-player, Aug 02 '24

I noticed that you passed an engine into the server; it seems that the tp_size used when building the engine differs from the settings passed to the server:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m mpi4py.futures /load/trt_llm/TensorRT-LLM/examples/apps/fastapi_server.py /load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu/ /load/Qwen1.5-32B-Chat/ --port 9001 --tp_size 4

Can you try to build the engine with tp_size of 4, just to align this with the server arguments above?

@AGI-player

Superjomn, Aug 14 '24

> I noticed that you passed an engine into the server; it seems that the tp_size used when building the engine differs from the settings passed to the server:
>
> CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m mpi4py.futures /load/trt_llm/TensorRT-LLM/examples/apps/fastapi_server.py /load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu/ /load/Qwen1.5-32B-Chat/ --port 9001 --tp_size 4
>
> Can you try to build the engine with tp_size of 4, just to align this with the server arguments above?
>
> @AGI-player

Actually, the engine was built with a tp_size of 4...

AGI-player, Aug 15 '24

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot], Sep 16 '24

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot], Oct 01 '24

> RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified

This error says the runtime world size does not match the engine's; it could be a version mismatch between the engine-building phase and the serving phase.

You can try to build the engine and run the example with the latest TensorRT-LLM version.
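A hedged way to double-check that mismatch (an editorial sketch, not part of the reply; the config.json key layout can differ between TensorRT-LLM versions): read the tp/pp sizes recorded next to the engine and compare their product with the world size the server is launched with.

```python
# Read the parallelism recorded in the engine directory's config.json
# (layout as written by the newer trtllm-build flow; adjust keys if needed).
import json
from pathlib import Path

engine_dir = Path("/load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu")
cfg = json.loads((engine_dir / "config.json").read_text())

mapping = cfg.get("pretrained_config", {}).get("mapping", {})
tp = mapping.get("tp_size", 1)
pp = mapping.get("pp_size", 1)
print(f"engine built with tp_size={tp}, pp_size={pp}; expected MPI world size = {tp * pp}")
```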

Superjomn, Nov 04 '24

> RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: With communicationMode kLEADER, MPI worldSize is expected to be equal to tp*pp when participantIds are not specified
>
> This error says the runtime world size does not match the engine's; it could be a version mismatch between the engine-building phase and the serving phase.
>
> You can try to build the engine and run the example with the latest TensorRT-LLM version.

@Superjomn I tried the latest version (0.15.0.dev2024102900) with Qwen1.5-14B-Chat, but it failed to build engines.

CUDA_VISIBLE_DEVICES=6,7 python3 /load/trt_llm/v0.15/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /load/Qwen1.5-14B-Chat/ \
    --output_dir /load/output/trt_llm/Qwen1.5-14B-Chat/f16_sq0.5_2gpu/ckpts \
    --dtype float16 --smoothquant 0.5 --per_channel --per_token \
    --calib_dataset /load/data/train.json --tp_size 2

CUDA_VISIBLE_DEVICES=6,7 trtllm-build \
    --checkpoint_dir /load/output/trt_llm/Qwen1.5-14B-Chat/f16_sq0.5_2gpu/ckpts \
    --output_dir /load/output/trt_llm/Qwen1.5-14B-Chat/f16_sq0.5_2gpu/engines \
    --max_batch_size 128 --max_input_len 8192 --max_seq_len 8192 \
    --gemm_plugin float16 --gpt_attention_plugin float16 \
    --paged_kv_cache enable --remove_input_padding enable --context_fmha enable \
    --multiple_profiles enable --max_num_tokens 49152 --workers 2

The engine build failed with the following errors:

[TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024102900 [11/04/2024-16:33:54] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead. [11/04/2024-16:33:54] [TRT-LLM] [I] Set bert_attention_plugin to auto. [11/04/2024-16:33:54] [TRT-LLM] [I] Set gpt_attention_plugin to float16. [11/04/2024-16:33:54] [TRT-LLM] [I] Set gemm_plugin to float16. [11/04/2024-16:33:54] [TRT-LLM] [I] Set gemm_swiglu_plugin to None. [11/04/2024-16:33:54] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None. [11/04/2024-16:33:54] [TRT-LLM] [I] Set nccl_plugin to auto. [11/04/2024-16:33:54] [TRT-LLM] [I] Set lora_plugin to None. [11/04/2024-16:33:54] [TRT-LLM] [I] Set moe_plugin to auto. [11/04/2024-16:33:54] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto. [11/04/2024-16:33:54] [TRT-LLM] [I] Set low_latency_gemm_plugin to None. [11/04/2024-16:33:54] [TRT-LLM] [I] Set context_fmha to True. [11/04/2024-16:33:54] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False. [11/04/2024-16:33:54] [TRT-LLM] [I] Set remove_input_padding to True. [11/04/2024-16:33:54] [TRT-LLM] [I] Set reduce_fusion to False. [11/04/2024-16:33:54] [TRT-LLM] [I] Set enable_xqa to True. [11/04/2024-16:33:54] [TRT-LLM] [I] Set tokens_per_block to 64. [11/04/2024-16:33:54] [TRT-LLM] [I] Set use_paged_context_fmha to False. [11/04/2024-16:33:54] [TRT-LLM] [I] Set use_fp8_context_fmha to False. [11/04/2024-16:33:54] [TRT-LLM] [I] Set multiple_profiles to True. [11/04/2024-16:33:54] [TRT-LLM] [I] Set paged_state to True. [11/04/2024-16:33:54] [TRT-LLM] [I] Set streamingllm to False. [11/04/2024-16:33:54] [TRT-LLM] [I] Set use_fused_mlp to True. [11/04/2024-16:33:54] [TRT-LLM] [I] Set pp_reduce_scatter to False. [11/04/2024-16:33:54] [TRT-LLM] [I] Set paged_kv_cache to True. [11/04/2024-16:33:54] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2 [11/04/2024-16:33:54] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0 [11/04/2024-16:33:54] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0 [11/04/2024-16:33:54] [TRT-LLM] [I] Compute capability: (8, 0) [11/04/2024-16:33:54] [TRT-LLM] [I] SM count: 108 [11/04/2024-16:33:54] [TRT-LLM] [I] SM clock: 1410 MHz [11/04/2024-16:33:54] [TRT-LLM] [I] int4 TFLOPS: 1247 [11/04/2024-16:33:54] [TRT-LLM] [I] int8 TFLOPS: 623 [11/04/2024-16:33:54] [TRT-LLM] [I] fp8 TFLOPS: 0 [11/04/2024-16:33:54] [TRT-LLM] [I] float16 TFLOPS: 311 [11/04/2024-16:33:54] [TRT-LLM] [I] bfloat16 TFLOPS: 311 [11/04/2024-16:33:54] [TRT-LLM] [I] float32 TFLOPS: 155 [11/04/2024-16:33:54] [TRT-LLM] [I] Total Memory: 80 GiB [11/04/2024-16:33:54] [TRT-LLM] [I] Memory clock: 1593 MHz [11/04/2024-16:33:54] [TRT-LLM] [I] Memory bus width: 5120 [11/04/2024-16:33:54] [TRT-LLM] [I] Memory bandwidth: 2039 GB/s [11/04/2024-16:33:54] [TRT-LLM] [I] NVLink is active: True [11/04/2024-16:33:54] [TRT-LLM] [I] NVLink version: 3 [11/04/2024-16:33:54] [TRT-LLM] [I] NVLink bandwidth: 300 GB/s [TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024102900 [TensorRT-LLM] TensorRT-LLM version: 0.15.0.dev2024102900 [11/04/2024-16:33:58] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT concurrent.futures.process._RemoteTraceback: """ Traceback (most recent call last): File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 657, in load param.value = weights[name] File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/parameter.py", line 201, in value assert v.shape == self.shape,
AssertionError: The value updated is not the same shape as the original. Updated: (2560, 13696), original: (5120, 6848)

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/root/miniforge3/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker r = call_item.fn(*call_item.args, **call_item.kwargs) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save engine = build_model(build_config, File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 360, in build_model model = model_cls.from_checkpoint(ckpt_dir, config=rank_config) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 632, in from_checkpoint model.load(weights, from_pruned=is_checkpoint_pruned) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 659, in load raise RuntimeError( RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (2560, 13696), original: (5120, 6848)' for parameter 'transformer.layers.0.mlp.proj.weight' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 440, in parallel_build future.result() File "/root/miniforge3/lib/python3.10/concurrent/futures/_base.py", line 451, in result return self.__get_result() File "/root/miniforge3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (2560, 13696), original: (5120, 6848)' for parameter 'transformer.layers.0.mlp.proj.weight' [11/04/2024-16:33:58] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to DataType.FLOAT concurrent.futures.process._RemoteTraceback: """ Traceback (most recent call last): File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 657, in load param.value = weights[name] File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/parameter.py", line 201, in value assert v.shape == self.shape,
AssertionError: The value updated is not the same shape as the original. Updated: (2560, 13696), original: (5120, 6848)

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/root/miniforge3/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker r = call_item.fn(*call_item.args, **call_item.kwargs) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save engine = build_model(build_config, File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 360, in build_model model = model_cls.from_checkpoint(ckpt_dir, config=rank_config) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 632, in from_checkpoint model.load(weights, from_pruned=is_checkpoint_pruned) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 659, in load raise RuntimeError( RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (2560, 13696), original: (5120, 6848)' for parameter 'transformer.layers.0.mlp.proj.weight' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 440, in parallel_build future.result() File "/root/miniforge3/lib/python3.10/concurrent/futures/_base.py", line 451, in result return self.__get_result() File "/root/miniforge3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result raise self._exception RuntimeError: Encounter error 'The value updated is not the same shape as the original. Updated: (2560, 13696), original: (5120, 6848)' for parameter 'transformer.layers.0.mlp.proj.weight' Traceback (most recent call last): File "/env/trt_llm_15/bin/trtllm-build", line 8, in sys.exit(main()) File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 602, in main parallel_build(model_config, ckpt_dir, build_config, args.output_dir, File "/env/trt_llm_15/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 444, in parallel_build assert len(exceptions AssertionError: Engine building failed, please check error log.

AGI-player, Nov 04 '24

@AGI-player could you please try to use the latest main branch?

hello-11, Nov 21 '24