cortex.cpp

bug: TensorRT-LLM error

Open mafischer opened this issue 1 year ago • 2 comments

Cortex version

0.5.1-rc2

Describe the Bug

cortex-beta run openhermes-2.5-7b-tensorrt-llm-linux-ada fails with the logs below.

Steps to Reproduce

  1. cortex-beta run openhermes-2.5-7b-tensorrt-llm-linux-ada

Screenshots / Logs

20240923 19:58:25.229834 UTC 8237 DEBUG [LoadModel] Reset all resources and states before loading new model - tensorrt-llm_engine.cc:380
20240923 19:58:25.229878 UTC 8237 INFO Reset all resources and states - tensorrt-llm_engine.cc:616
20240923 19:58:25.229884 UTC 8237 DEBUG [LoadModel] n_parallel: 1, batch_size: 16 - tensorrt-llm_engine.cc:388
[TensorRT-LLM][INFO] Set logger level by INFO
20240923 19:58:25.276219 UTC 8237 INFO Successully loaded the tokenizer - tensorrt-llm_engine.h:105
20240923 19:58:25.276238 UTC 8237 INFO Loaded tokenizer from /home/ubuntu/cortexcpp-beta/models/openhermes-2.5-7b-tensorrt-llm-linux-ada/tokenizer.model - tensorrt-llm_engine.cc:414
[TensorRT-LLM][INFO] Engine version 0.11.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
[TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
[TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
20240923 19:58:25.765816 UTC 8237 INFO Loaded config from /home/ubuntu/cortexcpp-beta/models/openhermes-2.5-7b-tensorrt-llm-linux-ada/config.json - tensorrt-llm_engine.cc:421
[TensorRT-LLM][INFO] Engine version 0.11.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
[TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
[TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 32768
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
20240923 19:58:25.890898 UTC 8237 ERROR Failed to load model: [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/runner/actions-runner/_work/cortex.tensorrt-llm/cortex.tensorrt-llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:211)
1 0x7f9ee02bfdb3 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 147
2 0x7f9e19d56fe4 tensorrt_llm::runtime::BufferManager::initMemoryPool(int) + 148
3 0x7f9e19d58e5f tensorrt_llm::runtime::BufferManager::BufferManager(std::shared_ptr<tensorrt_llm::runtime::CudaStream>, bool) + 431
4 0x7f9e19e38593 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger*, float, bool) + 451
5 0x7f9e1a086579 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 937
6 0x7f9e1a0a965b tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 443
7 0x7f9e1a0aee60 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1408
8 0x7f9e1a0aff6a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1978
9 0x7f9e1a0a4cee tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 62
10 0x7f9ee02bdcef tensorrtllm::TensorrtllmEngine::LoadModel(std::shared_ptr<Json::Value>, std::function<void (Json::Value&&, Json::Value&&)>&&) + 3983
11 0x55b453c9b0a6 cortex-beta(+0x2560a6) [0x55b453c9b0a6]
12 0x55b453cabe5c cortex-beta(+0x266e5c) [0x55b453cabe5c]
13 0x55b453cabaef cortex-beta(+0x266aef) [0x55b453cabaef]
14 0x55b453cab858 cortex-beta(+0x266858) [0x55b453cab858]
15 0x55b45428f7d1 cortex-beta(+0x84a7d1) [0x55b45428f7d1]
16 0x55b4541fd9f2 cortex-beta(+0x7b89f2) [0x55b4541fd9f2]
17 0x55b45420b863 cortex-beta(+0x7c6863) [0x55b45420b863]
18 0x55b454209c30 cortex-beta(+0x7c4c30) [0x55b454209c30]
19 0x55b454207b1b cortex-beta(+0x7c2b1b) [0x55b454207b1b]
20 0x55b4541fd28a cortex-beta(+0x7b828a) [0x55b4541fd28a]
21 0x55b4541fcfd9 cortex-beta(+0x7b7fd9) [0x55b4541fcfd9]
22 0x55b4541fc619 cortex-beta(+0x7b7619) [0x55b4541fc619]
23 0x55b4541fbcae cortex-beta(+0x7b6cae) [0x55b4541fbcae]
24 0x55b45420c952 cortex-beta(+0x7c7952) [0x55b45420c952]
25 0x55b45420b024 cortex-beta(+0x7c6024) [0x55b45420b024]
26 0x55b45420911a cortex-beta(+0x7c411a) [0x55b45420911a]
27 0x55b4548cc2ef cortex-beta(+0xe872ef) [0x55b4548cc2ef]
28 0x55b4548be629 cortex-beta(+0xe79629) [0x55b4548be629]
29 0x55b4548bd12d cortex-beta(+0xe7812d) [0x55b4548bd12d]
30 0x55b4548c838c cortex-beta(+0xe8338c) [0x55b4548c838c]
31 0x55b4548c655a cortex-beta(+0xe8155a) [0x55b4548c655a]
32 0x55b4548c51ef cortex-beta(+0xe801ef) [0x55b4548c51ef]
33 0x55b453bbd11a cortex-beta(+0x17811a) [0x55b453bbd11a]
34 0x55b4548b6969 cortex-beta(+0xe71969) [0x55b4548b6969]
35 0x55b4548b6830 cortex-beta(+0xe71830) [0x55b4548b6830]
36 0x55b45489754f cortex-beta(+0xe5254f) [0x55b45489754f]
37 0x55b45489af28 cortex-beta(+0xe55f28) [0x55b45489af28]
38 0x55b45489a9eb cortex-beta(+0xe559eb) [0x55b45489a9eb]
39 0x55b45489baba cortex-beta(+0xe56aba) [0x55b45489baba]
40 0x55b45489ba7d cortex-beta(+0xe56a7d) [0x55b45489ba7d]
41 0x55b45489ba2a cortex-beta(+0xe56a2a) [0x55b45489ba2a]
42 0x55b45489b9fe cortex-beta(+0xe569fe) [0x55b45489b9fe]
43 0x55b45489b9e2 cortex-beta(+0xe569e2) [0x55b45489b9e2]
44 0x7f9ee50f5253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9ee50f5253]
45 0x7f9ee4d7bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9ee4d7bac3]
46 0x7f9ee4e0d850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f9ee4e0d850] - tensorrt-llm_engine.cc:439
[TensorRT-LLM][INFO] Engine version 0.11.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Parameter layer_types cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][INFO] Parameter has_position_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_position_embedding' not found
[TensorRT-LLM][INFO] Parameter has_token_type_embedding cannot be read from json:
[TensorRT-LLM][INFO] [json.exception.out_of_range.403] key 'has_token_type_embedding' not found
[TensorRT-LLM][INFO] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][INFO] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 16
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 32768
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3958 MiB
[TensorRT-LLM][ERROR] Error Code: 6: The engine plan file is generated on an incompatible device, expecting compute 5.2 got compute 8.9, please rebuild.
[TensorRT-LLM][ERROR] [engine.cpp::deserializeEngine::1233] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cudaDeviceGetDefaultMemPool(&memPool, device): operation not supported (/home/runner/actions-runner/_work/cortex.tensorrt-llm/cortex.tensorrt-llm/cpp/tensorrt_llm/runtime/bufferManager.cpp:258)
1 0x7f9ee02bfdb3 void tensorrt_llm::common::check<cudaError>(cudaError, char const*, char const*, int) + 147
2 0x7f9e19d56eb9 tensorrt_llm::runtime::BufferManager::memoryPoolTrimTo(int, unsigned long) + 73
3 0x7f9e180ac791 /home/ubuntu/cortexcpp-beta/engines/cortex.tensorrt-llm/libtensorrt_llm.so(+0x73a791) [0x7f9e180ac791]
4 0x7f9e1a086579 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 937
5 0x7f9e1a0a965b tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 443
6 0x7f9e1a0aee60 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::vector<unsigned char, std::allocator<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1408
7 0x7f9e1a0aff6a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1978
8 0x7f9e1a0a4cee tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 62
9 0x7f9ee02c147a std::_MakeUniq<tensorrt_llm::executor::Executor>::__single_object std::make_unique<tensorrt_llm::executor::Executor, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig&>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, tensorrt_llm::executor::ModelType&&, tensorrt_llm::executor::ExecutorConfig&) + 138
10 0x7f9ee02a9f57 /home/ubuntu/cortexcpp-beta/engines/cortex.tensorrt-llm/libengine.so(+0x88f57) [0x7f9ee02a9f57]
11 0x55b453c9b0a6 cortex-beta(+0x2560a6) [0x55b453c9b0a6]
12 0x55b453cabe5c cortex-beta(+0x266e5c) [0x55b453cabe5c]
13 0x55b453cabaef cortex-beta(+0x266aef) [0x55b453cabaef]
14 0x55b453cab858 cortex-beta(+0x266858) [0x55b453cab858]
15 0x55b45428f7d1 cortex-beta(+0x84a7d1) [0x55b45428f7d1]
16 0x55b4541fd9f2 cortex-beta(+0x7b89f2) [0x55b4541fd9f2]
17 0x55b45420b863 cortex-beta(+0x7c6863) [0x55b45420b863]
18 0x55b454209c30 cortex-beta(+0x7c4c30) [0x55b454209c30]
19 0x55b454207b1b cortex-beta(+0x7c2b1b) [0x55b454207b1b]
20 0x55b4541fd28a cortex-beta(+0x7b828a) [0x55b4541fd28a]
21 0x55b4541fcfd9 cortex-beta(+0x7b7fd9) [0x55b4541fcfd9]
22 0x55b4541fc619 cortex-beta(+0x7b7619) [0x55b4541fc619]
23 0x55b4541fbcae cortex-beta(+0x7b6cae) [0x55b4541fbcae]
24 0x55b45420c952 cortex-beta(+0x7c7952) [0x55b45420c952]
25 0x55b45420b024 cortex-beta(+0x7c6024) [0x55b45420b024]
26 0x55b45420911a cortex-beta(+0x7c411a) [0x55b45420911a]
27 0x55b4548cc2ef cortex-beta(+0xe872ef) [0x55b4548cc2ef]
28 0x55b4548be629 cortex-beta(+0xe79629) [0x55b4548be629]
29 0x55b4548bd12d cortex-beta(+0xe7812d) [0x55b4548bd12d]
30 0x55b4548c838c cortex-beta(+0xe8338c) [0x55b4548c838c]
31 0x55b4548c655a cortex-beta(+0xe8155a) [0x55b4548c655a]
32 0x55b4548c51ef cortex-beta(+0xe801ef) [0x55b4548c51ef]
33 0x55b453bbd11a cortex-beta(+0x17811a) [0x55b453bbd11a]
34 0x55b4548b6969 cortex-beta(+0xe71969) [0x55b4548b6969]
35 0x55b4548b6830 cortex-beta(+0xe71830) [0x55b4548b6830]
36 0x55b45489754f cortex-beta(+0xe5254f) [0x55b45489754f]
37 0x55b45489af28 cortex-beta(+0xe55f28) [0x55b45489af28]
38 0x55b45489a9eb cortex-beta(+0xe559eb) [0x55b45489a9eb]
39 0x55b45489baba cortex-beta(+0xe56aba) [0x55b45489baba]
40 0x55b45489ba7d cortex-beta(+0xe56a7d) [0x55b45489ba7d]
41 0x55b45489ba2a cortex-beta(+0xe56a2a) [0x55b45489ba2a]
42 0x55b45489b9fe cortex-beta(+0xe569fe) [0x55b45489b9fe]
43 0x55b45489b9e2 cortex-beta(+0xe569e2) [0x55b45489b9e2]
44 0x7f9ee50f5253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f9ee50f5253]
45 0x7f9ee4d7bac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f9ee4d7bac3]
46 0x7f9ee4e0d850 /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f9ee4e0d850]

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.

HTTP error: Failed to read connection

What is your OS?

  • [ ] MacOS
  • [ ] Windows
  • [X] Linux

What engine are you running?

  • [ ] cortex.llamacpp (default)
  • [X] cortex.tensorrt-llm (Nvidia GPUs)
  • [ ] cortex.onnx (NPUs, DirectML)

mafischer avatar Sep 23 '24 20:09 mafischer

@mafischer: FYI, we will be queuing TensorRT-LLM issues for Sprint 22, i.e. 7-20 Oct. We'll handle this as part of a TensorRT-LLM overhaul sprint.

  • We're currently getting Cortex to beta release and focusing on llama.cpp functionality
  • Appreciate your patience - we're a small team

dan-menlo avatar Sep 24 '24 01:09 dan-menlo

If I can help in any way, let me know. I specialize in NodeJS development, although from what I can tell this is on the C++ side.

mafischer avatar Sep 24 '24 02:09 mafischer

Closing all open TensorRT-LLM stories due to TensorRT-LLM not supporting desktop. Parent issue: https://github.com/janhq/cortex.cpp/issues/1742

gabrielle-ong avatar Nov 28 '24 07:11 gabrielle-ong