trt-llm-rag-windows
Dtype value read by app.py is empty; gets error message: Unsupported dtype
I am trying to use trt-llm-rag-windows with the Mistral 7B model. I used INT8 weight-only quantization when building the TRT engine. The app launches, but throws an error as soon as an input is passed to the chat:
```
tensorrt_llm/runtime/generation.py", line 834, in dtype
    return str_dtype_to_torch(self._model_config.dtype)
tensorrt_llm/_utils.py", line 149, in str_dtype_to_torch
    assert ret is not None, f'Unsupported dtype: {dtype}'
```
The reason is that dtype is empty (== ""). This may be due to an error in reading the config file.
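For reference, the helper named in the traceback is essentially a dictionary lookup followed by that assertion, so an empty string fails exactly like an unknown dtype would. A minimal sketch (the lookup table is paraphrased, not the exact contents of tensorrt_llm/_utils.py):

```python
import torch

# Paraphrased sketch of tensorrt_llm._utils.str_dtype_to_torch:
# a plain dict lookup, so dtype == "" yields None and trips the assert.
_STR_TO_TORCH_DTYPE = {
    "float16": torch.float16,
    "float32": torch.float32,
    "bfloat16": torch.bfloat16,
    "int8": torch.int8,
    "int32": torch.int32,
}

def str_dtype_to_torch(dtype):
    ret = _STR_TO_TORCH_DTYPE.get(dtype)
    assert ret is not None, f'Unsupported dtype: {dtype}'
    return ret

str_dtype_to_torch("")  # AssertionError: Unsupported dtype:
```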
Here is the config.json for the engine:
{ "version": "0.9.0.dev2024040900", "pretrained_config": { "architecture": "MistralForCausalLM", "dtype": "float16", "logits_dtype": "float32", "vocab_size": 32000, "max_position_embeddings": 32768, "hidden_size": 4096, "num_hidden_layers": 32, "num_attention_heads": 32, "num_key_value_heads": 8, "head_size": 128, "hidden_act": "silu", "intermediate_size": 14336, "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "mapping": { "world_size": 1, "tp_size": 1, "pp_size": 1 }, "quantization": { "quant_algo": "W8A16", "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": [ "lm_head" ] }, "kv_dtype": "float16", "rotary_scaling": null, "moe_normalization_mode": null, "rotary_base": 1000000.0, "moe_num_experts": 0, "moe_top_k": 0, "moe_tp_mode": 2, "attn_bias": false, "disable_weight_only_quant_plugin": false, "mlp_bias": false }, "build_config": { "max_input_len": 1024, "max_output_len": 1024, "max_batch_size": 1, "max_beam_width": 1, "max_num_tokens": 1024, "opt_num_tokens": 1, "max_prompt_embedding_table_size": 0, "gather_context_logits": false, "gather_generation_logits": false, "strongly_typed": false, "builder_opt": null, "profiling_verbosity": "layer_names_only", "enable_debug_output": false, "max_draft_len": 0, "use_refit": false, "input_timing_cache": null, "output_timing_cache": "model.cache", "lora_config": { "lora_dir": [], "lora_ckpt_source": "hf", "max_lora_rank": 64, "lora_target_modules": [], "trtllm_modules_to_hf_modules": {} }, "auto_parallel_config": { "world_size": 1, "gpus_per_node": 8, "cluster_key": "A40", "cluster_info": null, "sharding_cost_model": "alpha_beta", "comm_cost_model": "alpha_beta", "enable_pipeline_parallelism": false, "enable_shard_unbalanced_shape": false, "enable_shard_dynamic_shape": false, "enable_reduce_scatter": true, "builder_flags": null, "debug_mode": false, "infer_shape": true, "validation_mode": false, "same_buffer_io": { "past_key_value_(\d+)": "present_key_value_\1" }, "same_spec_io": {}, "sharded_io_allowlist": [ "past_key_value_\d+", "present_key_value_\d*" ], "fast_reduce": true, "fill_weights": false, "parallel_config_cache": null, "profile_cache": null, "dump_path": null, "debug_outputs": [] }, "weight_sparsity": false, "use_fused_mlp": false, "plugin_config": { "bert_attention_plugin": "float16", "gpt_attention_plugin": "float16", "gemm_plugin": "float16", "smooth_quant_gemm_plugin": null, "identity_plugin": null, "layernorm_quantization_plugin": null, "rmsnorm_quantization_plugin": null, "nccl_plugin": null, "lookup_plugin": null, "lora_plugin": null, "weight_only_groupwise_quant_matmul_plugin": null, "weight_only_quant_matmul_plugin": "float16", "quantize_per_token_plugin": false, "quantize_tensor_plugin": false, "moe_plugin": "float16", "mamba_conv1d_plugin": "float16", "context_fmha": true, "context_fmha_fp32_acc": false, "paged_kv_cache": true, "remove_input_padding": true, "use_custom_all_reduce": true, "multi_block_mode": false, "enable_xqa": true, "attention_qk_half_accumulation": false, "tokens_per_block": 128, "use_paged_context_fmha": false, "use_fp8_context_fmha": false, "use_context_fmha_for_generation": false, "multiple_profiles": false, "paged_state": true, "streamingllm": false } } }
It seems that some values from the config are not read, so the defaults are loaded instead; that is why dtype is empty. Several values are not picked up from the config (see the quick check after this list):
- dtype
- max_batch_size
- max_beam_width
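A quick way to confirm that these values are actually present in the file (pointing to the reading logic in app.py, not the engine build, as the culprit) is to load it directly; the engine directory path below is a placeholder:

```python
import json

# Load the engine's config.json and print the three fields that app.py
# appears to replace with defaults. The path is a placeholder.
with open("<engine_dir>/config.json") as f:
    cfg = json.load(f)

print(cfg["pretrained_config"]["dtype"])      # prints: float16
print(cfg["build_config"]["max_batch_size"])  # prints: 1
print(cfg["build_config"]["max_beam_width"])  # prints: 1
```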
Before integrating the model into ChatRTX, can you please do basic inference testing using TensorRT-LLM? There are example scripts in the TensorRT-LLM git repo; you can follow the instructions at https://github.com/NVIDIA/TensorRT-LLM/tree/v0.9.0/examples/llama. An INT8 example is also there.
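For a standalone check along those lines, something like this mirrors what the examples/run.py script does (a minimal sketch assuming the ModelRunner API shipped with TensorRT-LLM v0.9; the engine path and tokenizer name are placeholders):

```python
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

# Placeholders: point these at your built engine and the matching tokenizer.
engine_dir = "<engine_dir>"
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

# Load the engine and run a single prompt, roughly as examples/run.py does.
runner = ModelRunner.from_dir(engine_dir=engine_dir)
input_ids = tokenizer("What is the capital of France?",
                      return_tensors="pt").input_ids.int()

output_ids = runner.generate(batch_input_ids=[input_ids[0]],
                             max_new_tokens=64,
                             end_id=tokenizer.eos_token_id,
                             pad_id=tokenizer.eos_token_id)

print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))
```

If this reproduces the empty-dtype assertion outside the app, the problem is in the engine or its config; if it runs cleanly, the config-reading logic in app.py is the place to look.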