
Dtype value read by app.py is empty, causing the error: Unsupported dtype

RoslinAdama opened this issue 9 months ago

I am trying to use trt-llm-rag with the Mistral 7B model. I used INT8 weight-only quantization when building the TRT engine. The app launches, but throws an error when an input is passed to the chat:

```
  File "tensorrt_llm/runtime/generation.py", line 834, in dtype
    return str_dtype_to_torch(self._model_config.dtype)
  File "tensorrt_llm/_utils.py", line 149, in str_dtype_to_torch
    assert ret is not None, f'Unsupported dtype: {dtype}'
```

The reason is that dtype is empty (== ""). This may be due to an error in reading the config file.
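For illustration, here is a minimal sketch of how an empty dtype string trips that assertion. Only the function name and the assert message come from the traceback above; the mapping dict is an assumption, not the actual library code:

```python
import torch

# Assumed, simplified version of str_dtype_to_torch from tensorrt_llm/_utils.py:
# an unknown (here: empty) dtype string maps to None and fails the assert.
_str_to_torch_dtype = {
    "float16": torch.float16,
    "float32": torch.float32,
    "bfloat16": torch.bfloat16,
    "int8": torch.int8,
}

def str_dtype_to_torch(dtype):
    ret = _str_to_torch_dtype.get(dtype)
    assert ret is not None, f"Unsupported dtype: {dtype}"
    return ret

str_dtype_to_torch("")  # AssertionError: Unsupported dtype:
```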

Here is the config.json for the engine:

{ "version": "0.9.0.dev2024040900", "pretrained_config": { "architecture": "MistralForCausalLM", "dtype": "float16", "logits_dtype": "float32", "vocab_size": 32000, "max_position_embeddings": 32768, "hidden_size": 4096, "num_hidden_layers": 32, "num_attention_heads": 32, "num_key_value_heads": 8, "head_size": 128, "hidden_act": "silu", "intermediate_size": 14336, "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "mapping": { "world_size": 1, "tp_size": 1, "pp_size": 1 }, "quantization": { "quant_algo": "W8A16", "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": [ "lm_head" ] }, "kv_dtype": "float16", "rotary_scaling": null, "moe_normalization_mode": null, "rotary_base": 1000000.0, "moe_num_experts": 0, "moe_top_k": 0, "moe_tp_mode": 2, "attn_bias": false, "disable_weight_only_quant_plugin": false, "mlp_bias": false }, "build_config": { "max_input_len": 1024, "max_output_len": 1024, "max_batch_size": 1, "max_beam_width": 1, "max_num_tokens": 1024, "opt_num_tokens": 1, "max_prompt_embedding_table_size": 0, "gather_context_logits": false, "gather_generation_logits": false, "strongly_typed": false, "builder_opt": null, "profiling_verbosity": "layer_names_only", "enable_debug_output": false, "max_draft_len": 0, "use_refit": false, "input_timing_cache": null, "output_timing_cache": "model.cache", "lora_config": { "lora_dir": [], "lora_ckpt_source": "hf", "max_lora_rank": 64, "lora_target_modules": [], "trtllm_modules_to_hf_modules": {} }, "auto_parallel_config": { "world_size": 1, "gpus_per_node": 8, "cluster_key": "A40", "cluster_info": null, "sharding_cost_model": "alpha_beta", "comm_cost_model": "alpha_beta", "enable_pipeline_parallelism": false, "enable_shard_unbalanced_shape": false, "enable_shard_dynamic_shape": false, "enable_reduce_scatter": true, "builder_flags": null, "debug_mode": false, "infer_shape": true, "validation_mode": false, "same_buffer_io": { "past_key_value_(\d+)": "present_key_value_\1" }, "same_spec_io": {}, "sharded_io_allowlist": [ "past_key_value_\d+", "present_key_value_\d*" ], "fast_reduce": true, "fill_weights": false, "parallel_config_cache": null, "profile_cache": null, "dump_path": null, "debug_outputs": [] }, "weight_sparsity": false, "use_fused_mlp": false, "plugin_config": { "bert_attention_plugin": "float16", "gpt_attention_plugin": "float16", "gemm_plugin": "float16", "smooth_quant_gemm_plugin": null, "identity_plugin": null, "layernorm_quantization_plugin": null, "rmsnorm_quantization_plugin": null, "nccl_plugin": null, "lookup_plugin": null, "lora_plugin": null, "weight_only_groupwise_quant_matmul_plugin": null, "weight_only_quant_matmul_plugin": "float16", "quantize_per_token_plugin": false, "quantize_tensor_plugin": false, "moe_plugin": "float16", "mamba_conv1d_plugin": "float16", "context_fmha": true, "context_fmha_fp32_acc": false, "paged_kv_cache": true, "remove_input_padding": true, "use_custom_all_reduce": true, "multi_block_mode": false, "enable_xqa": true, "attention_qk_half_accumulation": false, "tokens_per_block": 128, "use_paged_context_fmha": false, "use_fp8_context_fmha": false, "use_context_fmha_for_generation": false, "multiple_profiles": false, "paged_state": true, "streamingllm": false } } }

RoslinAdama · Apr 30 '24

It seems that some values from the config are not read, so the default values are loaded instead; that is why dtype is empty. Several values are not picked up from the config (see the sketch after this list):

  • dtype
  • max_batch_size
  • max_beam_width
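For reference, a minimal sketch of where these values actually live in the 0.9.0 unified config.json posted above. The key paths are taken directly from that file; the surrounding script and the file path are illustrative only:

```python
import json

# Load the engine config posted above (path is a placeholder).
with open("config.json") as f:
    cfg = json.load(f)

# In this config layout, dtype sits under "pretrained_config", while
# max_batch_size and max_beam_width sit under "build_config". Code that
# expects them at the top level would fall back to defaults (empty dtype).
dtype = cfg["pretrained_config"]["dtype"]               # "float16"
max_batch_size = cfg["build_config"]["max_batch_size"]  # 1
max_beam_width = cfg["build_config"]["max_beam_width"]  # 1

print(dtype, max_batch_size, max_beam_width)
```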

RoslinAdama · May 06 '24

Before integrating the model into ChatRTX, can you please do basic inference testing using TensorRT-LLM directly? There are example scripts in the TensorRT-LLM git repo; you can follow the instructions at https://github.com/NVIDIA/TensorRT-LLM/tree/v0.9.0/examples/llama. An INT8 example is also there.
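For example, a sanity check with the engine you built might look like the following, run from the examples directory. The paths are placeholders, and the run.py flags follow the v0.9.0 examples README, so double-check them against your checkout:

```
python ../run.py --max_output_len=50 \
    --tokenizer_dir ./Mistral-7B-v0.1 \
    --engine_dir ./trt_engines/mistral/int8_weight_only/1-gpu
```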

anujj · May 23 '24