unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'name' not found
System Info
- GPUs: 8 × RTX 4090 (24 GB each)
- tensorrt_llm version: 0.11.0.dev2024051400
Who can help?
@T
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```bash
export HF_LLAMA_MODEL=/network/model/Meta-Llama-3-8B
export ENGINE_PATH=/network/engine/engine_outputs_llama3_8B

python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

pip install SentencePiece
python3 scripts/launch_triton_server.py --world_size 8 --model_repo=llama_ifb/
```
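As a quick sanity check (my own helper, not part of the repo), the sketch below confirms that fill_template.py actually wrote the engine directory into the generated tensorrt_llm config.pbtxt; the file paths are the ones assumed by the commands above.

```python
# Sanity-check sketch (not part of the official tooling): confirm that the
# engine directory was substituted into the generated tensorrt_llm
# config.pbtxt. Paths assume the llama_ifb layout from the commands above.
from pathlib import Path

engine_path = "/network/engine/engine_outputs_llama3_8B"
config_text = Path("llama_ifb/tensorrt_llm/config.pbtxt").read_text()

hits = [line.strip() for line in config_text.splitlines() if engine_path in line]
print("\n".join(hits) if hits else "engine path not found in config.pbtxt!")
```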
Expected behavior
The Triton server starts and runs correctly.
actual behavior
When I deploy Llama 3 (Meta-Llama-3-8B) on 8 × RTX 4090, launching the Triton server fails with the error from the title:

`unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'name' not found`

How can I fix it? Thanks!
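The `[json.exception.out_of_range.403]` message is what nlohmann::json throws when an `.at("name")` lookup fails, so the backend appears to be reading a JSON file that is missing a `name` key it expects, most likely the engine's config.json. A minimal diagnostic sketch, assuming config.json lives under the `ENGINE_PATH` used above:

```python
# Minimal diagnostic sketch (assumed path; adjust to your ENGINE_PATH): list
# the keys that the engine's config.json actually contains, so you can see
# whether the section the backend expects is present.
import json

config_path = "/network/engine/engine_outputs_llama3_8B/config.json"

with open(config_path) as f:
    config = json.load(f)

print("top-level keys:", sorted(config))
for key, value in config.items():
    if isinstance(value, dict):
        print(f"  {key}:", sorted(value))
```

If the keys differ from what the backend expects, one possible cause is that the engine was built with a different tensorrt_llm version than the one the Triton backend was compiled against.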
additional notes
None.
When I add a fake "name" key to config.json, it raises another error.
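For reference, a hypothetical reconstruction of that edit is below; the file path and the placeholder value are assumptions, and this only reproduces the experiment rather than fixing the underlying problem.

```python
# Hypothetical reconstruction of the workaround tried above: add a fake
# "name" key at the top level of the engine's config.json. Path and value
# are guesses; this is not a recommended fix.
import json

config_path = "/network/engine/engine_outputs_llama3_8B/config.json"

with open(config_path) as f:
    config = json.load(f)

config.setdefault("name", "tensorrt_llm")  # placeholder just to satisfy the lookup

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```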
I am running into the same problem as well.
Do you encounter a similar issue with LLaMA 2?