Pernekhan Utemuratov
+1. I also can't see log_probs returning non-zero values in v0.8.0.
@symphonylyh I see your suggested code is in v0.8.0, but does it work for you? Could you confirm?
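For context, this is roughly how the check can be done against the ensemble generate endpoint. It is only a minimal sketch: the prompt and sampling values are placeholders, and the field names (`text_input`, `max_tokens`, `return_log_probs`, `cum_log_probs`, `output_log_probs`) are assumed from the inflight_batcher_llm ensemble config, so adjust them if your model config differs.

```python
import requests

# Minimal sketch of a log-probs check against a local Triton ensemble endpoint.
# The prompt and max_tokens are placeholders; the field names are assumed from
# the inflight_batcher_llm ensemble config and may differ in other setups.
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "The quick brown fox",
        "max_tokens": 8,
        "return_log_probs": True,
    },
)
resp.raise_for_status()
out = resp.json()
# These fields are expected to carry non-zero values; the report is that they
# come back as zeros in v0.8.0.
print(out.get("cum_log_probs"), out.get("output_log_probs"))
```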
No questions. Thank you.
Here is the engine build command: `trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4...`
Any updates on this, @schetlur-nv?
@thorjohnsen Here is the script, with the request file attached: `echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo`. Here is the file `8k-context-req.txt`. You can also...
I used the configs from [all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/ensemble/config.pbtxt) with batch_size 64. Here is the script that does what the curl command is trying to do.

```
import requests
import concurrent.futures

# Define the URL...
```
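Since the script above is truncated, here is a minimal sketch of the same pattern: 64 concurrent POSTs to the generate endpoint, mirroring the `curl -Z --parallel-max 64` command. The prompt and `max_tokens` are placeholders, not the actual 8k-context payload from `8k-context-req.txt`.

```python
import requests
import concurrent.futures

URL = "http://localhost:8000/v2/models/ensemble/generate"

# Placeholder payload; the real request in 8k-context-req.txt carries a much
# longer (~8k-token) prompt.
PAYLOAD = {
    "text_input": "placeholder prompt",
    "max_tokens": 256,
}

def send_request(i):
    # Each worker posts one generate request and returns the parsed response.
    resp = requests.post(URL, json=PAYLOAD)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Fire 64 requests concurrently, like `curl -Z --parallel-max 64 ...?[1-64]`.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        results = list(pool.map(send_request, range(64)))
    print(f"Completed {len(results)} requests")
```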
Hi @thorjohnsen, were you able to reproduce the issue?