Pernekhan Utemuratov
+1. I also can't see log_probs returning non-zero values in v0.8.0.
@symphonylyh I see your suggested code is in v0.8.0, but does it work for you? Could you confirm?
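For context, this is roughly how the check can be done against the ensemble generate endpoint. It is only a minimal sketch: the prompt and sampling values are placeholders, and the field names (`text_input`, `max_tokens`, `return_log_probs`, `cum_log_probs`, `output_log_probs`) are assumed from the inflight_batcher_llm ensemble config, so adjust them if your model config differs.

```python
import requests

# Minimal sketch of a log-probs check against a local Triton ensemble endpoint.
# The prompt and max_tokens are placeholders; the field names are assumed from
# the inflight_batcher_llm ensemble config and may differ in other setups.
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",
    json={
        "text_input": "The quick brown fox",
        "max_tokens": 8,
        "return_log_probs": True,
    },
)
resp.raise_for_status()
out = resp.json()
# These fields are expected to carry non-zero values; the report is that they
# come back as zeros in v0.8.0.
print(out.get("cum_log_probs"), out.get("output_log_probs"))
```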
No questions. Thank you.
Here is the engine build command: `trtllm-build --checkpoint_dir /data/tgi-data/trtllm/mixtral-8x7b-tp-4-converted/ --remove_input_padding enable --gpt_attention_plugin float16 --context_fmha enable --gemm_plugin float16 --output_dir /data/tgi-data/trtllm/mixtral-fp16-tp4-engine --paged_kv_cache enable --max_batch_size 64 --max_input_len 32768 --max_output_len 4096 --workers 4...`
Any updates on this, @schetlur-nv?
@thorjohnsen Here is the script, with the request file attached: `echo; time curl -Z --parallel-max 64 http://localhost:8000/v2/models/ensemble/generate?[1-64] -d @8k-context-req.txt --output -; echo`. Here is the file `8k-context-req.txt`. You can also...
I used the configs from [all_models/inflight_batcher_llm](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/ensemble/config.pbtxt) with batch_size 64. Here is the script that does what the curl command is trying to do.

```
import requests
import concurrent.futures

# Define the URL...
```
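Since the script above is truncated, here is a minimal sketch of the same pattern: 64 concurrent POSTs to the generate endpoint, mirroring the `curl -Z --parallel-max 64` command. The prompt and `max_tokens` are placeholders, not the actual 8k-context payload from `8k-context-req.txt`.

```python
import requests
import concurrent.futures

URL = "http://localhost:8000/v2/models/ensemble/generate"

# Placeholder payload; the real request in 8k-context-req.txt carries a much
# longer (~8k-token) prompt.
PAYLOAD = {
    "text_input": "placeholder prompt",
    "max_tokens": 256,
}

def send_request(i):
    # Each worker posts one generate request and returns the parsed response.
    resp = requests.post(URL, json=PAYLOAD)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Fire 64 requests concurrently, like `curl -Z --parallel-max 64 ...?[1-64]`.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        results = list(pool.map(send_request, range(64)))
    print(f"Completed {len(results)} requests")
```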
Hi @thorjohnsen, were you able to reproduce the issue?