akhoroshev
The names used at https://github.com/NVIDIA/TensorRT-LLM/blob/b57221b764bc579cbb2490154916a871f620e2c4/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h#L56 and https://github.com/NVIDIA/TensorRT-LLM/blob/b57221b764bc579cbb2490154916a871f620e2c4/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h#L1309 must be the same.
I was playing with `examples/cpp/gpt/gpt_example.cc` and found that token generation doesn't finish when the first EOD is reached. This is my gpt_config.ini. I use a gpt2 model which was converted into...
Flat JSONs in logs + JsonString for LogExtra
When I convert a llama model with fp16 precision
```bash
python convert_checkpoint.py --model_dir /llama_dir --dtype float16 --output_dir /llama_dir_trt
trtllm-build --checkpoint_dir /llama_dir_trt --output_dir /llama_dir_trt_build --max_batch_size 64 --max_input_len 7168 --max_output_len 1024 --max_num_tokens 32768...
```
Reproduction: ```c++ #include #include #include #include using namespace userver; UTEST_MT(Queue, RaceCondition, 16) { struct Foo {}; using ResponseQueue = userver::concurrent::SpscQueue; auto queue = ResponseQueue::Create(); auto consumer = queue->GetConsumer(); auto producer...
The Executor API introduces Leader and Orchestrator modes. Leader works via MPI. How is Orchestrator mode implemented? Does it use MPI itself? Which mode is preferable for performance: Leader or Orchestrator?
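For context, a minimal sketch of how the two modes are selected through the Executor API. This is an assumption-based illustration, not authoritative: the exact `ParallelConfig`/`OrchestratorConfig` constructor signatures and the engine/worker paths shown here may differ between versions.

```c++
#include "tensorrt_llm/executor/executor.h"

#include <optional>

namespace tle = tensorrt_llm::executor;

int main()
{
    // Leader mode: the process group is started with mpirun, every rank loads
    // the engine, and rank 0 (the leader) is the one that accepts requests.
    tle::ExecutorConfig leaderConfig;
    leaderConfig.setParallelConfig(
        tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kLEADER));

    // Orchestrator mode: this process only orchestrates and spawns worker
    // processes (the worker executable path below is illustrative).
    tle::OrchestratorConfig orchestrator(/*isOrchestrator=*/true, "/path/to/executorWorker");
    tle::ExecutorConfig orchestratorConfig;
    orchestratorConfig.setParallelConfig(
        tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
                            std::nullopt, std::nullopt, orchestrator));

    // Engine path is illustrative.
    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, leaderConfig);
    return 0;
}
```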
This [document](https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/docs/source/inference_request.md?plain=1#L16) describes the tensor datatypes for the [GptManager InferenceRequest](https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h). My question is: what kind of memory is needed for these tensors? Pinned/Pageable/Device? I can't find information about this.
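For concreteness, this is what the three kinds of allocations in the question would look like when building, say, an `input_ids` tensor with the runtime's `BufferManager`. A sketch only, assuming the current `BufferManager`/`ITensor` API; it does not answer which kind `GptManager` actually expects.

```c++
#include "tensorrt_llm/runtime/bufferManager.h"
#include "tensorrt_llm/runtime/cudaStream.h"
#include "tensorrt_llm/runtime/iTensor.h"

#include <memory>

namespace tr = tensorrt_llm::runtime;

int main()
{
    auto const shape = tr::ITensor::makeShape({1, 8});  // e.g. [batch_size, input_len]

    // Ordinary pageable host memory.
    auto pageableIds = tr::BufferManager::cpu(shape, nvinfer1::DataType::kINT32);

    // Pinned (page-locked) host memory.
    auto pinnedIds = tr::BufferManager::pinned(shape, nvinfer1::DataType::kINT32);

    // Device memory: gpu() is tied to a CUDA stream, so it needs an instance.
    tr::BufferManager manager{std::make_shared<tr::CudaStream>()};
    auto deviceIds = manager.gpu(shape, nvinfer1::DataType::kINT32);

    return 0;
}
```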
1. Build mixtral for tp8
2. Run `mpirun -n 8 ./gptSessionBenchmark`
3. nvidia-smi shows
```
Wed May 15 09:13:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4...
```
The current architecture of TensorRT-LLM means that the logits processor (for both [executor](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/executor/types.h#L55) and [batch_manager](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/llmRequest.h#L58)) is called independently for **each request**. But this is a bad approach in terms of performance. For...
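For illustration, a hedged sketch of what the current per-request hook looks like. The callback alias below only approximates `LogitsPostProcessor` from `executor/types.h` (the real alias may carry extra parameters); the point is that the runtime enters user code once per request, so a batch of N requests pays N separate callback invocations and kernel launches.

```c++
#include "tensorrt_llm/executor/executor.h"

#include <functional>

namespace tle = tensorrt_llm::executor;

// Approximate shape of the per-request callback (see executor/types.h).
using PerRequestLogitsProcessor = std::function<void(
    tle::IdType /*requestId*/, tle::Tensor& /*logits*/,
    tle::BeamTokens const& /*tokens generated so far*/, tle::StreamPtr const& /*stream*/)>;

PerRequestLogitsProcessor makeBanTokenProcessor(tle::TokenIdType bannedToken)
{
    return [bannedToken](tle::IdType requestId, tle::Tensor& logits,
                         tle::BeamTokens const& tokens, tle::StreamPtr const& stream)
    {
        // A real implementation would launch a kernel on `stream` that sets
        // logits[..., bannedToken] to -inf. Because the hook is per request,
        // a batch of N requests means N separate invocations/launches, which
        // is the performance concern raised above.
        (void)requestId;
        (void)logits;
        (void)tokens;
        (void)stream;
    };
}

int main()
{
    auto processor = makeBanTokenProcessor(/*bannedToken=*/42);  // token id is illustrative
    (void)processor;
    return 0;
}
```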
According to the [docs](https://github.com/NVIDIA/TensorRT-LLM/blob/5d8ca2faf74c494f220c8f71130340b513eea9a9/docs/source/kv_cache_reuse.md?plain=1#L50), reusable blocks are evicted based on LRU. LRU is a good approach. But I know that for some queries (prompts) they won't be reused, and I want...
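For reference, a minimal sketch of the knob that exists today: block reuse is a single global switch on the executor's `KvCacheConfig` (constructor shape assumed here), and eviction of reusable blocks is then LRU-driven. Per-prompt control over which blocks are worth keeping is what this issue is asking for.

```c++
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

int main()
{
    // Today reuse is all-or-nothing: every request may reuse cached blocks,
    // and reusable blocks are evicted by LRU when space is needed.
    tle::KvCacheConfig kvCacheConfig(/*enableBlockReuse=*/true);

    tle::ExecutorConfig config;
    config.setKvCacheConfig(kvCacheConfig);

    // Engine path is illustrative.
    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, config);
    return 0;
}
```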