akhoroshev
The names used at https://github.com/NVIDIA/TensorRT-LLM/blob/b57221b764bc579cbb2490154916a871f620e2c4/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionLaunch.h#L56 and https://github.com/NVIDIA/TensorRT-LLM/blob/b57221b764bc579cbb2490154916a871f620e2c4/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderMaskedMultiheadAttentionTemplate.h#L1309 must be the same.
I was playing with `examples/cpp/gpt/gpt_example.cc` and found that token generation doesn't finish when the first EOD is reached. This is my gpt_config.ini. I use a gpt2 model which was converted into...
Flat JSONs in logs + JsonString for LogExtra
When I convert a llama model with fp16 precision
```bash
python convert_checkpoint.py --model_dir /llama_dir --dtype float16 --output_dir /llama_dir_trt
trtllm-build --checkpoint_dir /llama_dir_trt --output_dir /llama_dir_trt_build --max_batch_size 64 --max_input_len 7168 --max_output_len 1024 --max_num_tokens 32768...
```
Reproduction: ```c++ #include #include #include #include using namespace userver; UTEST_MT(Queue, RaceCondition, 16) { struct Foo {}; using ResponseQueue = userver::concurrent::SpscQueue; auto queue = ResponseQueue::Create(); auto consumer = queue->GetConsumer(); auto producer...
The Executor API introduces Leader and Orchestrator modes. Leader works via MPI. How is Orchestrator mode implemented? Does it use MPI itself? Which mode is preferable for performance: Leader or Orchestrator?
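For context, a minimal sketch of how the two modes are selected through the Executor API. This is an assumption-based illustration, not authoritative: the exact `ParallelConfig`/`OrchestratorConfig` constructor signatures and the engine/worker paths shown here may differ between versions.

```c++
#include "tensorrt_llm/executor/executor.h"

#include <optional>

namespace tle = tensorrt_llm::executor;

int main()
{
    // Leader mode: the process group is started with mpirun, every rank loads
    // the engine, and rank 0 (the leader) is the one that accepts requests.
    tle::ExecutorConfig leaderConfig;
    leaderConfig.setParallelConfig(
        tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kLEADER));

    // Orchestrator mode: this process only orchestrates and spawns worker
    // processes (the worker executable path below is illustrative).
    tle::OrchestratorConfig orchestrator(/*isOrchestrator=*/true, "/path/to/executorWorker");
    tle::ExecutorConfig orchestratorConfig;
    orchestratorConfig.setParallelConfig(
        tle::ParallelConfig(tle::CommunicationType::kMPI, tle::CommunicationMode::kORCHESTRATOR,
                            std::nullopt, std::nullopt, orchestrator));

    // Engine path is illustrative.
    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, leaderConfig);
    return 0;
}
```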
This [document](https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/docs/source/inference_request.md?plain=1#L16) describes the tensor datatypes for the [GptManager InferenceRequest](https://github.com/NVIDIA/TensorRT-LLM/blob/118b3d7e7bab720d8ea9cd95338da60f7512c93a/cpp/include/tensorrt_llm/batch_manager/inferenceRequest.h). My question is: what kind of memory is needed for these tensors? Pinned/Pageable/Device? I can't find information about this.
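For concreteness, this is what the three kinds of allocations in the question would look like when building, say, an `input_ids` tensor with the runtime's `BufferManager`. A sketch only, assuming the current `BufferManager`/`ITensor` API; it does not answer which kind `GptManager` actually expects.

```c++
#include "tensorrt_llm/runtime/bufferManager.h"
#include "tensorrt_llm/runtime/cudaStream.h"
#include "tensorrt_llm/runtime/iTensor.h"

#include <memory>

namespace tr = tensorrt_llm::runtime;

int main()
{
    auto const shape = tr::ITensor::makeShape({1, 8});  // e.g. [batch_size, input_len]

    // Ordinary pageable host memory.
    auto pageableIds = tr::BufferManager::cpu(shape, nvinfer1::DataType::kINT32);

    // Pinned (page-locked) host memory.
    auto pinnedIds = tr::BufferManager::pinned(shape, nvinfer1::DataType::kINT32);

    // Device memory: gpu() is tied to a CUDA stream, so it needs an instance.
    tr::BufferManager manager{std::make_shared<tr::CudaStream>()};
    auto deviceIds = manager.gpu(shape, nvinfer1::DataType::kINT32);

    return 0;
}
```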
1. Build mixtral for tp8
2. Run `mpirun -n 8 ./gptSessionBenchmark`
3. nvidia-smi shows
```
Wed May 15 09:13:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4...
```
The current architecture of TensorRT-LLM means that the logits processor (for both [executor](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/executor/types.h#L55) and [batch_manager](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/include/tensorrt_llm/batch_manager/llmRequest.h#L58)) is called independently for **each request**. But this is a bad approach in terms of performance. For...
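For illustration, a hedged sketch of what the current per-request hook looks like. The callback alias below only approximates `LogitsPostProcessor` from `executor/types.h` (the real alias may carry extra parameters); the point is that the runtime enters user code once per request, so a batch of N requests pays N separate callback invocations and kernel launches.

```c++
#include "tensorrt_llm/executor/executor.h"

#include <functional>

namespace tle = tensorrt_llm::executor;

// Approximate shape of the per-request callback (see executor/types.h).
using PerRequestLogitsProcessor = std::function<void(
    tle::IdType /*requestId*/, tle::Tensor& /*logits*/,
    tle::BeamTokens const& /*tokens generated so far*/, tle::StreamPtr const& /*stream*/)>;

PerRequestLogitsProcessor makeBanTokenProcessor(tle::TokenIdType bannedToken)
{
    return [bannedToken](tle::IdType requestId, tle::Tensor& logits,
                         tle::BeamTokens const& tokens, tle::StreamPtr const& stream)
    {
        // A real implementation would launch a kernel on `stream` that sets
        // logits[..., bannedToken] to -inf. Because the hook is per request,
        // a batch of N requests means N separate invocations/launches, which
        // is the performance concern raised above.
        (void)requestId;
        (void)logits;
        (void)tokens;
        (void)stream;
    };
}

int main()
{
    auto processor = makeBanTokenProcessor(/*bannedToken=*/42);  // token id is illustrative
    (void)processor;
    return 0;
}
```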
According to the [docs](https://github.com/NVIDIA/TensorRT-LLM/blob/5d8ca2faf74c494f220c8f71130340b513eea9a9/docs/source/kv_cache_reuse.md?plain=1#L50), reusable blocks are evicted based on LRU. LRU is a good approach. But I know that for some queries (prompts) they won't be reused, and I want...
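For reference, a minimal sketch of the knob that exists today: block reuse is a single global switch on the executor's `KvCacheConfig` (constructor shape assumed here), and eviction of reusable blocks is then LRU-driven. Per-prompt control over which blocks are worth keeping is what this issue is asking for.

```c++
#include "tensorrt_llm/executor/executor.h"

namespace tle = tensorrt_llm::executor;

int main()
{
    // Today reuse is all-or-nothing: every request may reuse cached blocks,
    // and reusable blocks are evicted by LRU when space is needed.
    tle::KvCacheConfig kvCacheConfig(/*enableBlockReuse=*/true);

    tle::ExecutorConfig config;
    config.setKvCacheConfig(kvCacheConfig);

    // Engine path is illustrative.
    tle::Executor executor("/path/to/engine_dir", tle::ModelType::kDECODER_ONLY, config);
    return 0;
}
```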