akhoroshev
This PR adds support for DeepSeek MoE: https://huggingface.co/deepseek-ai/deepseek-moe-16b-base

Main differences from Mixtral:
1. Shared experts
2. First layers are dense
3. MoE normalization disabled

Build:
```bash
cd TensorRT-LLM/examples/llama
python...
```
When passing 1 as the minimum length, I expect at least one non-EOS token. But since 1 is the default value, the `minLengths` tensor is [nullptr](https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/layers/penaltyLayer.cpp#L323) and the [penalty](https://github.com/NVIDIA/TensorRT-LLM/blob/a96cccafcf6365c128f004f779160951f8c0801c/cpp/tensorrt_llm/kernels/penaltyKernels.cu#L198) doesn't work.
I'm testing the [kv reuse feature](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md). Everything works fine until I try to use [offloading to host memory](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md#offloading-to-host-memory). I enable offloading with these lines:
```c++
optionalParams.kvCacheConfig.hostCacheSize = 40000000000;
optionalParams.kvCacheConfig.onboardBlocks = true;
...
```
SleepFor is not interruptible by default, while InterruptibleSleepFor is. userver::concurrent::SpscQueue::Consumer.Wait() is interruptible by default, but the documentation says nothing about this. This is misleading.
A BidirectionalStream [supports](https://github.com/userver-framework/userver/blob/develop/grpc/include/userver/ugrpc/server/rpc.hpp#L305) concurrent operations, but an InputStream [does not](https://github.com/userver-framework/userver/blob/develop/grpc/include/userver/ugrpc/server/rpc.hpp#L175).
```
==== backtrace (tid:1508653) ====
 0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7f4c5a8dae4c]
 1 /lib64/libucs.so.0(+0x2c02c) [0x7f4c5a8db02c]
 2 /lib64/libucs.so.0(+0x2c1fa) [0x7f4c5a8db1fa]
 3 /lib64/libpthread.so.0(+0x12cf0) [0x7f4c5ca8dcf0]
 4 /lib64/libcuda.so.1(+0x18d25c) [0x7f4c5e1d625c]
 5 /lib64/libcuda.so.1(+0xe3ee3) [0x7f4c5e12cee3]
 6 /lib64/libcuda.so.1(+0x23100c) [0x7f4c5e27a00c]
 7 /lib64/libcuda.so.1(+0x4ddc05) [0x7f4c5e526c05]
 8...
```
My model is:
```json
{
  "mlp_bias": false,
  "attn_bias": false,
  "rotary_base": 300000,
  "rotary_scaling": null,
  "residual_mlp": false,
  "disable_weight_only_quant_plugin": false,
  "moe": {
    "num_experts": 0,
    "top_k": 0,
    "normalization_mode": null,
    "sparse_mixer_epsilon": 0.01,
    "tp_mode": 0
  },
...
```