Ruihang Lai
Previously, when a Relay function contained a Call that directly used Tuples as arguments (as in the example below), ``` %25 = (%23, %24) /* ty=(Tensor[(1, 160), float32], Tensor[(1, 160), float32]) */;...
**Core infrastructure: ExprVisitor / ExprMutator** - For simple rewriting passes (e.g., FMA-rewrite, CallTIR-rewrite), one can get the post-written Call directly from the old Call. ```cpp class EwiseFMARewriter : public ExprMutator...
This PR updates the Android build-from-source instructions. Some dependency steps were missed previously.
Hi, I noticed that for the V1.0 version, the 7B and 13B models use different conversation prompts. I am wondering, for the WizardLM-13B-V1.2 model this time, what prompt should...
Following #1805, this PR adds support for the Gemma model in MLC Serve. _Tests and examples are still a work in progress._
## Overview
This issue tracks the support of RoPE scaling, one important configurable parameter adopted by many new models, in MLC LLM.
## Action Items
- [ ] Support linear...
This PR bumps the FlashInfer version to support manually configuring which kernels are built via config.cmake. Prior to this PR, the set of kernels to build was hardcoded in the FlashInfer header files.
This PR fixes a bug in the PagedKVCache that can occur when sequences are removed in an order other than the reverse of the order in which they were added or forked. With this fix,...
This PR introduces benchmark support for fixed request rates. Specifically, * We introduced the `AttachRequestRateTimestamp` request processor, which attaches timestamps according to the specified request rate with regard to...
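To illustrate the idea of attaching timestamps for a fixed request rate, here is a minimal Python sketch. It is not the PR's implementation: the `Request` class and the `attach_fixed_rate_timestamps` helper are hypothetical names, and the sketch assumes evenly spaced timestamps measured in seconds from the start of the benchmark.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    """Hypothetical request record, used only for this illustration."""
    request_id: int
    timestamp: Optional[float] = None  # seconds since benchmark start


def attach_fixed_rate_timestamps(
    requests: List[Request], request_rate: float
) -> List[Request]:
    """Assign evenly spaced send times so that `request_rate` requests
    are scheduled per second (assumption: the first request is at t=0)."""
    if request_rate <= 0:
        raise ValueError("request_rate must be positive")
    interval = 1.0 / request_rate  # seconds between consecutive requests
    for index, request in enumerate(requests):
        request.timestamp = index * interval
    return requests


# Four requests at 2 requests per second are spaced 0.5 s apart.
requests = attach_fixed_rate_timestamps(
    [Request(request_id=i) for i in range(4)], request_rate=2.0
)
timestamps = [request.timestamp for request in requests]
```

A benchmark driver could then sleep until each request's timestamp before sending it, which keeps the offered load independent of how fast the server responds.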
The warp reduction implemented with the "shuffle down" primitive takes a mask denoting the active threads within the warp that participate in the shuffle. Previously we computed the mask, while in...
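As a rough illustration of what the mask means, the following Python sketch emulates a shuffle-down sum reduction on the CPU. It is not GPU code and does not reflect the change made in this PR; the helper names are hypothetical, and the sketch assumes a 32-lane warp in which the lowest `num_active` lanes participate and `num_active` is a power of two.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp


def active_mask(num_active: int) -> int:
    """Bitmask with the lowest `num_active` lanes set (bit i = lane i)."""
    if not 0 < num_active <= WARP_SIZE:
        raise ValueError("num_active must be in 1..WARP_SIZE")
    return (1 << num_active) - 1


def shuffle_down(values, mask, delta, width):
    """Emulate shuffle-down: lane i reads the value of lane i + delta.
    A lane whose source is out of range or not in the mask keeps its own value."""
    result = list(values)
    for lane in range(width):
        source = lane + delta
        if source < width and (mask >> source) & 1:
            result[lane] = values[source]
    return result


def warp_reduce_sum(values, num_active):
    """Tree reduction: halve the offset each step; lane 0 ends with the sum."""
    mask = active_mask(num_active)
    lanes = list(values)
    offset = num_active // 2
    while offset > 0:
        shuffled = shuffle_down(lanes, mask, offset, num_active)
        lanes = [own + other for own, other in zip(lanes, shuffled)]
        offset //= 2
    return lanes[0]


full_warp_mask = active_mask(32)
partial_sum = warp_reduce_sum(list(range(1, 9)), 8)
full_sum = warp_reduce_sum(list(range(32)), 32)
```

Only lane 0 is guaranteed to hold the full sum after the loop; the upper lanes accumulate partial values, which is why reductions of this shape read the result from the first lane.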