Ruihang Lai issues

Results: 10 issues by Ruihang Lai

[Translator] Support translating op calls with Tuple input

Previously, when a Relay function contains a Call which directly uses Tuples as arguments (the example below), ``` %25 = (%23, %24) /* ty=(Tensor[(1, 160), float32], Tensor[(1, 160), float32]) */;...

[DISCUSS] Relax Pass Writing Paradigm

**Core infrastructure: ExprVisitor / ExprMutator** - For simple rewriting passes (e.g., FMA-rewrite, CallTIR-rewrite), one can get the post-written Call directly from the old Call. ```cpp class EwiseFMARewriter : public ExprMutator...

Update Android instructions for build from source

This PR updates the Android build-from-source instructions. Some dependency steps were missed previously.

Conversation prompt for WizardLM-13B-V1.2

Hi, I noticed that for the V1.0 version, the 7B and 13B models use different conversation prompts. For the WizardLM-13B-V1.2 model this time, I am wondering what prompt should...

[Serving] Support Gemma for serving

Following #1805, this PR adds support for the Gemma model in MLC Serve. _Tests and examples are still a work in progress._

[Tracking] RoPE scaling support

## Overview This issue tracks the support of RoPE scaling, one important configurable parameter adopted by many new models, in MLC LLM. ## Action Items - [ ] Support linear...

status: tracking

[CMake] FlashInfer bump and cmake updates

This PR bumps the FlashInfer version to support manually configuring which kernels are built in config.cmake. Prior to this PR, the kernels being built were hardcoded in FlashInfer header files.

[Runtime] Fix PagedKVCache for PopN and enhance tests

This PR fixes a bug in the PagedKVCache which may happen when the sequence removal order is not consistent with the reverse order of sequence add/fork order. With this fix,...

[Bench] Support benchmarking for fixed request rates

This PR introduces benchmark support for fixed request rates. Specifically, * We introduced the `AttachRequestRateTimestamp` request processor, which attaches timestamps according to the specified request rate with regard to...

[Fix][TIR] LowerThreadAllreduce warp reduction mask

The warp reduction implemented by the "shuffle down" primitive takes a mask denoting the active threads within the warp that participate in this shuffle. Previously we computed the mask, while in...