Shuo Yang

3 issues by Shuo Yang

Applying a delta currently loads two complete models at the same time, which puts heavy strain on CPU memory. This PR allows us to apply delta...

This PR adds support for a subset of OpenAI API features: completions, embedding creation, and chat completions. With these changes, users will be able to leverage the local LLM...

## Motivation

- Support double sparsity (post-training sparse attention) for long context inference in SGLang
- See [paper](https://arxiv.org/pdf/2408.07092)

## Modifications

- Add triton implementation in `sglang/python/sglang/srt/layers/sparse_decode_attention.py`
- Add serving-related parts...