Shuo Yang issues

Repositories
Issues
Comments

Results 3 issues of


                                            Shuo Yang

[WIP] Reduce peak memory to 8 GB

When we apply delta, we load two complete models at the same time, which puts a lot of strain on the CPU memory. This PR allows us to apply delta...

Support more OpenAI-compatible APIs (embedding, completion)

This PR adds support for a subset of OpenAI API features, including completion, create embeddings, and chat completion. With these changes, users will be able to leverage the local LLM...

[WIP] Support double sparsity

## Motivation - Support double sparsity (post-training sparse attention) for long context inference in SGLang - See [paper](https://arxiv.org/pdf/2408.07092) ## Modifications - Add triton implementation in `sglang/python/sglang/srt/layers/sparse_decode_attention.py` - Add serving-related parts...