Alex Cheema

Results: 117 issues by Alex Cheema

## Describe the bug

When launching an instance of any model with pipeline and RDMA, it gets stuck on WARMING UP.

## To Reproduce

Steps to reproduce the behavior: 1....

bug

- [ ] Basic model support with auto parallel with pipeline
- [ ] Tensor parallel

enhancement

- [ ] Basic model support (auto parallel with pipeline)
- [ ] Tensor parallel

enhancement

Sparkle supports a `` tag in the appcast, and it may contain HTML. We should include our patch notes in each release. See https://sparkle-project.org/documentation/publishing/
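As a rough sketch of what this could look like (version numbers, URLs, and the note contents are placeholders), HTML release notes can be embedded directly in an appcast item via a CDATA block:

```xml
<!-- Hypothetical appcast item with inline HTML release notes -->
<rss version="2.0"
     xmlns:sparkle="http://www.andymatuschak.org/xml-namespaces/sparkle">
  <channel>
    <item>
      <title>Version 1.2.3</title>
      <description><![CDATA[
        <h2>What's new in 1.2.3</h2>
        <ul>
          <li>Example patch note one</li>
          <li>Example patch note two</li>
        </ul>
      ]]></description>
      <enclosure url="https://example.com/Exo-1.2.3.zip"
                 sparkle:version="123"
                 sparkle:shortVersionString="1.2.3" />
    </item>
  </channel>
</rss>
```

Generating this block from each release's changelog at publish time would keep the appcast and the patch notes from drifting apart.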

enhancement

## Motivation

Enable the runner to process multiple concurrent inference requests efficiently. Previously, requests were processed sequentially - one had to complete before the next could start. With continuous batching,...
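A toy sketch of the scheduling idea (not the runner's actual implementation; `Request` and the one-token-per-step decode are stand-ins): finished requests leave the batch immediately, so queued requests join mid-flight instead of waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching scheduler: each step decodes one token for
    every active request; a finished request frees its slot immediately,
    letting a waiting request join on the very next step."""
    waiting = deque(requests)
    active, done, step = [], [], 0
    while waiting or active:
        # admit queued requests whenever a batch slot is free
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        step += 1
        for r in active:
            r.tokens.append(step)  # stand-in for a real decode step
        # retire requests that hit their token budget
        still = [r for r in active if len(r.tokens) < r.max_new_tokens]
        done += [r for r in active if r not in still]
        active = still
    return done
```

With sequential processing, a queued request would wait for the longest request in the batch; here it starts as soon as any slot opens.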

See https://platform.openai.com/docs/api-reference/batch (useful for evals).
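For context, the Batch API takes a JSONL file where each line is one request carrying a `custom_id` for correlating results. A minimal sketch of building that input (the model name and questions are placeholders):

```python
import json

def build_batch_line(custom_id, model, messages):
    """One JSONL line for a Batch API input file: a custom_id to match
    results back to requests, the target endpoint, and the request body."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": model, "messages": messages},
    })

questions = ["2+2?", "Capital of France?"]
lines = [
    build_batch_line(f"eval-{i}", "gpt-4o-mini",
                     [{"role": "user", "content": q}])
    for i, q in enumerate(questions)
]
batch_input = "\n".join(lines)  # write to a .jsonl file, then upload it
```

The file is then uploaded and a batch job created against it; results come back keyed by `custom_id`, which is what makes this convenient for evals.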

enhancement

## Motivation

Add support for Claude Messages API and OpenAI Responses API to allow users to interact with exo using these popular API formats. This enables broader compatibility with existing...

## Motivation

Adds uncertainty visualization to the chat interface, allowing users to see token-level confidence scores and regenerate responses from any point in the generation. This enables users to: -...
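One plausible way to derive the per-token scores (a sketch, not the issue's actual implementation): the sampled token's log-probability maps to a probability in [0, 1], and low values mark positions worth offering a regenerate button for.

```python
import math

def token_confidences(logprobs):
    """Map per-token log-probabilities (of the sampled tokens) to
    confidence scores in [0, 1] via exp(logprob)."""
    return [math.exp(lp) for lp in logprobs]

def uncertain_positions(logprobs, threshold=0.5):
    """Indices where the model was less than `threshold` confident -
    candidate points to regenerate from."""
    return [i for i, c in enumerate(token_confidences(logprobs))
            if c < threshold]
```

The UI could then shade each token by its confidence and attach a regenerate action to the flagged positions.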

## Motivation

Users processing long prompts have no visibility into when token generation will start. This feature adds a progress bar showing prefill progress, giving users real-time feedback during prompt...
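A minimal sketch of the mechanism, assuming the prompt is prefilled in chunks (the chunk size and callback names are illustrative): report the fraction of prompt tokens processed after each chunk so the UI can drive a progress bar.

```python
def prefill_with_progress(prompt_tokens, chunk_size, process_chunk, on_progress):
    """Prefill the prompt in fixed-size chunks, invoking `on_progress`
    with the fraction of tokens processed after each chunk."""
    n = len(prompt_tokens)
    for start in range(0, n, chunk_size):
        process_chunk(prompt_tokens[start:start + chunk_size])
        on_progress(min(start + chunk_size, n) / n)
```

For short prompts the bar jumps straight to 100%; for long prompts the user sees steady movement instead of a silent stall before the first token.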

## Motivation

For thinking models like GLM-4.7, the `` tag is inserted by the tokenizer's `apply_chat_template()` into the **prompt** (input). The model generates tokens starting *after* this tag, so ``...
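A sketch of the consequence (the tag name `<think>` is an assumption based on common thinking models; the original tag is elided above): because the opening tag lives in the prompt, the generated text starts *inside* the thinking block, so a parser must re-attach the tag before splitting reasoning from the final answer.

```python
def split_thinking(generated, open_tag="<think>", close_tag="</think>"):
    """The chat template emits `open_tag` at the end of the prompt, so the
    model's output begins inside the thinking block. Re-prepend the tag,
    then split reasoning from the final answer."""
    text = open_tag + generated
    if close_tag in text:
        thinking, answer = text.split(close_tag, 1)
        return thinking[len(open_tag):].strip(), answer.strip()
    # the model never closed the tag: everything so far is reasoning
    return text[len(open_tag):].strip(), ""
```

A parser that instead searches the raw output for the opening tag would misclassify all of the reasoning as answer text, which is presumably the bug this issue addresses.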