Alex Cheema comments

Results: 404 comments by Alex Cheema

feat: add Claude Messages API and OpenAI Responses API support

Have you had a chance to think about this, @Evanev7? As far as I can tell, that is the bottleneck.

Prepend <think> tag to stream for thinking models like GLM-4.7

Tested manually. GLM now displays correctly with the thinking block; previously there was no thinking block.
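To illustrate the idea in the PR title, here is a minimal sketch of prepending an opening `<think>` tag to a streamed response. It assumes the stream arrives as text chunks and that the model emits only the closing tag; `prepend_think_tag`, `THINK_TAG`, and `is_thinking_model` are hypothetical names for illustration, not taken from the PR.

```python
from typing import Iterable, Iterator

THINK_TAG = "<think>"  # assumed opening tag; the real tag may differ per model


def prepend_think_tag(chunks: Iterable[str], is_thinking_model: bool) -> Iterator[str]:
    """Yield the chunks unchanged, emitting THINK_TAG first when it is missing.

    Hypothetical helper: it only inspects the first chunk, so a tag split
    across several chunks would not be detected.
    """
    iterator = iter(chunks)
    first = next(iterator, None)
    if first is None:
        return  # empty stream: nothing to prepend to
    if is_thinking_model and not first.lstrip().startswith(THINK_TAG):
        yield THINK_TAG
    yield first
    yield from iterator
```

For example, wrapping a stream of `["reasoning", "</think>", "answer"]` for a thinking model would yield the opening tag first, so a client can render the thinking block; streams that already start with the tag pass through unchanged.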

Prepend <think> tag to stream for thinking models like GLM-4.7

## Addressed Reviewer Comments

This commit addresses both reviewer concerns:

### 1. Duplicate `apply_chat_template` call removed

Previously, `apply_chat_template` was called twice:

- Once inside `mlx_generate()` to build the prompt for...

resolve issue #1070

I think the better way to fix this is to auto-select an instance after launching it. If we delete one that is selected, then we should select the most recently...

resolve issue #1070

Tested this change. Launching an instance does *not* select it in the model dropdown.

How to reproduce:

- Launch an instance of Qwen 0.6B 4-bit
- It gets auto-selected (correct)...

resolve issue #1070

- Launch an instance of Qwen 0.6B 4-bit
- It gets auto-selected (correct)
- Chat with Qwen 0.6B 4-bit
- It returns correctly using Qwen 0.6B 4-bit (correct)
- Delete...

Run Sparkle updater in background to make macOS app work offline.

This doesn't seem to fix anything.

fix local network warning

Also please run `nix flake check` and `nix fmt`

feat: add continuous batching for concurrent request processing

Moving back to draft. Needs some further work.

feat: add continuous batching for concurrent request processing

## Code Review - PR #1153: feat: add continuous batching for concurrent request processing

**CI Status**: All checks passing (typecheck, build on aarch64-darwin, x86_64-linux, aarch64-linux).

---

### Overview

This PR...