Alex Cheema

Results: 404 comments by Alex Cheema

Have you had a chance to think about this? As far as I can tell, that is the bottleneck. @Evanev7

Tested manually. GLM now displays correctly with the thinking block; previously there was no thinking block.

## Addressed Reviewer Comments
This commit addresses both reviewer concerns:
### 1. Duplicate `apply_chat_template` call removed
Previously, `apply_chat_template` was called twice:
- Once inside `mlx_generate()` to build the prompt for...

I think the better way to fix this is to auto-select an instance after launching it. If we delete one that is selected, then we should select the most recently...

Tested this change. Launching an instance does *not* select it in the model dropdown. How to reproduce:
- Launch an instance of Qwen 0.6B 4-bit
- It gets auto-selected (correct)...

- Launch an instance of Qwen 0.6B 4-bit
- It gets auto-selected (correct)
- Chat with Qwen 0.6B 4-bit
- It returns correctly using Qwen 0.6B 4-bit (correct)
- Delete...

Also please run `nix flake check` and `nix fmt`

Moving back to draft. Needs some further work.

## Code Review - PR #1153: feat: add continuous batching for concurrent request processing
**CI Status**: All checks passing (typecheck, build on aarch64-darwin, x86_64-linux, aarch64-linux).

---
### Overview
This PR...