exo Fix: Pass missing KV cache constants to generation loop (#1036)

Fixes #1036.

Currently, the memory management constants MAX_KV_SIZE and KEEP_KV_SIZE are defined in constants.py but are never passed to the make_kv_cache function in generate.py. This results in the model defaulting to an unlimited cache size, ignoring the defined safety limits and potentially causing Out-Of-Memory (OOM) errors during long-context generation.

Changes Imports: Imported MAX_KV_SIZE and KEEP_KV_SIZE from exo.worker.engines.mlx.constants into generate.py.

Logic Update: Updated make_kv_cache calls in both warmup_inference and mlx_generate to explicitly pass these constants.

Type Safety: Added an or 0 fallback for keep_kv_size. This resolves a type mismatch where constants.py defines it as int | None but utils_mlx.py expects a strict int.

Why It Works The make_kv_cache function in utils_mlx.py contains logic to initialize a RotatingKVCache (which enforces limits) only when max_kv_size is provided. By passing these arguments, we ensure the engine respects the memory constraints defined in the project constants instead of defaulting to an unbounded KVCache.

Test Plan Manual Testing Hardware: Windows / Code Inspection (No Apple Silicon available locally)

Static Analysis: Verified that the function signature in utils_mlx.py matches the arguments being passed.

Type Checking: Confirmed that KEEP_KV_SIZE or 0 correctly satisfies the type requirement for the keep parameter.

Request for Reviewer: Since I cannot run the mlx runtime locally, I request a reviewer with Apple Silicon to run a generation and verify the logs output: INFO: Using rotating KV cache... (confirming the arguments were received).

Automated Testing N/A - This is a logic fix to ensure existing constants are utilized. Validated via static code analysis.

Jan 02 '26 14:01 aryanyk

On second thought, we don't actually want RotatingKVCache to be the default, in which case we should set these constants to None.

I would like to see some numbers on size of KV cache and when it would cause OOMs before considering making this default.

Jan 04 '26 01:01 AlexCheema

Thanks for the review @AlexCheema! You are absolutely right—forcing a Rotating Cache by default might degrade performance for users who rely on long context, and I don't want to risk regressions without benchmarks.

Jan 31 '26 17:01 aryanyk

Thank you for your contribution! The exo codebase has undergone significant architectural changes since this PR was opened (event sourcing, new placement system, runner rewrite), and this PR now has merge conflicts that would require a substantial rewrite to resolve.

We're closing this to keep the PR list focused, but the idea is still welcome — if you'd like to revisit this, please feel free to open a fresh PR against the current main branch. Thanks again for your time and effort!

Feb 17 '26 17:02 AlexCheema