Batched sampling across concurrent requests (step toward #1)
This PR introduces batched sampling to reduce per-token overhead when multiple requests reach the last shard concurrently.

What
- Add an async batcher in Node to group sampling calls within a short window (default 5ms) or until the batch reaches its maximum size (default 8).
- Stack the logits and sample once for the whole batch; on failure, fall back to per-request sampling.
- Emit per-request token callbacks and forward each sampled token for continued generation, preserving current behavior. (A rough sketch of the batcher is included at the end of this description.)

Why
- Incremental progress toward the full forward-pass batching requested in #1 ([BOUNTY - ] Batched Requests). Sampling is a measurable hotspot and can benefit from batching with minimal risk.

Notes
- No changes to public APIs or the gRPC schema; fully backward compatible.
- Future work: extend batching earlier in the pipeline (prompt encoding and forward passes) by combining per-request caches into batch-aware caches.

Config
- Maximum batch size (default 8)
- Batching window in milliseconds (default 5)

I'm happy to iterate on full tensor-forward batching next (MLX/Tinygrad cache semantics).
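
For reviewers, here is a minimal sketch of the batching idea, not the code in this PR: the names (`SampleBatcher`, `sample_fn`, `max_batch_size`, `window_ms`) and the numpy-based stacking are illustrative assumptions; the actual implementation operates on MLX/Tinygrad tensors inside Node.

```python
# Illustrative sketch only: groups concurrent sampling calls into one batched
# call, flushing after a short window or when the batch is full, with a
# per-request fallback if the batched call fails.
import asyncio
from typing import Callable, List, Tuple

import numpy as np


class SampleBatcher:
    def __init__(self, sample_fn: Callable[[np.ndarray], np.ndarray],
                 max_batch_size: int = 8, window_ms: float = 5.0):
        self.sample_fn = sample_fn            # samples one token id per row of logits
        self.max_batch_size = max_batch_size  # flush when this many requests are queued
        self.window_ms = window_ms            # ...or after this many milliseconds
        self._pending: List[Tuple[np.ndarray, asyncio.Future]] = []
        self._timer = None
        self._lock = asyncio.Lock()

    async def sample(self, logits: np.ndarray) -> int:
        """Queue one request's logits and await its sampled token."""
        fut = asyncio.get_running_loop().create_future()
        async with self._lock:
            self._pending.append((logits, fut))
            if len(self._pending) >= self.max_batch_size:
                self._flush()
            elif self._timer is None:
                self._timer = asyncio.create_task(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        await asyncio.sleep(self.window_ms / 1000)
        async with self._lock:
            self._timer = None
            self._flush()

    def _flush(self):
        # Called with the lock held.
        batch, self._pending = self._pending, []
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if not batch:
            return
        stacked = np.stack([logits for logits, _ in batch])    # (B, vocab)
        try:
            tokens = self.sample_fn(stacked)                   # one call for the whole batch
        except Exception:
            # Fall back to per-request sampling, preserving current behavior.
            tokens = [self.sample_fn(logits[None, :])[0] for logits, _ in batch]
        for (_, fut), tok in zip(batch, tokens):
            if not fut.done():
                fut.set_result(int(tok))
```

In this sketch each request's token loop would simply `await batcher.sample(logits)` instead of sampling directly, so callers see no behavioral change when only one request is active.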