Michał Moskal
Right now we take one "file". We should allow multiple files, to support sub-modules and arguments.
While at it, also measure memory transfer speed and see how many KV entries can be transferred in a single inference round.
Right now (validate this!) the paged attention kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries - probably the...
Investigate what kind of limits the scheduler should enforce - number of tokens, number of KV entries. What latency should it target for a request?
```python
pre = softmax(logits)
logits += bias
post = softmax(logits)
dropped = sum(max(0, pre[i] - post[i]) for i in range(len(post)))
```

if `dropped` is close to 1 we're going against...
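A self-contained sketch of the check above, assuming plain NumPy stand-ins for the actual logits and bias tensors (the `softmax` helper here is an assumption, not the server's implementation):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
bias = np.array([0.0, -100.0, 0.0, 0.0])  # ban the second token

pre = softmax(logits)
post = softmax(logits + bias)
# probability mass the bias removed from the pre-bias distribution
dropped = np.maximum(0.0, pre - post).sum()
```

With a single banned token, `dropped` is just that token's pre-bias probability, since the remaining tokens' probabilities only go up after renormalization.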
vLLM uses the max number of possible forks in a sequence group for scheduling; that max should also be limited.
Right now the seq id returned by aici_host_self_seq_id() and then via the streaming interface is global to the server. This allows someone to figure out how much a server is...
The logits tensor is float16; we use -100 to ban a token. A temperature setting below around `0.0003` causes an overflow and the following crash:

```
  File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
    parent_seq_ids,...
```
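A minimal NumPy sketch of the overflow, under the assumption that the sampler divides logits by the temperature while still in float16:

```python
import numpy as np

# float16 can represent magnitudes only up to ~65504.
# -100 / 0.0003 = -333333, which overflows to -inf in float16.
banned = np.float16(-100.0)
temperature = np.float16(0.0003)
scaled = banned / temperature  # float16 division overflows
```

Once the scaled logit is `-inf`, downstream sampling can hit non-finite probabilities, which would explain the crash in `_sample`.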
Both in ModuleRegistry and Stepper: if entries are unused for too long, just delete them.