Michał Moskal

Results 73 issues of Michał Moskal

Right now we take one "file". We should accept multiple files to allow for sub-modules and arguments

While at it, also measure mem transfer speed and see how many KV entries can be transferred in a single inference round

rLLM

Right now (validate this!) the paged attn kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries - probably the...

rLLM

Investigate what kind of limits the scheduler should enforce - number of tokens, number of KV-entries. What latency should it target for a request?

rLLM

```python
pre = softmax(logits)
logits += bias
post = softmax(logits)
dropped = sum(max(0, pre[i] - post[i]) for i in range(len(post)))
```

if dropped is close to 1 we're going against...
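A self-contained sketch of the same measurement follows, in plain Python with no tensor library. The function names `softmax` and `dropped_mass` and the example values are assumptions made for illustration; they are not taken from the issue.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropped_mass(logits: List[float], bias: List[float]) -> float:
    """Probability mass removed from tokens whose probability the bias lowers.

    Hypothetical helper: sums max(0, pre[i] - post[i]) over all tokens.
    """
    pre = softmax(logits)
    post = softmax([x + b for x, b in zip(logits, bias)])
    return sum(max(0.0, p - q) for p, q in zip(pre, post))


if __name__ == "__main__":
    # Illustrative values: banning the most likely token drops most of the mass.
    print(dropped_mass([2.0, 1.0, 0.0], [-100.0, 0.0, 0.0]))
```

With a zero bias the result is 0; the closer it gets to 1, the more the bias overrides what the model preferred.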

vllm uses the max number of possible forks in a sequence group for scheduling; that max should also be limited

Right now the seq id returned in aici_host_self_seq_id() and then via the streaming interface is global to the server. This allows someone to figure out how much a server is...

The logits tensor is float16, and we use -100 to ban a token. A temperature setting below around `0.0003` causes overflow and the following crash:

```
File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
    parent_seq_ids,...
```
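The arithmetic behind this kind of overflow can be sketched as follows, assuming the logits are divided by the temperature while still in float16 and ignoring rounding near the limit. The helper names and the example logit magnitude of 20 are assumptions for illustration; the issue does not say which value overflows first.

```python
FLOAT16_MAX = 65504.0  # largest finite float16 value (IEEE 754 half precision)


def scaled_logit_fits_float16(logit: float, temperature: float) -> bool:
    """True if logit / temperature stays within the finite float16 range."""
    return abs(logit / temperature) <= FLOAT16_MAX


def min_safe_temperature(max_abs_logit: float) -> float:
    """Smallest temperature that keeps a logit of this magnitude finite."""
    return max_abs_logit / FLOAT16_MAX


if __name__ == "__main__":
    # Assumed example: a logit of magnitude 20 overflows near T = 0.0003.
    print(scaled_logit_fits_float16(20.0, 0.0003))
    # The -100 ban value leaves the finite range at a higher temperature.
    print(min_safe_temperature(100.0))
```

Doing the temperature division in float32, or clamping the temperature to a floor derived from the largest logit magnitude, would be two possible ways to avoid the non-finite values.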

both in ModuleRegistry and Stepper - if unused for too long, just delete them
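One way to express "delete if unused for too long" is a registry that records the last access time of each entry and drops stale ones on a periodic sweep. The sketch below is a generic illustration: `IdleEvictingRegistry`, its methods, and the injected clock are hypothetical and do not reflect the actual ModuleRegistry or Stepper code.

```python
import time
from typing import Callable, Dict, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class IdleEvictingRegistry(Generic[T]):
    """Hypothetical registry that forgets entries not touched recently."""

    def __init__(self, max_idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[T, float]] = {}

    def put(self, key: str, value: T) -> None:
        """Store a value and mark it as used now."""
        self._entries[key] = (value, self._clock())

    def get(self, key: str) -> Optional[T]:
        """Return the value and refresh its last-used time, or None."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, _ = entry
        self._entries[key] = (value, self._clock())
        return value

    def evict_idle(self) -> List[str]:
        """Delete entries idle for longer than the limit; return their keys."""
        now = self._clock()
        stale = [k for k, (_, last) in self._entries.items()
                 if now - last > self._max_idle]
        for k in stale:
            del self._entries[k]
        return stale
```

Calling `evict_idle()` from an existing periodic task keeps the sweep cheap and avoids a dedicated timer thread; whether that fits the server's structure is an open question.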