Michał Moskal
Right now we take one "file". We should allow multiple files, to support sub-modules and arguments.
While at it, also measure memory transfer speed and see how many KV entries can be transferred in a single inference round.
Right now (validate this!) the paged attention kernel doesn't take advantage of the fact that a significant part of the prompt may be shared between many queries - probably the...
Investigate what kind of limits the scheduler should enforce - number of tokens, number of KV entries. What latency should it target for a request?
```python
pre = softmax(logits)
logits += bias
post = softmax(logits)
dropped = sum(max(0, pre[i] - post[i]) for i in range(len(post)))
```

if `dropped` is close to 1 we're going against...
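A self-contained sketch of the check above, assuming plain NumPy stand-ins for the actual logits and bias tensors (the `softmax` helper here is an assumption, not the server's implementation):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
bias = np.array([0.0, -100.0, 0.0, 0.0])  # ban the second token

pre = softmax(logits)
post = softmax(logits + bias)
# probability mass the bias removed from the pre-bias distribution
dropped = np.maximum(0.0, pre - post).sum()
```

With a single banned token, `dropped` is just that token's pre-bias probability, since the remaining tokens' probabilities only go up after renormalization.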
vLLM uses the max number of possible forks in a sequence group for scheduling; that max should also be limited.
Right now the seq id returned by aici_host_self_seq_id() and then via the streaming interface is global to the server. This allows someone to figure out how much a server is...
The logits tensor is float16; we use -100 to ban a token. A temperature setting below around `0.0003` causes an overflow and the following crash:

```
  File "/workspaces/aici/vllm/vllm/model_executor/layers/sampler.py", line 409, in _sample
    parent_seq_ids,...
```
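A minimal NumPy sketch of the overflow, under the assumption that the sampler divides logits by the temperature while still in float16:

```python
import numpy as np

# float16 can represent magnitudes only up to ~65504.
# -100 / 0.0003 = -333333, which overflows to -inf in float16.
banned = np.float16(-100.0)
temperature = np.float16(0.0003)
scaled = banned / temperature  # float16 division overflows
```

Once the scaled logit is `-inf`, downstream sampling can hit non-finite probabilities, which would explain the crash in `_sample`.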
Both in ModuleRegistry and Stepper: if entries are unused for too long, just delete them.