Kirat Pandya
#2813 only covers "same prompt, multiple output", not "multiple prompt, multiple output".
> On a project with a million dependencies and libraries this might be a problem, but as there are no dependencies and it builds on anything, compilation shouldn't pose...
Beyond sampling parameters, the following would be very helpful: 1. Prompt token counts: makes it easier to potentially trim the next request 2. logprobs: extremely useful for scenarios like...
1. Yes. The idea would be to get the actual token counts for prompt and completion (something like this: https://platform.openai.com/docs/api-reference/making-requests) 2. Yes 3. That is fine
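For reference, a minimal sketch of the kind of per-request token accounting being asked for, assuming an OpenAI-style response shape. The `usage` field names (`prompt_tokens`, `completion_tokens`, `total_tokens`) come from the linked OpenAI docs; the rest of the response is abbreviated for illustration:

```python
# Sketch of an OpenAI-style completion response carrying token counts.
# Only the "usage" block matters here; other fields are abbreviated.
response = {
    "choices": [{"text": "...", "index": 0}],
    "usage": {
        "prompt_tokens": 5,       # tokens consumed by the prompt
        "completion_tokens": 7,   # tokens generated in the completion
        "total_tokens": 12,       # sum of the two
    },
}

# A client could inspect prompt_tokens to decide whether to trim
# the next request before it overruns the context window.
print(response["usage"]["prompt_tokens"])
```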
+1 Running into this. We run Docling inside a gRPC server, which requires a ThreadPoolExecutor, so moving to ProcessPoolExecutor is not an option (at least not a straightforward one)
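For context, a minimal sketch of the setup described above: Python's `grpc.server()` takes a `futures.ThreadPoolExecutor`, so request handlers (and any Docling conversion inside them) run on server threads. The servicer class, method name, and generated-stub helper below are illustrative assumptions, not Docling's actual gRPC integration:

```python
from concurrent import futures

import grpc

# Hypothetical servicer for illustration; in a real project this
# would subclass the stub generated from the service's .proto file.
class ConvertServicer:
    def Convert(self, request, context):
        # Docling conversion would run here, on one of the server's
        # worker threads -- grpc.server() only accepts a
        # ThreadPoolExecutor, so there is no straightforward place
        # to swap in a ProcessPoolExecutor.
        ...

def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    # add_ConvertServicer_to_server(ConvertServicer(), server)  # generated helper
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
```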
@ggerganov a bunch of these cool new toys (speculative exec, beam search) seem to be landing either in main or in separate executables in examples. Do you intend to push for...