mistral.rs
Sampling on the GPU for as long as possible
Currently, we apply all sampling:
- Sequentially
- On the CPU
This is super slow. This PR refactors the sampling system to do as much of the sampling work as possible on the GPU, in parallel, until we need to copy the final token and logprobs to the CPU. Only then is the final GPU <> CPU sync done.
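As a rough illustration of the intended shape, here is a minimal sketch in plain Rust. It is not the code in this PR and does not use the candle API: `Logits`, `Sampled`, and `sample_greedy` are hypothetical names, a `Vec<f32>` stands in for a device tensor of shape `[batch, vocab]`, and greedy argmax stands in for the real sampling strategies. The point is only that every step operates on the whole batch at once, and the single per-sequence `(token, logprob)` pair is the only thing handed back to the host.

```rust
/// Hypothetical stand-in for a device tensor of shape [batch, vocab], row-major.
pub struct Logits {
    pub data: Vec<f32>,
    pub vocab: usize,
}

/// The only data copied back to the host: one token id and its logprob per sequence.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Sampled {
    pub token: u32,
    pub logprob: f32,
}

impl Logits {
    /// Batched temperature scaling: one elementwise op over every sequence.
    pub fn scale(mut self, temperature: f32) -> Self {
        for x in self.data.iter_mut() {
            *x /= temperature;
        }
        self
    }

    /// Batched, numerically stable log-softmax along the vocab axis.
    pub fn log_softmax(mut self) -> Self {
        let vocab = self.vocab;
        assert!(vocab > 0, "vocab size must be non-zero");
        for row in self.data.chunks_mut(vocab) {
            let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            let log_sum = row.iter().map(|x| (x - max).exp()).sum::<f32>().ln() + max;
            for x in row.iter_mut() {
                *x -= log_sum;
            }
        }
        self
    }

    /// Batched argmax. In a real implementation this would be the single
    /// device-to-host copy; everything before it stays on the device.
    pub fn argmax_to_host(&self) -> Vec<Sampled> {
        self.data
            .chunks(self.vocab)
            .map(|row| {
                let (token, logprob) = row
                    .iter()
                    .copied()
                    .enumerate()
                    .fold((0usize, f32::NEG_INFINITY), |best, (i, x)| {
                        if x > best.1 { (i, x) } else { best }
                    });
                Sampled { token: token as u32, logprob }
            })
            .collect()
    }
}

/// Greedy sampling for a whole batch, with one host copy at the very end.
pub fn sample_greedy(logits: Logits, temperature: f32) -> Vec<Sampled> {
    logits.scale(temperature).log_softmax().argmax_to_host()
}
```

With a real tensor library, each method would map to one batched device op, so the cost of the host sync no longer grows with the number of sampling steps per sequence.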
Code Metrics Report
===============================================================================
 Language                 Files       Lines        Code    Comments      Blanks
===============================================================================
 C Header                     2          35          28           0           7
 Dockerfile                   1          34          25           0           9
 Happy                        1         442         369           0          73
 JSON                        11         102         101           0           1
 Python                      41        1586        1368          46         172
 TOML                        19         564         498          11          55
-------------------------------------------------------------------------------
 Jupyter Notebooks            2           0           0           0           0
 |- Markdown                  2          77          32          31          14
 |- Python                    2         196         169           1          26
 (Total)                                273         201          32          40
-------------------------------------------------------------------------------
 Markdown                    24        1832           0        1382         450
 |- BASH                      5         101          98           0           3
 |- JSON                      1          12          12           0           0
 |- Python                    5          92          82           0          10
 |- Rust                      6         407         364          19          24
 |- TOML                      2          75          63           0          12
 (Total)                               2519         619        1401         499
-------------------------------------------------------------------------------
 Rust                       168       54909       49845         983        4081
 |- Markdown                 90         850          13         787          50
 (Total)                              55759       49858        1770        4131
===============================================================================
 Total                      270       59504       52234        2422        4848
===============================================================================
This is pending some resolution of huggingface/candle#2361; until then, we still have to do a huge GPU <> CPU sync early.