Eric Buehler
Currently, `QTensor::quantize`:
- Takes a tensor (assume it is on the GPU for this example)
- Copies the data to the CPU
- Quantizes on the CPU
- Copies the...
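The round-trip described above can be sketched as follows. This is a minimal illustration, not the real mistral.rs implementation: the function names, the Q8-style absmax quantization, and the use of `Vec` clones to stand in for device/host copies are all assumptions.

```rust
// Hypothetical sketch of the current (slow) quantize path: the tensor's
// data round-trips through host memory before and after quantization.
fn quantize_q8(host_data: &[f32]) -> (Vec<i8>, f32) {
    // Per-tensor absmax scale, in the style of simple Q8_0 quantization.
    let absmax = host_data.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let q = host_data.iter().map(|v| (v / scale).round() as i8).collect();
    (q, scale)
}

fn quantize_on_cpu(gpu_data: &[f32]) -> Vec<i8> {
    // 1) Copy device -> host (simulated here by a plain copy).
    let host: Vec<f32> = gpu_data.to_vec();
    // 2) Quantize on the CPU.
    let (q, _scale) = quantize_q8(&host);
    // 3) Copy host -> device (again simulated).
    q
}
```

Every step serializes through host memory, which is why doing the quantization on-device would avoid two transfers per tensor.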
@p-e-w, could you please give the implementation a quick check? I'm not sure if you are familiar with Rust, but I ported the algorithm from the oobabooga implementation you linked....
Currently, our messages API is clunky, as we need to support both the older OpenAI format and the new multimodal format (for Idefics and Llava). This is exposed in...
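One way to picture the two shapes is a single internal type that both forms normalize into: the older OpenAI string `content` and the newer list-of-parts `content`. All names below are illustrative, not the real mistral.rs types.

```rust
// Hypothetical unified message representation.
#[derive(Debug, PartialEq)]
enum ContentPart {
    Text(String),
    ImageUrl(String),
}

#[derive(Debug, PartialEq)]
struct Message {
    role: String,
    parts: Vec<ContentPart>,
}

// Normalize the old string form into the multimodal form so the rest of
// the pipeline only ever handles one representation.
fn from_text(role: &str, text: &str) -> Message {
    Message {
        role: role.to_string(),
        parts: vec![ContentPart::Text(text.to_string())],
    }
}
```

With a scheme like this, the legacy format becomes a one-element `parts` list and the multimodal path needs no special casing.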
With the recent advent of large models (take Llama 3.1 405b, for example!), distributed inference support is a must! We currently support naive device mapping, which works by allowing a...
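The naive device mapping mentioned above amounts to partitioning the model's layers across devices. A minimal sketch, assuming an even split with earlier devices absorbing the remainder (the function name and policy are illustrative, not the mistral.rs implementation):

```rust
// Split `n_layers` decoder layers across `n_devices` as evenly as
// possible; earlier devices take one extra layer when it doesn't divide.
fn map_layers(n_layers: usize, n_devices: usize) -> Vec<usize> {
    let base = n_layers / n_devices;
    let rem = n_layers % n_devices;
    (0..n_devices)
        .map(|d| base + if d < rem { 1 } else { 0 })
        .collect()
}
```

For a 405b-class model this kind of static split is only a starting point; true distributed inference also needs to move activations between devices at each boundary.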
Currently, we apply all sampling:
- Sequentially
- On the CPU

This is super slow. This PR is going to refactor the sampling system to do as much sampling work...
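The contrast with the sequential, per-sequence approach can be sketched as a single batched pass over all sequences' logits at once. This is an illustrative greedy (argmax) sampler over a flattened `[batch, vocab]` buffer, not the actual PR's code:

```rust
// One sweep over the whole batch instead of sampling each sequence in turn.
fn sample_greedy_batched(logits: &[f32], vocab: usize) -> Vec<usize> {
    logits
        .chunks(vocab)
        .map(|row| {
            row.iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(i, _)| i)
                .unwrap()
        })
        .collect()
}
```

A batched formulation like this is also what makes it natural to keep the work on the GPU, since the whole batch reduces in one kernel launch rather than one host-side loop iteration per sequence.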
Refs #555. @KaQuMiQ I added some debug statements to get a better picture of what's going on. Can you please install from source: (assuming you have Rust installed, which I...