Eric Buehler issues

Results: 136 issues of Eric Buehler

Accelerate quantization with Marlin kernel or HQQ

The [Marlin INT4xFP16](https://github.com/IST-DASLab/marlin) CUDA matmul kernel can achieve ~4x speed improvement over CUTLASS matmul. See also: [hqq](https://github.com/mobiusml/hqq/) as a quantization method which supports Marlin and other optimized kernels, without calibration...

new feature

optimization

Synchronize device when mixtral expert splitting

Refs #352.

Add the T5 seq2seq model

This PR implements our first Seq2Seq model, T5. Refs #384.

Support loading tokenizer from `sentencepiece` model

Currently, if a sentencepiece `.model` file is provided, the user must run a provided script to convert into the equivalent `tokenizer.json`. By supporting `sentencepiece` models directly, we can avoid this...

new feature

Add tracking of memory usage

This will implement memory usage tracking. This will be used for #377.

- [x] CPU
- [ ] CUDA: https://docs.rs/cudarc/latest/cudarc/driver/result/fn.mem_get_info.html
- [ ] Metal

Store and load prefix cache on disk

This PR enables storing and then restoring the model-specific prefix cache on disk. The intended use case, paired with #350, is to accelerate few-shot learning use cases by allowing a...

optimization

Allow subsets of sequences in prefix cacher

Refs #347.

new feature

optimization

CUDA_ERROR_NO_DEVICE "no CUDA-capable device is detected"

Saving Phi 3 vision fails due to tensor sharing

Documentation and optimization of X-LoRA