Silver267

12 comments by Silver267

According to my tests, llama.cpp with 4-bit quantization is much faster than GPU + RAM offloading, at least for the 7B model. However, since this is a cpp...

> Please make 1 issue per suggestion in the future, it is overwhelming to deal with long lists of vague suggestions.

Okay, I'll do that in the future.

> What...

Update: converting PyTorch checkpoints to safetensors is implemented [here](https://github.com/Silver267/pytorch-to-safetensor-converter)

Since most of the features I originally requested have been implemented, or partially implemented as far as possible, I think it's the right time to close this issue.

bitsandbytes currently does not support Windows, but there are some workarounds. This is one of them: https://github.com/TimDettmers/bitsandbytes/issues/30

Just saying, but I made a PyTorch .bin-to-safetensors converter that runs locally, based on [this](https://huggingface.co/spaces/safetensors/convert), in case anyone is interested: [pytorch-to-safetensor-converter](https://github.com/Silver267/pytorch-to-safetensor-converter)
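For reference, the core of such a conversion is small. Below is a minimal sketch (not the converter's actual code) of re-saving a PyTorch state dict as safetensors, assuming `torch` and `safetensors` are installed and using the placeholder paths `pytorch_model.bin` / `model.safetensors`:

```python
import torch
from safetensors.torch import save_file

# Load the pickled PyTorch state dict on CPU (no GPU needed for conversion).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Some checkpoints wrap the weights in a "state_dict" key; unwrap if present.
if "state_dict" in state_dict:
    state_dict = state_dict["state_dict"]

# safetensors refuses tensors that share memory, so clone everything into
# independent, contiguous copies before saving.
state_dict = {name: tensor.contiguous().clone() for name, tensor in state_dict.items()}

# Write all tensors to a single .safetensors file.
save_file(state_dict, "model.safetensors")
```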

@81300 Thanks for the information! Though the code doesn't seem to support RAM offload (my VRAM is 8 GB), it would still be a useful reference.

Since ZeRO inference is implemented and seems to be working, closing this issue. Please open another issue if there are other problems.

Suggestion: When running inference of LLaMA 13B using this branch, I encountered an OOM issue when running the command `python server.py --cai-chat --auto-devices --gpu-memory "3"`, which never occurred using the main...
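For context, `--gpu-memory` in the webui presumably corresponds to the per-device memory cap that Hugging Face Accelerate uses when splitting a model between GPU and CPU RAM. A minimal sketch of that mechanism (the model ID and memory limits are illustrative placeholders, not taken from the issue) might look like:

```python
import torch
from transformers import AutoModelForCausalLM

# Cap GPU 0 at ~3 GiB and spill the remaining weights to CPU RAM.
# Requires the `accelerate` package; model ID and limits are placeholders.
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",
    device_map="auto",                        # let accelerate place layers on GPU/CPU
    max_memory={0: "3GiB", "cpu": "24GiB"},   # per-device memory budget
    torch_dtype=torch.float16,
)
```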