Silver267

12 comments by Silver267

According to my tests, llama.cpp with 4-bit quantization is much faster than GPU + RAM offloading, at least for the 7B model. However, since this is a cpp...

> Please make 1 issue per suggestion in the future, it is overwhelming to deal with long lists of vague suggestions.

Okay, I'll do that in the future.

> What...

Update: converting PyTorch checkpoints to safetensors is implemented [here](https://github.com/Silver267/pytorch-to-safetensor-converter)

Since most of the features I originally requested have been implemented, or partially implemented as far as possible, I think it's the right time to close this issue.

bitsandbytes currently does not support Windows, but there are some workarounds. This is one of them: https://github.com/TimDettmers/bitsandbytes/issues/30

Just saying, but I made a PyTorch .bin-to-safetensors converter that runs locally, based on [this](https://huggingface.co/spaces/safetensors/convert), in case anyone is interested: [pytorch-to-safetensor-converter](https://github.com/Silver267/pytorch-to-safetensor-converter)
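For reference, the core of such a conversion is small. Below is a minimal sketch (not the converter's actual code) of re-saving a PyTorch state dict as safetensors, assuming `torch` and `safetensors` are installed and using the placeholder paths `pytorch_model.bin` / `model.safetensors`:

```python
import torch
from safetensors.torch import save_file

# Load the pickled PyTorch state dict on CPU (no GPU needed for conversion).
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Some checkpoints wrap the weights in a "state_dict" key; unwrap if present.
if "state_dict" in state_dict:
    state_dict = state_dict["state_dict"]

# safetensors refuses tensors that share memory, so clone everything into
# independent, contiguous copies before saving.
state_dict = {name: tensor.contiguous().clone() for name, tensor in state_dict.items()}

# Write all tensors to a single .safetensors file.
save_file(state_dict, "model.safetensors")
```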

@81300 Thanks for the information! Though the code doesn't seem to support RAM offload (my VRAM is 8 GB), it would still be a useful reference.

Since ZeRO inference is implemented and seems to be working, closing this issue. Please open another issue if there are other problems.

Suggestion: When running inference of LLaMA 13B using this branch, I encountered an OOM issue when running the command `python server.py --cai-chat --auto-devices --gpu-memory "3"`, which never occurred using the main...
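For context, `--gpu-memory` in the webui presumably corresponds to the per-device memory cap that Hugging Face Accelerate uses when splitting a model between GPU and CPU RAM. A minimal sketch of that mechanism (the model ID and memory limits are illustrative placeholders, not taken from the issue) might look like:

```python
import torch
from transformers import AutoModelForCausalLM

# Cap GPU 0 at ~3 GiB and spill the remaining weights to CPU RAM.
# Requires the `accelerate` package; model ID and limits are placeholders.
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",
    device_map="auto",                        # let accelerate place layers on GPU/CPU
    max_memory={0: "3GiB", "cpu": "24GiB"},   # per-device memory budget
    torch_dtype=torch.float16,
)
```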