llama.cpp
Converting GGML back to a Torch checkpoint for HuggingFace/PyTorch consumption/training/finetuning
Here's a PR that converts a model written in the GGML format back to a Torch checkpoint for HuggingFace/PyTorch consumption/training/finetuning. Mentioned in issue https://github.com/ggerganov/llama.cpp/issues/359
Also included is the ability to use HF's transformers to load the Torch model and open up a chat (-c). The model's generation is stopped on a newline character, so beware of what you are asking 😄.
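For reference, here's a minimal sketch of the transformers side of that, assuming the converter's output loads as a standard HF LLaMA directory; the "converted-llama-hf" path, the prompt, and the newline stopping criterion below are illustrative placeholders rather than the PR's exact chat code:

```python
import torch
from transformers import (
    LlamaForCausalLM,
    LlamaTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class StopOnNewline(StoppingCriteria):
    """Stop generation as soon as the model emits a newline token."""
    def __init__(self, tokenizer):
        self.newline_id = tokenizer.encode("\n", add_special_tokens=False)[-1]

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids[0, -1].item() == self.newline_id

# "converted-llama-hf" is a placeholder for the directory the converter writes.
tokenizer = LlamaTokenizer.from_pretrained("converted-llama-hf")
model = LlamaForCausalLM.from_pretrained("converted-llama-hf", torch_dtype=torch.float16)

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=64,
    stopping_criteria=StoppingCriteriaList([StopOnNewline(tokenizer)]),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```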
Wow, fantastic. Thank you for this contribution.
I converted a 30B 4-bit ggml model https://huggingface.co/Pi3141/alpaca-30B-ggml/tree/main back to PyTorch (HF), but the resulting file was 65 GB instead of about 20 GB.
Is it possible for a 4-bit ggml model to be directly converted to a 4-bit PyTorch model? I'm attempting to quantize the 65 GB model back to 4-bit, but I'm concerned that quantizing it a second time will further degrade it.
I don't think PyTorch or HF have the ability to run using 4-bit (float) weights. So this file is mainly for you to get float16 weights back so that you can use them with other PyTorch libraries, or for training/finetuning with PyTorch Lightning or HF's transformers.
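If you just want a quick sanity check that the conversion worked (a sketch; the filename below is a placeholder for whatever the script writes), plain torch.load is enough:

```python
import torch

# Placeholder filename; substitute the checkpoint the conversion script produced.
state_dict = torch.load("consolidated.00.pth", map_location="cpu")

for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)  # expect torch.float16 weights
```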
Ah, you're right.
There are 4-bit models (https://huggingface.co/decapoda-research/llama-13b-hf-int4) in HF format, I think, and you can run them using textgen-webui (https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model), but it does require additional setup to work. I assumed that the conversion from 4-bit ggml to 4-bit HF (GPTQ) format would be similar/straightforward, but I don't know much about this and could easily be wrong.
Well, I suppose they quantize the weights to 4-bit and then save them as 4-bit, which you can do easily with a bit of modification to my code. However, at inference, they need "special support code" to dequantize back to 16 or 32 bits. Basically, it saves a bit of disk space, but it won't save memory. (And I could be wrong on this.)
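Roughly what I mean, as a toy sketch of my own (not GPTQ's actual packing or kernels): the weights are stored as small integers plus a per-block scale, and inference still has to expand them back to float before the matmuls:

```python
import torch

def quantize_q4(w: torch.Tensor, block: int = 32):
    """Crude symmetric 4-bit quantization: integers in [-8, 7] plus a per-block scale."""
    blocks = w.reshape(-1, block)
    scale = blocks.abs().max(dim=1, keepdim=True).values / 7.0
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)  # avoid divide-by-zero
    q = torch.clamp(torch.round(blocks / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_q4(q: torch.Tensor, scale: torch.Tensor, shape):
    """What the 'special support code' has to do at inference time."""
    return (q.float() * scale).reshape(shape).half()

w = torch.randn(4096, 4096)
q, scale = quantize_q4(w)
w_hat = dequantize_q4(q, scale, w.shape)
print("mean abs error after round-trip:", (w - w_hat.float()).abs().mean().item())
```

(The int8 tensor above wastes half its bits; real implementations pack two 4-bit values per byte, which is where the disk savings come from.)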
The .pth files == ggml-f16; both contain the full information. If you have a quantized .pt or ggml-q4_0 / q4_1, the full information is already lost, so you can't transform it back to the unquantized version. I mean you can, but it's like saving a full-size BMP from a compressed JPG file and makes no sense.
So you can convert pth <-> ggml-f16; they contain the same information. Same with a 4-bit quantized .pt <-> ggml-q4; they are in essence the same thing, except the quantizing algorithm used can make a difference, of course. And you can quantize pth -> pt or ggml-f16 -> ggml-q4, but you cannot "un-quantize", so going from pt -> pth or ggml-q4 -> ggml-f16 is not something you should do even if you technically could.
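A tiny toy illustration of that asymmetry (my own example, not llama.cpp's quantizer): f16 <-> f32 round-trips exactly, while a 4-bit round-trip only approximates the original values:

```python
import torch

w16 = torch.randn(8, dtype=torch.float16)
w32 = w16.float()
print(torch.equal(w32.half(), w16))            # True: f16 -> f32 -> f16 is lossless

scale = w32.abs().max() / 7.0
q4 = torch.clamp(torch.round(w32 / scale), -8, 7)   # crude 4-bit grid
print(torch.equal((q4 * scale).half(), w16))   # virtually always False: only an approximation
```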
@anzz1 Thank you for your comment. However, what if you want to study the effect of finetuning on quantized models? Or simply want to look at the distribution of weights of a particular layer before/after quantization? I agree that from the point of view of a normal user it's not useful, but for researchers or people who want to understand the effect of different quantization methods, I believe this can be very helpful.
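For example, here's a quick way to eyeball one layer's weight statistics before and after a crude 4-bit round-trip; the checkpoint filename, layer name, and quantization scheme are illustrative placeholders, not the PR's code:

```python
import torch

# Placeholders: substitute the converted checkpoint and the layer you care about.
state_dict = torch.load("consolidated.00.pth", map_location="cpu")
w = state_dict["layers.0.attention.wq.weight"].float()

scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale  # quantize + dequantize per row

for label, t in [("original", w), ("q4 round-trip", w_q)]:
    print(f"{label:>14}: mean={t.mean():+.5f}  std={t.std():.5f}  "
          f"min={t.min():+.4f}  max={t.max():+.4f}")
print("mean abs error:", (w - w_q).abs().mean().item())
```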
Sure thing, I definitely agree. The comment was just for information. And the ability to losslessly go between f32/f16 <-> pytorch bin is definitely a good idea so you don't have to store both. These models do take up quite a bit of space when you start to collect more of them :smile:
@anzz1 @ggerganov Any idea how I can get this PR reviewed/accepted? I am willing to put in more work to make it run correctly and smoothly.
@ductai199x I usually look at all PRs, but sometimes it can take a while. I'll merge this and have also invited you as a collaborator.
@ggerganov any reason why this was removed from main?
I think it's because some time ago there were lots and lots of breaking changes to the implementation and the old code couldn't keep up. Maybe once everything stabilizes a bit we should add this capability back?
I see. This feature is extremely useful nowadays, but right now I don't have enough bandwidth to contribute a fix.