
Failed at model.save_pretrained_gguf

Open ch-tseng opened this issue 10 months ago • 2 comments

I used the model https://huggingface.co/taide/TAIDE-LX-7B-Chat for fine-tuning, but I always get this error. Training works fine, but model.save_pretrained_gguf fails.
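For context, this is roughly the flow I'm running (a sketch only -- the LoRA and trainer setup follow the standard Unsloth notebook and are omitted, and max_seq_length here is just an example value). The failing call at the end is taken from the traceback below:

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit (the log below shows the 4-bit + LoRA merge step).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "taide/TAIDE-LX-7B-Chat",
    max_seq_length = 2048,   # example value only
    load_in_4bit = True,
)

# ... LoRA setup and SFT training as in the standard Unsloth notebook (this part works fine) ...

# This is the call that fails:
model.save_pretrained_gguf("ch_taide_medicine.gguf", tokenizer, quantization_method = "quantized")
```

The full output and traceback: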

==((====))== Unsloth: Fast Llama patching release 2024.4 \ /| GPU: NVIDIA GeForce RTX 3090. Max memory: 23.691 GB. Platform = Linux. O^O/ _/ \ Pytorch: 2.2.2+cu121. CUDA = 8.6. CUDA Toolkit = 12.1. \ / Bfloat16 = TRUE. Xformers = 0.0.25.post1. FA = False. "--" Free Apache license: http://github.com/unslothai/unsloth Loading checkpoint shards: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 3/3 [00:03<00:00, 1.02s/it] You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers. Setting pad_token_id to eos_token_id:2 for open-end generation. ['Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n\n\n### Input:\nไธŠไธŠ็ฆฎๆ‹œๆŒ็บŒๅ‡บ็พ้ ญ็—›ใ€ๅ™ๅฟƒใ€้ ญๆšˆ็š„็—‡็‹€๏ผŒๆœ‰ๆ™‚็กไธ€ไธ‹่ตทไพ†้‚„ๆ˜ฏๆฒ’ๆœ‰็ทฉ่งฃ๏ผŒๅคงๆฆ‚้ƒฝ็—›ๅœจๅคช้™ฝ็ฉดไธŠ้ขไธ€้ปž๏ผŒๆœ‰ๆ™‚็—›ๅœจ้ ญ่…ฆๅ‹บ๏ผˆ่ผƒๅฐ‘๏ผ‰๏ผŒ่บบ่‘—่ตทไพ†้ ญๆšˆ้ ป็Ž‡่ถŠไพ†่ถŠ้ซ˜๏ผˆๆœฌ่บซๆœ‰่ฒง่ก€๏ผŒไฝ†่ฟ‘ๆœŸๅช่ฆๅงฟๅ‹ขไธ€่ฝ‰ๆ›ๅฐฑๆœƒ้ ญๆšˆ็œผๅ‰ๆŽฅ่ฟ‘้ป‘่‰ฒ๏ผ‰๏ผŒๅฎนๆ˜“็–ฒ็ดฏ๏ผŒๆƒณๅ•ไธ€ไธ‹้€™ไบ›็—‡็‹€ๆœ‰้œ€่ฆๅˆฐ้†ซ้™ขๅŽปๆชขๆŸฅๅ—Ž๏ผŸ\n\n### Response:\n\nๆ‚จๅฅฝ๏ผš\nๆ นๆ“šๆ‚จ็š„ๆ่ฟฐ๏ผŒๆ‚จๅฏ่ƒฝๆœ‰ไปฅไธ‹ๅนพ็จฎๅฏ่ƒฝ็š„ๅŽŸๅ› ๏ผš\n1. ่ฒง่ก€๏ผš่ฒง่ก€ๆ˜ฏๅธธ่ฆ‹็š„ๅ•้กŒ๏ผŒ่‹ฅๆฒ’ๆœ‰ๅฎšๆœŸๆชขๆŸฅ๏ผŒๅฏ่ƒฝๆœƒๅฐŽ่‡ด้ ญๆšˆใ€้ ญ็—›ใ€็–ฒ็ดฏ็ญ‰็—‡็‹€ใ€‚\n2. ๅ…ง่€ณๅ•้กŒ๏ผšๅ…ง่€ณๆœ‰ๅนณ่กกๅ™จๅฎ˜๏ผŒ่‹ฅๅ…ง่€ณๆœ‰ๅ•้กŒ๏ผŒๅฏ่ƒฝๆœƒๅฐŽ่‡ด้ ญๆšˆใ€้ ญ็—›ใ€ๅ™ๅฟƒ็ญ‰็—‡็‹€ใ€‚\n3. ๅ…ถไป–็–พ็—…๏ผšๅฆ‚็”ฒ็‹€่…บ็–พ็—…ใ€ๅฟƒ่‡Ÿ็–พ็—…ใ€็ณ–ๅฐฟ็—…ใ€้ซ˜่ก€ๅฃ“็ญ‰๏ผŒ้ƒฝๅฏ่ƒฝๆœƒๅผ•่ตท้ ญๆšˆใ€้ ญ็—›ใ€ๅ™ๅฟƒ็ญ‰็—‡็‹€ใ€‚\nๅปบ่ญฐๆ‚จๅ‰ๅพ€้†ซ้™ข๏ผŒ่ฎ“้†ซๅธซ็‚บๆ‚จๅš่ฉณ็ดฐ็š„ๆชขๆŸฅ๏ผŒไปฅ็ขบๅฎš็—…ๅ› ๏ผŒไธฆๆŽฅๅ—้ฉ็•ถ็š„ๆฒป็™‚ใ€‚\n็ฅๅฅๅบท๏ผ '] Unsloth: Merging 4bit and LoRA weights to 16bit... Unsloth: Will use up to 46.9 out of 62.57 RAM for saving. 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 32/32 [00:00<00:00, 90.79it/s] Unsloth: Saving tokenizer... Done. Unsloth: Saving model... This might take 5 minutes for Llama-7b... Done. Unsloth: Converting llama model. Can use fast conversion = True. ==((====))== Unsloth: Conversion from QLoRA to GGUF information \ /| [0] Installing llama.cpp will take 3 minutes. O^O/ _/ \ [1] Converting HF to GUUF 16bits will take 3 minutes. \ / [2] Converting GGUF 16bits to q4_k_m will take 20 minutes. "--" In total, you will have to wait around 26 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes... Unsloth: [1] Converting model at ch_taide_medicine.gguf into f16 GGUF format. The output location will be ./ch_taide_medicine.gguf-unsloth.F16.gguf This will take 3 minutes... Loading model file ch_taide_medicine.gguf/model-00001-of-00003.safetensors Loading model file ch_taide_medicine.gguf/model-00001-of-00003.safetensors Loading model file ch_taide_medicine.gguf/model-00002-of-00003.safetensors Loading model file ch_taide_medicine.gguf/model-00003-of-00003.safetensors params = Params(n_vocab=56064, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=10000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('ch_taide_medicine.gguf')) Loaded vocab file PosixPath('ch_taide_medicine.gguf/tokenizer.json'), type 'hfft' Vocab info: <LlamaHfVocab with 56020 base tokens and 0 added tokens> Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 32000}, add special tokens {'bos': True, 'eos': False}> Permuting layer 0 Permuting layer 1 Permuting layer 2 Permuting layer 3 Permuting layer 4 Permuting layer 5 Permuting layer 6 Permuting layer 7 Permuting layer 8 Permuting layer 9 Permuting layer 10 Permuting layer 11 Permuting layer 12 Permuting layer 13 Permuting layer 14 Permuting layer 15 Permuting layer 16 Permuting layer 17 Permuting layer 18 Permuting layer 19 Permuting layer 20 Permuting layer 21 Permuting layer 22 Permuting layer 23 Permuting layer 24 Permuting layer 25 Permuting layer 26 Permuting layer 27 Permuting layer 28 Permuting layer 29 Permuting layer 30 Permuting layer 31 model.embed_tokens.weight -> token_embd.weight | BF16 | [56064, 4096] model.layers.0.input_layernorm.weight -> blk.0.attn_norm.weight | BF16 | [4096] model.layers.0.mlp.down_proj.weight -> blk.0.ffn_down.weight | BF16 | [4096, 11008] model.layers.0.mlp.gate_proj.weight -> blk.0.ffn_gate.weight | BF16 | [11008, 4096] model.layers.0.mlp.up_proj.weight -> blk.0.ffn_up.weight | BF16 | [11008, 4096] model.layers.0.post_attention_layernorm.weight -> blk.0.ffn_norm.weight | BF16 | [4096] model.layers.0.self_attn.k_proj.weight -> blk.0.attn_k.weight | BF16 | [4096, 4096] model.layers.0.self_attn.o_proj.weight -> blk.0.attn_output.weight | BF16 | [4096, 4096] model.layers.0.self_attn.q_proj.weight -> blk.0.attn_q.weight | BF16 | [4096, 4096] model.layers.0.self_attn.v_proj.weight -> blk.0.attn_v.weight | BF16 | [4096, 4096] model.layers.1.input_layernorm.weight -> blk.1.attn_norm.weight | BF16 | [4096] model.layers.1.mlp.down_proj.weight -> blk.1.ffn_down.weight | BF16 | [4096, 11008] model.layers.1.mlp.gate_proj.weight -> blk.1.ffn_gate.weight | BF16 | [11008, 4096] model.layers.1.mlp.up_proj.weight -> blk.1.ffn_up.weight | BF16 | [11008, 4096] model.layers.1.post_attention_layernorm.weight -> blk.1.ffn_norm.weight | BF16 | [4096] model.layers.1.self_attn.k_proj.weight -> blk.1.attn_k.weight | BF16 | [4096, 4096] model.layers.1.self_attn.o_proj.weight -> blk.1.attn_output.weight | BF16 | [4096, 4096] model.layers.1.self_attn.q_proj.weight -> blk.1.attn_q.weight | BF16 | [4096, 4096] model.layers.1.self_attn.v_proj.weight -> blk.1.attn_v.weight | BF16 | [4096, 4096] model.layers.10.input_layernorm.weight -> blk.10.attn_norm.weight | BF16 | [4096] model.layers.10.mlp.down_proj.weight -> blk.10.ffn_down.weight | BF16 | [4096, 11008] 
model.layers.10.mlp.gate_proj.weight -> blk.10.ffn_gate.weight | BF16 | [11008, 4096] model.layers.10.mlp.up_proj.weight -> blk.10.ffn_up.weight | BF16 | [11008, 4096] model.layers.10.post_attention_layernorm.weight -> blk.10.ffn_norm.weight | BF16 | [4096] model.layers.10.self_attn.k_proj.weight -> blk.10.attn_k.weight | BF16 | [4096, 4096] model.layers.10.self_attn.o_proj.weight -> blk.10.attn_output.weight | BF16 | [4096, 4096] model.layers.10.self_attn.q_proj.weight -> blk.10.attn_q.weight | BF16 | [4096, 4096] model.layers.10.self_attn.v_proj.weight -> blk.10.attn_v.weight | BF16 | [4096, 4096] model.layers.11.self_attn.k_proj.weight -> blk.11.attn_k.weight | BF16 | [4096, 4096] model.layers.11.self_attn.q_proj.weight -> blk.11.attn_q.weight | BF16 | [4096, 4096] model.layers.2.input_layernorm.weight -> blk.2.attn_norm.weight | BF16 | [4096] model.layers.2.mlp.down_proj.weight -> blk.2.ffn_down.weight | BF16 | [4096, 11008] model.layers.2.mlp.gate_proj.weight -> blk.2.ffn_gate.weight | BF16 | [11008, 4096] model.layers.2.mlp.up_proj.weight -> blk.2.ffn_up.weight | BF16 | [11008, 4096] model.layers.2.post_attention_layernorm.weight -> blk.2.ffn_norm.weight | BF16 | [4096] model.layers.2.self_attn.k_proj.weight -> blk.2.attn_k.weight | BF16 | [4096, 4096] model.layers.2.self_attn.o_proj.weight -> blk.2.attn_output.weight | BF16 | [4096, 4096] model.layers.2.self_attn.q_proj.weight -> blk.2.attn_q.weight | BF16 | [4096, 4096] model.layers.2.self_attn.v_proj.weight -> blk.2.attn_v.weight | BF16 | [4096, 4096] model.layers.3.input_layernorm.weight -> blk.3.attn_norm.weight | BF16 | [4096] model.layers.3.mlp.down_proj.weight -> blk.3.ffn_down.weight | BF16 | [4096, 11008] model.layers.3.mlp.gate_proj.weight -> blk.3.ffn_gate.weight | BF16 | [11008, 4096] model.layers.3.mlp.up_proj.weight -> blk.3.ffn_up.weight | BF16 | [11008, 4096] model.layers.3.post_attention_layernorm.weight -> blk.3.ffn_norm.weight | BF16 | [4096] model.layers.3.self_attn.k_proj.weight -> blk.3.attn_k.weight | BF16 | [4096, 4096] model.layers.3.self_attn.o_proj.weight -> blk.3.attn_output.weight | BF16 | [4096, 4096] model.layers.3.self_attn.q_proj.weight -> blk.3.attn_q.weight | BF16 | [4096, 4096] model.layers.3.self_attn.v_proj.weight -> blk.3.attn_v.weight | BF16 | [4096, 4096] model.layers.4.input_layernorm.weight -> blk.4.attn_norm.weight | BF16 | [4096] model.layers.4.mlp.down_proj.weight -> blk.4.ffn_down.weight | BF16 | [4096, 11008] model.layers.4.mlp.gate_proj.weight -> blk.4.ffn_gate.weight | BF16 | [11008, 4096] model.layers.4.mlp.up_proj.weight -> blk.4.ffn_up.weight | BF16 | [11008, 4096] model.layers.4.post_attention_layernorm.weight -> blk.4.ffn_norm.weight | BF16 | [4096] model.layers.4.self_attn.k_proj.weight -> blk.4.attn_k.weight | BF16 | [4096, 4096] model.layers.4.self_attn.o_proj.weight -> blk.4.attn_output.weight | BF16 | [4096, 4096] model.layers.4.self_attn.q_proj.weight -> blk.4.attn_q.weight | BF16 | [4096, 4096] model.layers.4.self_attn.v_proj.weight -> blk.4.attn_v.weight | BF16 | [4096, 4096] model.layers.5.input_layernorm.weight -> blk.5.attn_norm.weight | BF16 | [4096] model.layers.5.mlp.down_proj.weight -> blk.5.ffn_down.weight | BF16 | [4096, 11008] model.layers.5.mlp.gate_proj.weight -> blk.5.ffn_gate.weight | BF16 | [11008, 4096] model.layers.5.mlp.up_proj.weight -> blk.5.ffn_up.weight | BF16 | [11008, 4096] model.layers.5.post_attention_layernorm.weight -> blk.5.ffn_norm.weight | BF16 | [4096] model.layers.5.self_attn.k_proj.weight -> blk.5.attn_k.weight | BF16 | [4096, 4096] 
model.layers.5.self_attn.o_proj.weight -> blk.5.attn_output.weight | BF16 | [4096, 4096] model.layers.5.self_attn.q_proj.weight -> blk.5.attn_q.weight | BF16 | [4096, 4096] model.layers.5.self_attn.v_proj.weight -> blk.5.attn_v.weight | BF16 | [4096, 4096] model.layers.6.input_layernorm.weight -> blk.6.attn_norm.weight | BF16 | [4096] model.layers.6.mlp.down_proj.weight -> blk.6.ffn_down.weight | BF16 | [4096, 11008] model.layers.6.mlp.gate_proj.weight -> blk.6.ffn_gate.weight | BF16 | [11008, 4096] model.layers.6.mlp.up_proj.weight -> blk.6.ffn_up.weight | BF16 | [11008, 4096] model.layers.6.post_attention_layernorm.weight -> blk.6.ffn_norm.weight | BF16 | [4096] model.layers.6.self_attn.k_proj.weight -> blk.6.attn_k.weight | BF16 | [4096, 4096] model.layers.6.self_attn.o_proj.weight -> blk.6.attn_output.weight | BF16 | [4096, 4096] model.layers.6.self_attn.q_proj.weight -> blk.6.attn_q.weight | BF16 | [4096, 4096] model.layers.6.self_attn.v_proj.weight -> blk.6.attn_v.weight | BF16 | [4096, 4096] model.layers.7.input_layernorm.weight -> blk.7.attn_norm.weight | BF16 | [4096] model.layers.7.mlp.down_proj.weight -> blk.7.ffn_down.weight | BF16 | [4096, 11008] model.layers.7.mlp.gate_proj.weight -> blk.7.ffn_gate.weight | BF16 | [11008, 4096] model.layers.7.mlp.up_proj.weight -> blk.7.ffn_up.weight | BF16 | [11008, 4096] model.layers.7.post_attention_layernorm.weight -> blk.7.ffn_norm.weight | BF16 | [4096] model.layers.7.self_attn.k_proj.weight -> blk.7.attn_k.weight | BF16 | [4096, 4096] model.layers.7.self_attn.o_proj.weight -> blk.7.attn_output.weight | BF16 | [4096, 4096] model.layers.7.self_attn.q_proj.weight -> blk.7.attn_q.weight | BF16 | [4096, 4096] model.layers.7.self_attn.v_proj.weight -> blk.7.attn_v.weight | BF16 | [4096, 4096] model.layers.8.input_layernorm.weight -> blk.8.attn_norm.weight | BF16 | [4096] model.layers.8.mlp.down_proj.weight -> blk.8.ffn_down.weight | BF16 | [4096, 11008] model.layers.8.mlp.gate_proj.weight -> blk.8.ffn_gate.weight | BF16 | [11008, 4096] model.layers.8.mlp.up_proj.weight -> blk.8.ffn_up.weight | BF16 | [11008, 4096] model.layers.8.post_attention_layernorm.weight -> blk.8.ffn_norm.weight | BF16 | [4096] model.layers.8.self_attn.k_proj.weight -> blk.8.attn_k.weight | BF16 | [4096, 4096] model.layers.8.self_attn.o_proj.weight -> blk.8.attn_output.weight | BF16 | [4096, 4096] model.layers.8.self_attn.q_proj.weight -> blk.8.attn_q.weight | BF16 | [4096, 4096] model.layers.8.self_attn.v_proj.weight -> blk.8.attn_v.weight | BF16 | [4096, 4096] model.layers.9.input_layernorm.weight -> blk.9.attn_norm.weight | BF16 | [4096] model.layers.9.mlp.down_proj.weight -> blk.9.ffn_down.weight | BF16 | [4096, 11008] model.layers.9.mlp.gate_proj.weight -> blk.9.ffn_gate.weight | BF16 | [11008, 4096] model.layers.9.mlp.up_proj.weight -> blk.9.ffn_up.weight | BF16 | [11008, 4096] model.layers.9.post_attention_layernorm.weight -> blk.9.ffn_norm.weight | BF16 | [4096] model.layers.9.self_attn.k_proj.weight -> blk.9.attn_k.weight | BF16 | [4096, 4096] model.layers.9.self_attn.o_proj.weight -> blk.9.attn_output.weight | BF16 | [4096, 4096] model.layers.9.self_attn.q_proj.weight -> blk.9.attn_q.weight | BF16 | [4096, 4096] model.layers.9.self_attn.v_proj.weight -> blk.9.attn_v.weight | BF16 | [4096, 4096] model.layers.11.input_layernorm.weight -> blk.11.attn_norm.weight | BF16 | [4096] model.layers.11.mlp.down_proj.weight -> blk.11.ffn_down.weight | BF16 | [4096, 11008] model.layers.11.mlp.gate_proj.weight -> blk.11.ffn_gate.weight | BF16 | [11008, 4096] 
model.layers.11.mlp.up_proj.weight -> blk.11.ffn_up.weight | BF16 | [11008, 4096] model.layers.11.post_attention_layernorm.weight -> blk.11.ffn_norm.weight | BF16 | [4096] model.layers.11.self_attn.o_proj.weight -> blk.11.attn_output.weight | BF16 | [4096, 4096] model.layers.11.self_attn.v_proj.weight -> blk.11.attn_v.weight | BF16 | [4096, 4096] model.layers.12.input_layernorm.weight -> blk.12.attn_norm.weight | BF16 | [4096] model.layers.12.mlp.down_proj.weight -> blk.12.ffn_down.weight | BF16 | [4096, 11008] model.layers.12.mlp.gate_proj.weight -> blk.12.ffn_gate.weight | BF16 | [11008, 4096] model.layers.12.mlp.up_proj.weight -> blk.12.ffn_up.weight | BF16 | [11008, 4096] model.layers.12.post_attention_layernorm.weight -> blk.12.ffn_norm.weight | BF16 | [4096] model.layers.12.self_attn.k_proj.weight -> blk.12.attn_k.weight | BF16 | [4096, 4096] model.layers.12.self_attn.o_proj.weight -> blk.12.attn_output.weight | BF16 | [4096, 4096] model.layers.12.self_attn.q_proj.weight -> blk.12.attn_q.weight | BF16 | [4096, 4096] model.layers.12.self_attn.v_proj.weight -> blk.12.attn_v.weight | BF16 | [4096, 4096] model.layers.13.input_layernorm.weight -> blk.13.attn_norm.weight | BF16 | [4096] model.layers.13.mlp.down_proj.weight -> blk.13.ffn_down.weight | BF16 | [4096, 11008] model.layers.13.mlp.gate_proj.weight -> blk.13.ffn_gate.weight | BF16 | [11008, 4096] model.layers.13.mlp.up_proj.weight -> blk.13.ffn_up.weight | BF16 | [11008, 4096] model.layers.13.post_attention_layernorm.weight -> blk.13.ffn_norm.weight | BF16 | [4096] model.layers.13.self_attn.k_proj.weight -> blk.13.attn_k.weight | BF16 | [4096, 4096] model.layers.13.self_attn.o_proj.weight -> blk.13.attn_output.weight | BF16 | [4096, 4096] model.layers.13.self_attn.q_proj.weight -> blk.13.attn_q.weight | BF16 | [4096, 4096] model.layers.13.self_attn.v_proj.weight -> blk.13.attn_v.weight | BF16 | [4096, 4096] model.layers.14.input_layernorm.weight -> blk.14.attn_norm.weight | BF16 | [4096] model.layers.14.mlp.down_proj.weight -> blk.14.ffn_down.weight | BF16 | [4096, 11008] model.layers.14.mlp.gate_proj.weight -> blk.14.ffn_gate.weight | BF16 | [11008, 4096] model.layers.14.mlp.up_proj.weight -> blk.14.ffn_up.weight | BF16 | [11008, 4096] model.layers.14.post_attention_layernorm.weight -> blk.14.ffn_norm.weight | BF16 | [4096] model.layers.14.self_attn.k_proj.weight -> blk.14.attn_k.weight | BF16 | [4096, 4096] model.layers.14.self_attn.o_proj.weight -> blk.14.attn_output.weight | BF16 | [4096, 4096] model.layers.14.self_attn.q_proj.weight -> blk.14.attn_q.weight | BF16 | [4096, 4096] model.layers.14.self_attn.v_proj.weight -> blk.14.attn_v.weight | BF16 | [4096, 4096] model.layers.15.input_layernorm.weight -> blk.15.attn_norm.weight | BF16 | [4096] model.layers.15.mlp.down_proj.weight -> blk.15.ffn_down.weight | BF16 | [4096, 11008] model.layers.15.mlp.gate_proj.weight -> blk.15.ffn_gate.weight | BF16 | [11008, 4096] model.layers.15.mlp.up_proj.weight -> blk.15.ffn_up.weight | BF16 | [11008, 4096] model.layers.15.post_attention_layernorm.weight -> blk.15.ffn_norm.weight | BF16 | [4096] model.layers.15.self_attn.k_proj.weight -> blk.15.attn_k.weight | BF16 | [4096, 4096] model.layers.15.self_attn.o_proj.weight -> blk.15.attn_output.weight | BF16 | [4096, 4096] model.layers.15.self_attn.q_proj.weight -> blk.15.attn_q.weight | BF16 | [4096, 4096] model.layers.15.self_attn.v_proj.weight -> blk.15.attn_v.weight | BF16 | [4096, 4096] model.layers.16.input_layernorm.weight -> blk.16.attn_norm.weight | BF16 | [4096] 
model.layers.16.mlp.down_proj.weight -> blk.16.ffn_down.weight | BF16 | [4096, 11008] model.layers.16.mlp.gate_proj.weight -> blk.16.ffn_gate.weight | BF16 | [11008, 4096] model.layers.16.mlp.up_proj.weight -> blk.16.ffn_up.weight | BF16 | [11008, 4096] model.layers.16.post_attention_layernorm.weight -> blk.16.ffn_norm.weight | BF16 | [4096] model.layers.16.self_attn.k_proj.weight -> blk.16.attn_k.weight | BF16 | [4096, 4096] model.layers.16.self_attn.o_proj.weight -> blk.16.attn_output.weight | BF16 | [4096, 4096] model.layers.16.self_attn.q_proj.weight -> blk.16.attn_q.weight | BF16 | [4096, 4096] model.layers.16.self_attn.v_proj.weight -> blk.16.attn_v.weight | BF16 | [4096, 4096] model.layers.17.input_layernorm.weight -> blk.17.attn_norm.weight | BF16 | [4096] model.layers.17.mlp.down_proj.weight -> blk.17.ffn_down.weight | BF16 | [4096, 11008] model.layers.17.mlp.gate_proj.weight -> blk.17.ffn_gate.weight | BF16 | [11008, 4096] model.layers.17.mlp.up_proj.weight -> blk.17.ffn_up.weight | BF16 | [11008, 4096] model.layers.17.post_attention_layernorm.weight -> blk.17.ffn_norm.weight | BF16 | [4096] model.layers.17.self_attn.k_proj.weight -> blk.17.attn_k.weight | BF16 | [4096, 4096] model.layers.17.self_attn.o_proj.weight -> blk.17.attn_output.weight | BF16 | [4096, 4096] model.layers.17.self_attn.q_proj.weight -> blk.17.attn_q.weight | BF16 | [4096, 4096] model.layers.17.self_attn.v_proj.weight -> blk.17.attn_v.weight | BF16 | [4096, 4096] model.layers.18.input_layernorm.weight -> blk.18.attn_norm.weight | BF16 | [4096] model.layers.18.mlp.down_proj.weight -> blk.18.ffn_down.weight | BF16 | [4096, 11008] model.layers.18.mlp.gate_proj.weight -> blk.18.ffn_gate.weight | BF16 | [11008, 4096] model.layers.18.mlp.up_proj.weight -> blk.18.ffn_up.weight | BF16 | [11008, 4096] model.layers.18.post_attention_layernorm.weight -> blk.18.ffn_norm.weight | BF16 | [4096] model.layers.18.self_attn.k_proj.weight -> blk.18.attn_k.weight | BF16 | [4096, 4096] model.layers.18.self_attn.o_proj.weight -> blk.18.attn_output.weight | BF16 | [4096, 4096] model.layers.18.self_attn.q_proj.weight -> blk.18.attn_q.weight | BF16 | [4096, 4096] model.layers.18.self_attn.v_proj.weight -> blk.18.attn_v.weight | BF16 | [4096, 4096] model.layers.19.input_layernorm.weight -> blk.19.attn_norm.weight | BF16 | [4096] model.layers.19.mlp.down_proj.weight -> blk.19.ffn_down.weight | BF16 | [4096, 11008] model.layers.19.mlp.gate_proj.weight -> blk.19.ffn_gate.weight | BF16 | [11008, 4096] model.layers.19.mlp.up_proj.weight -> blk.19.ffn_up.weight | BF16 | [11008, 4096] model.layers.19.post_attention_layernorm.weight -> blk.19.ffn_norm.weight | BF16 | [4096] model.layers.19.self_attn.k_proj.weight -> blk.19.attn_k.weight | BF16 | [4096, 4096] model.layers.19.self_attn.o_proj.weight -> blk.19.attn_output.weight | BF16 | [4096, 4096] model.layers.19.self_attn.q_proj.weight -> blk.19.attn_q.weight | BF16 | [4096, 4096] model.layers.19.self_attn.v_proj.weight -> blk.19.attn_v.weight | BF16 | [4096, 4096] model.layers.20.input_layernorm.weight -> blk.20.attn_norm.weight | BF16 | [4096] model.layers.20.mlp.down_proj.weight -> blk.20.ffn_down.weight | BF16 | [4096, 11008] model.layers.20.mlp.gate_proj.weight -> blk.20.ffn_gate.weight | BF16 | [11008, 4096] model.layers.20.mlp.up_proj.weight -> blk.20.ffn_up.weight | BF16 | [11008, 4096] model.layers.20.post_attention_layernorm.weight -> blk.20.ffn_norm.weight | BF16 | [4096] model.layers.20.self_attn.k_proj.weight -> blk.20.attn_k.weight | BF16 | [4096, 4096] 
model.layers.20.self_attn.o_proj.weight -> blk.20.attn_output.weight | BF16 | [4096, 4096] model.layers.20.self_attn.q_proj.weight -> blk.20.attn_q.weight | BF16 | [4096, 4096] model.layers.20.self_attn.v_proj.weight -> blk.20.attn_v.weight | BF16 | [4096, 4096] model.layers.21.input_layernorm.weight -> blk.21.attn_norm.weight | BF16 | [4096] model.layers.21.mlp.down_proj.weight -> blk.21.ffn_down.weight | BF16 | [4096, 11008] model.layers.21.mlp.gate_proj.weight -> blk.21.ffn_gate.weight | BF16 | [11008, 4096] model.layers.21.mlp.up_proj.weight -> blk.21.ffn_up.weight | BF16 | [11008, 4096] model.layers.21.post_attention_layernorm.weight -> blk.21.ffn_norm.weight | BF16 | [4096] model.layers.21.self_attn.k_proj.weight -> blk.21.attn_k.weight | BF16 | [4096, 4096] model.layers.21.self_attn.o_proj.weight -> blk.21.attn_output.weight | BF16 | [4096, 4096] model.layers.21.self_attn.q_proj.weight -> blk.21.attn_q.weight | BF16 | [4096, 4096] model.layers.21.self_attn.v_proj.weight -> blk.21.attn_v.weight | BF16 | [4096, 4096] model.layers.22.input_layernorm.weight -> blk.22.attn_norm.weight | BF16 | [4096] model.layers.22.mlp.down_proj.weight -> blk.22.ffn_down.weight | BF16 | [4096, 11008] model.layers.22.mlp.gate_proj.weight -> blk.22.ffn_gate.weight | BF16 | [11008, 4096] model.layers.22.mlp.up_proj.weight -> blk.22.ffn_up.weight | BF16 | [11008, 4096] model.layers.22.post_attention_layernorm.weight -> blk.22.ffn_norm.weight | BF16 | [4096] model.layers.22.self_attn.k_proj.weight -> blk.22.attn_k.weight | BF16 | [4096, 4096] model.layers.22.self_attn.o_proj.weight -> blk.22.attn_output.weight | BF16 | [4096, 4096] model.layers.22.self_attn.q_proj.weight -> blk.22.attn_q.weight | BF16 | [4096, 4096] model.layers.22.self_attn.v_proj.weight -> blk.22.attn_v.weight | BF16 | [4096, 4096] model.layers.23.self_attn.k_proj.weight -> blk.23.attn_k.weight | BF16 | [4096, 4096] model.layers.23.self_attn.o_proj.weight -> blk.23.attn_output.weight | BF16 | [4096, 4096] model.layers.23.self_attn.q_proj.weight -> blk.23.attn_q.weight | BF16 | [4096, 4096] model.layers.23.self_attn.v_proj.weight -> blk.23.attn_v.weight | BF16 | [4096, 4096] lm_head.weight -> output.weight | BF16 | [56064, 4096] model.layers.23.input_layernorm.weight -> blk.23.attn_norm.weight | BF16 | [4096] model.layers.23.mlp.down_proj.weight -> blk.23.ffn_down.weight | BF16 | [4096, 11008] model.layers.23.mlp.gate_proj.weight -> blk.23.ffn_gate.weight | BF16 | [11008, 4096] model.layers.23.mlp.up_proj.weight -> blk.23.ffn_up.weight | BF16 | [11008, 4096] model.layers.23.post_attention_layernorm.weight -> blk.23.ffn_norm.weight | BF16 | [4096] model.layers.24.input_layernorm.weight -> blk.24.attn_norm.weight | BF16 | [4096] model.layers.24.mlp.down_proj.weight -> blk.24.ffn_down.weight | BF16 | [4096, 11008] model.layers.24.mlp.gate_proj.weight -> blk.24.ffn_gate.weight | BF16 | [11008, 4096] model.layers.24.mlp.up_proj.weight -> blk.24.ffn_up.weight | BF16 | [11008, 4096] model.layers.24.post_attention_layernorm.weight -> blk.24.ffn_norm.weight | BF16 | [4096] model.layers.24.self_attn.k_proj.weight -> blk.24.attn_k.weight | BF16 | [4096, 4096] model.layers.24.self_attn.o_proj.weight -> blk.24.attn_output.weight | BF16 | [4096, 4096] model.layers.24.self_attn.q_proj.weight -> blk.24.attn_q.weight | BF16 | [4096, 4096] model.layers.24.self_attn.v_proj.weight -> blk.24.attn_v.weight | BF16 | [4096, 4096] model.layers.25.input_layernorm.weight -> blk.25.attn_norm.weight | BF16 | [4096] model.layers.25.mlp.down_proj.weight -> 
blk.25.ffn_down.weight | BF16 | [4096, 11008] model.layers.25.mlp.gate_proj.weight -> blk.25.ffn_gate.weight | BF16 | [11008, 4096] model.layers.25.mlp.up_proj.weight -> blk.25.ffn_up.weight | BF16 | [11008, 4096] model.layers.25.post_attention_layernorm.weight -> blk.25.ffn_norm.weight | BF16 | [4096] model.layers.25.self_attn.k_proj.weight -> blk.25.attn_k.weight | BF16 | [4096, 4096] model.layers.25.self_attn.o_proj.weight -> blk.25.attn_output.weight | BF16 | [4096, 4096] model.layers.25.self_attn.q_proj.weight -> blk.25.attn_q.weight | BF16 | [4096, 4096] model.layers.25.self_attn.v_proj.weight -> blk.25.attn_v.weight | BF16 | [4096, 4096] model.layers.26.input_layernorm.weight -> blk.26.attn_norm.weight | BF16 | [4096] model.layers.26.mlp.down_proj.weight -> blk.26.ffn_down.weight | BF16 | [4096, 11008] model.layers.26.mlp.gate_proj.weight -> blk.26.ffn_gate.weight | BF16 | [11008, 4096] model.layers.26.mlp.up_proj.weight -> blk.26.ffn_up.weight | BF16 | [11008, 4096] model.layers.26.post_attention_layernorm.weight -> blk.26.ffn_norm.weight | BF16 | [4096] model.layers.26.self_attn.k_proj.weight -> blk.26.attn_k.weight | BF16 | [4096, 4096] model.layers.26.self_attn.o_proj.weight -> blk.26.attn_output.weight | BF16 | [4096, 4096] model.layers.26.self_attn.q_proj.weight -> blk.26.attn_q.weight | BF16 | [4096, 4096] model.layers.26.self_attn.v_proj.weight -> blk.26.attn_v.weight | BF16 | [4096, 4096] model.layers.27.input_layernorm.weight -> blk.27.attn_norm.weight | BF16 | [4096] model.layers.27.mlp.down_proj.weight -> blk.27.ffn_down.weight | BF16 | [4096, 11008] model.layers.27.mlp.gate_proj.weight -> blk.27.ffn_gate.weight | BF16 | [11008, 4096] model.layers.27.mlp.up_proj.weight -> blk.27.ffn_up.weight | BF16 | [11008, 4096] model.layers.27.post_attention_layernorm.weight -> blk.27.ffn_norm.weight | BF16 | [4096] model.layers.27.self_attn.k_proj.weight -> blk.27.attn_k.weight | BF16 | [4096, 4096] model.layers.27.self_attn.o_proj.weight -> blk.27.attn_output.weight | BF16 | [4096, 4096] model.layers.27.self_attn.q_proj.weight -> blk.27.attn_q.weight | BF16 | [4096, 4096] model.layers.27.self_attn.v_proj.weight -> blk.27.attn_v.weight | BF16 | [4096, 4096] model.layers.28.input_layernorm.weight -> blk.28.attn_norm.weight | BF16 | [4096] model.layers.28.mlp.down_proj.weight -> blk.28.ffn_down.weight | BF16 | [4096, 11008] model.layers.28.mlp.gate_proj.weight -> blk.28.ffn_gate.weight | BF16 | [11008, 4096] model.layers.28.mlp.up_proj.weight -> blk.28.ffn_up.weight | BF16 | [11008, 4096] model.layers.28.post_attention_layernorm.weight -> blk.28.ffn_norm.weight | BF16 | [4096] model.layers.28.self_attn.k_proj.weight -> blk.28.attn_k.weight | BF16 | [4096, 4096] model.layers.28.self_attn.o_proj.weight -> blk.28.attn_output.weight | BF16 | [4096, 4096] model.layers.28.self_attn.q_proj.weight -> blk.28.attn_q.weight | BF16 | [4096, 4096] model.layers.28.self_attn.v_proj.weight -> blk.28.attn_v.weight | BF16 | [4096, 4096] model.layers.29.input_layernorm.weight -> blk.29.attn_norm.weight | BF16 | [4096] model.layers.29.mlp.down_proj.weight -> blk.29.ffn_down.weight | BF16 | [4096, 11008] model.layers.29.mlp.gate_proj.weight -> blk.29.ffn_gate.weight | BF16 | [11008, 4096] model.layers.29.mlp.up_proj.weight -> blk.29.ffn_up.weight | BF16 | [11008, 4096] model.layers.29.post_attention_layernorm.weight -> blk.29.ffn_norm.weight | BF16 | [4096] model.layers.29.self_attn.k_proj.weight -> blk.29.attn_k.weight | BF16 | [4096, 4096] model.layers.29.self_attn.o_proj.weight -> 
blk.29.attn_output.weight | BF16 | [4096, 4096] model.layers.29.self_attn.q_proj.weight -> blk.29.attn_q.weight | BF16 | [4096, 4096] model.layers.29.self_attn.v_proj.weight -> blk.29.attn_v.weight | BF16 | [4096, 4096] model.layers.30.input_layernorm.weight -> blk.30.attn_norm.weight | BF16 | [4096] model.layers.30.mlp.down_proj.weight -> blk.30.ffn_down.weight | BF16 | [4096, 11008] model.layers.30.mlp.gate_proj.weight -> blk.30.ffn_gate.weight | BF16 | [11008, 4096] model.layers.30.mlp.up_proj.weight -> blk.30.ffn_up.weight | BF16 | [11008, 4096] model.layers.30.post_attention_layernorm.weight -> blk.30.ffn_norm.weight | BF16 | [4096] model.layers.30.self_attn.k_proj.weight -> blk.30.attn_k.weight | BF16 | [4096, 4096] model.layers.30.self_attn.o_proj.weight -> blk.30.attn_output.weight | BF16 | [4096, 4096] model.layers.30.self_attn.q_proj.weight -> blk.30.attn_q.weight | BF16 | [4096, 4096] model.layers.30.self_attn.v_proj.weight -> blk.30.attn_v.weight | BF16 | [4096, 4096] model.layers.31.input_layernorm.weight -> blk.31.attn_norm.weight | BF16 | [4096] model.layers.31.mlp.down_proj.weight -> blk.31.ffn_down.weight | BF16 | [4096, 11008] model.layers.31.mlp.gate_proj.weight -> blk.31.ffn_gate.weight | BF16 | [11008, 4096] model.layers.31.mlp.up_proj.weight -> blk.31.ffn_up.weight | BF16 | [11008, 4096] model.layers.31.post_attention_layernorm.weight -> blk.31.ffn_norm.weight | BF16 | [4096] model.layers.31.self_attn.k_proj.weight -> blk.31.attn_k.weight | BF16 | [4096, 4096] model.layers.31.self_attn.o_proj.weight -> blk.31.attn_output.weight | BF16 | [4096, 4096] model.layers.31.self_attn.q_proj.weight -> blk.31.attn_q.weight | BF16 | [4096, 4096] model.layers.31.self_attn.v_proj.weight -> blk.31.attn_v.weight | BF16 | [4096, 4096] model.norm.weight -> output_norm.weight | BF16 | [4096] Writing ch_taide_medicine.gguf-unsloth.F16.gguf, format 1 Traceback (most recent call last): File "/GPUData/working/unsloth/convert__unsloth_to_gguf.py", line 44, in if True: model.save_pretrained_gguf("ch_taide_medicine.gguf", tokenizer, quantization_method = "quantized") File "/home/chtseng/envs/LM2/lib/python3.10/site-packages/unsloth/save.py", line 1333, in unsloth_save_pretrained_gguf file_location = save_to_gguf(model_type, new_save_directory, quantization_method, first_conversion, makefile) File "/home/chtseng/envs/LM2/lib/python3.10/site-packages/unsloth/save.py", line 957, in save_to_gguf raise RuntimeError( RuntimeError: Unsloth: Quantization failed for ./ch_taide_medicine.gguf-unsloth.F16.gguf You might have to compile llama.cpp yourself, then run this again. You do not need to close this Python program. Run the following commands in a new terminal: You must run this in the same folder as you're saving your model. git clone https://github.com/ggerganov/llama.cpp cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j Once that's done, redo the quantization.

ch-tseng avatar Apr 16 '24 15:04 ch-tseng

@ch-tseng Sorry about the issue - unfortunately you'll have to manually convert it to GGUF via https://github.com/unslothai/unsloth/wiki#manually-saving-to-gguf.

danielhanchen avatar Apr 16 '24 17:04 danielhanchen

I'm unsure if the latest release fixes this

danielhanchen avatar Apr 21 '24 19:04 danielhanchen

This appears to still be an issue.

My setup: AMD Ryzen 5, NVIDIA RTX 3060 12 GB, 96 GB RAM, Garuda (Arch) Linux

Here's the specific line that fails (the training is literally just 2 prompts to test the process -- training itself works fine):

model.save_pretrained_gguf(os.path.join(os.getcwd(), "Llama-3-8B-Instruct-TEST"), tokenizer, quantization_method = "q4_k_m", maximum_memory_usage = 0.5)

...which results in the familiar error:

git clone --recursive https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && make all -j
Once that's done, redo the quantization.

I performed those exact steps and it makes no difference. I also tried the workaround mentioned here: https://github.com/unslothai/unsloth/issues/748, which is just using an earlier commit of llama.cpp to address that problem -- no change.

I get the same problem whether it's llama3 or llama3.1.

To be clear, it does seem to successfully create "the model" (I see the 4 .safetensors files), but it fails to create the quantized .gguf model.

@danielhanchen I see your mention of the manual workaround, and I see this code there:

```
git clone --recursive https://github.com/ggerganov/llama.cpp
make clean -C llama.cpp
make all -j -C llama.cpp
pip install gguf protobuf

python llama.cpp/convert-hf-to-gguf.py FOLDER --outfile OUTPUT --outtype f16
```

...but from what I can tell, that simply recreates the step that save_pretrained_gguf already completed successfully. It's the quantization step that failed. How do we save the quantized model from our fine-tuning, manually or otherwise?

GeneralProtectionFault avatar Sep 07 '24 20:09 GeneralProtectionFault

Ok I think I have at least a clue, in case it helps. I have not tested running the model yet, but I did manage to convert it.

So, after fine-tuning (using the typical unsloth notebooks as a guide for the code), running this:

model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit")

...should export the safetensors files (I believe this is the Hugging Face format). Again, I think this is all that is accomplished by the code I saw on the page to work around this bug manually, but I could be wrong.
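(If it helps anyone verify that step: the exported folder should load as a plain Hugging Face checkpoint with unsloth out of the picture. A quick, optional sanity check, assuming transformers is installed and the folder name matches the one above:)

```python
# Optional sanity check: load the merged export back as a normal HF checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_model = AutoModelForCausalLM.from_pretrained("merged_model", torch_dtype="auto")
hf_tokenizer = AutoTokenizer.from_pretrained("merged_model")
print(type(hf_model).__name__)  # should print a Llama*ForCausalLM class name
```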

So, going directly to the llama.cpp GitHub repo, it seems the README is out of date. It references "convert.py", but that script isn't in there anymore. It looks like they replaced it with more specific scripts:

  • convert_hf_to_gguf.py
  • convert_hf_to_gguf_update.py (I have no idea what the "update" is, I didn't try this one)
  • convert_llama_ggml_to_gguf.py
  • convert_lora_to_gguf.py

Also, it mentions "quantize" as the binary that does that 2nd step. After building (with make), I don't see that binary. But I do see this, which appears to be a simple name change:

  • llama-quantize

So essentially, given the Hugging Face safetensors files, we need to convert to GGUF first -- WITHOUT quantizing (FP16 or FP32, as I understand it), i.e.:

python convert_hf_to_gguf.py {the_folder_with_your_model_files}

Then:

llama-quantize {source_fp16_gguf_file} {destination_filename} Q4_K_M

**** BUT **** This did not work when I tried it from the llama.cpp checkout inside the unsloth folder. Running it there gave me missing requirements, and installing those requirements created a dependency mess, because that llama.cpp uses an older version of torch (prior to 2.4) that causes xformers conflicts, etc. Dependency hell confirmed...

So I made separate Python virtual environments for unsloth and llama.cpp: I cloned llama.cpp into a separate location, installed its requirements in its own virtual environment, ran make in that folder, and then ran the commands above from there on the HF model that unsloth was at least able to spit out.
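In case it's useful, that setup looks roughly like this when scripted (the paths are placeholders for my own layout; llama.cpp's requirements.txt and Makefile live at the repo root, at least in the version I used):

```python
import subprocess

# Placeholder paths -- adjust to your own layout.
llamacpp_dir = "/home/user/Documents/Python/llama.cpp"
venv_dir = "/home/user/Documents/Python/venv_llamacpp"

# Clone llama.cpp somewhere outside the unsloth tree and build the binaries (llama-quantize etc.).
subprocess.run(["git", "clone", "--recursive", "https://github.com/ggerganov/llama.cpp", llamacpp_dir], check=True)
subprocess.run(["make", "-C", llamacpp_dir, "-j"], check=True)

# Give llama.cpp its own venv so its torch/gguf pins don't conflict with unsloth's environment.
subprocess.run(["python3", "-m", "venv", venv_dir], check=True)
subprocess.run([f"{venv_dir}/bin/pip", "install", "-r", f"{llamacpp_dir}/requirements.txt"], check=True)
```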

I don't know enough about unsloth to know what the best solution is, or if this is even the root cause. But that seems to work. I don't know if there's a better way around that, like matching unsloth's torch dependency to llama.cpp's, etc...

GeneralProtectionFault avatar Sep 07 '24 23:09 GeneralProtectionFault

To illustrate the point with a concrete example -- this works, however inelegant it is:

```python
import os
import shlex
import subprocess

# IMPORTANT
# This Python code is run from a virtual environment that installed UNSLOTH's requirements.
# The first subprocess calls the python binary from a 2nd virtual environment that installed
# LLAMA.CPP's requirements. This lets us run the Python scripts we get from llama.cpp without
# running into a dependency conflict.
# (`model` and `tokenizer` come from the fine-tuning code above. `source_model_name` is defined
# earlier in my full script, not shown here; it is the base name of the F16 .gguf that
# convert_hf_to_gguf.py writes into the model folder.)

llamacpp_venv_bin = "/home/user/Documents/Python/venv_llamacpp/bin/python"
llamacpp_hf_gguf_script = "/home/user/Documents/Python/llama.cpp/convert_hf_to_gguf.py"
llamacpp_quantize_bin = "/home/user/Documents/Python/llama.cpp/llama-quantize"

new_fp16_model_name = "Llama3.1-TestModel-FP16"
new_quantized_model_name = "Llama3.1-TestModel-4bit"
quantize_options = ["Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q2_K"]
quantize_method = "Q4_K_M"

# This will save the model as safetensors files (Hugging Face format).
# It creates a folder under the current working directory for whatever name is passed in as the
# 1st argument (that's why it is used below in determining the paths).
model.save_pretrained_merged(new_fp16_model_name, tokenizer, save_method = "merged_16bit", maximum_memory_usage = 0.5)

# This calls the python binary from the venv made specifically for llama.cpp. This is necessary
# since unsloth & llama.cpp use different versions of torch.
# Use the llama.cpp python scripts to convert the Hugging Face model to GGUF
# (this will still be 16-bit, not quantized!).
subprocess.call(shlex.split(f"{llamacpp_venv_bin} {llamacpp_hf_gguf_script} {os.path.join(os.getcwd(), new_fp16_model_name)}"))

# Now, quantize it with the binary from llama.cpp.
# This calls the llama-quantize binary from llama.cpp, followed by the source filename and the
# destination filename.
subprocess.call(shlex.split(f"{llamacpp_quantize_bin} {os.path.join(os.getcwd(), new_fp16_model_name, source_model_name)}-F16.gguf "
                            f"{os.path.join(os.getcwd(), new_fp16_model_name, new_quantized_model_name)}.gguf {quantize_method}"))
```

GeneralProtectionFault avatar Sep 08 '24 20:09 GeneralProtectionFault

@GeneralProtectionFault Apologies on the delay!

Yep - so model.save_pretrained_merged will merge to 16bit, then we use llama.cpp to first convert it to F16 format, then use llama-quantize to quantize it. I'm trying to make the whole process more stable.

danielhanchen avatar Sep 10 '24 08:09 danielhanchen

Is this related to my issue with gemma 2 27b bnb 4bit?

vllm/aphrodite/sglang all return an error: KeyError: 'model.layers.0.mlp.down_proj.weight'

https://huggingface.co/unsloth/gemma-2-27b-it-bnb-4bit/discussions/5

fullstackwebdev avatar Sep 10 '24 20:09 fullstackwebdev