litgpt
Blockwise quantization only supports 16/32-bit floats, but got torch.uint8 (`bnb.nf4` quantization is not working)
Hello, I am using the latest version of lit-gpt. First of all, it is much cleaner than before, so amazing work. However, I am facing a problem. After I convert a Hugging Face (Llama 2) model to a lit-gpt model, it runs as expected for
- float32
- float16
- int8
But when it comes to int4, I am getting an unexpected error. Here are the logs.
Usage:
I am using the example shown in this tutorial (just changed the model path):
litgpt generate base --quantize bnb.nf4 --checkpoint_dir /models/llama-2-7b-chat-litgpt/ --precision bf16-true
And I got this error:
Loading model '/models/llama-2-7b-chat-litgpt/lit_model.pth' with {'name': 'Llama-2-7b-chat-hf', 'hf_config': {'name': 'Llama-2-7b-chat-hf', 'org': 'meta-llama'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.26 seconds.
Traceback (most recent call last):
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
fn(**kwargs)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/generate/base.py", line 169, in main
model = fabric.setup_module(model)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 310, in setup_module
module = self._move_model_to_device(model=module, optimizers=[])
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 997, in _move_model_to_device
model = self.to_device(model)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 528, in to_device
self._strategy.module_to_device(obj)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/strategies/single_device.py", line 59, in module_to_device
module.to(self.root_device)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
return self._quantize(device)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
cc: @aniketmaurya @Andrei-Aksionov
Hey @Anindyadeep
Based on the error stacktrace, it looks like you are trying to load and quantize an already quantized model.
Have you done anything to the weights, or are they just the weights that were downloaded and converted by LitGPT and nothing more?
Hey, thanks for the reply. So the process was:
- Use a Hugging Face model
- Then use the litgpt convert to_litgpt --checkpoint_dir command to convert it to the LitGPT format
Ok, but what is the dtype of the HuggingFace model?
If it's already in a quantized form (torch.uint8), then it might explain the error.
You can provide a link to the repo with weights and I'll check it.
Hi, I see, so here's the thing,
I initially converted the HF weights to int8 using the litgpt CLI, and then I converted those same weights to int4 (which, it turns out, is not possible), and that is probably the reason. Which means that every time I need to start with the base LitGPT weights (in fp32) or the raw HF weights, right?
Let me try that; if it works, I will let you know and then we can close this issue :)
Correct. In order to use quantization you just need weights in a standard precision (fp32, fp16, bf16).
When the model is loaded and quantization is specified (e.g. bnb.nf4), the weights are quantized on the fly.
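For what it's worth, here is a minimal sketch (using the checkpoint path from your command above) to confirm that a converted checkpoint is still in a floating-point dtype before the on-the-fly quantization kicks in:

import torch

# Load the converted LitGPT checkpoint on CPU and collect the parameter dtypes.
state_dict = torch.load("/models/llama-2-7b-chat-litgpt/lit_model.pth", map_location="cpu")
dtypes = {t.dtype for t in state_dict.values() if torch.is_tensor(t)}
print(dtypes)  # expected: {torch.float32}, {torch.float16} or {torch.bfloat16}, not torch.uint8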
I see, got it. Let me try this out and I will keep you posted in this thread. Thanks for the heads-up.
Hi, so I tried the whole process once again. Here is what my Llama 2 weights folder contained after I ran this command:
litgpt convert to_litgpt --checkpoint_dir ./models/Llama-2-7b-chat-hf/
The run above was successful, and this is what the ./models/Llama-2-7b-chat-hf/ folder contained:
models/Llama-2-7b-chat-hf/
├── LICENSE.txt
├── README.md
├── USE_POLICY.md
├── config.json
├── generation_config.json
├── lit_model.pth
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── model_config.yaml
├── pytorch_model-00001-of-00002.bin
├── pytorch_model-00002-of-00002.bin
├── pytorch_model.bin.index.json
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
Now I typed this command:
litgpt generate base --quantize bnb.nf4 --checkpoint_dir models/Llama-2-7b-chat-hf --precision bf16-true --max_new_tokens 256
And got this error:
Loading model 'models/Llama-2-7b-chat-hf/lit_model.pth' with {'name': 'Llama-2-7b-chat-hf', 'hf_config': {'name': 'Llama-2-7b-chat-hf', 'org': 'meta-llama'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 64, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 32, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 11008, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.24 seconds.
Traceback (most recent call last):
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
fn(**kwargs)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/generate/base.py", line 169, in main
model = fabric.setup_module(model)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 310, in setup_module
module = self._move_model_to_device(model=module, optimizers=[])
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 997, in _move_model_to_device
model = self.to_device(model)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 528, in to_device
self._strategy.module_to_device(obj)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/strategies/single_device.py", line 59, in module_to_device
module.to(self.root_device)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
return self._quantize(device)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
But have you checked the dtype of the original weights (.safetensors) in ./models/Llama-2-7b-chat-hf/?
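A minimal sketch for checking this, assuming the safetensors package is installed; it prints the dtype of the first tensor in one shard:

from safetensors import safe_open

# Lazily open one shard and report the dtype of the first tensor stored in it.
with safe_open("models/Llama-2-7b-chat-hf/model-00001-of-00002.safetensors", framework="pt") as f:
    name = next(iter(f.keys()))
    print(name, f.get_tensor(name).dtype)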
Do you mean the weights of the LitGPT model or the HF model? As far as the HF model is concerned, those are the actual raw weights of Llama 2, so the dtype is float16.
Here is the HF config
{
"_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
"architectures": [
"LlamaForCausalLM"
],
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 11008,
"max_position_embeddings": 4096,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.32.0.dev0",
"use_cache": true,
"vocab_size": 32000
}
And here is the lit model_config.yaml file
bias: false
block_size: 4096
gelu_approximate: none
head_size: 128
hf_config:
  name: Llama-2-7b-chat-hf
  org: meta-llama
intermediate_size: 11008
lm_head_bias: false
mlp_class_name: LLaMAMLP
n_embd: 4096
n_expert: 0
n_expert_per_token: 0
n_head: 32
n_layer: 32
n_query_groups: 32
name: Llama-2-7b-chat-hf
norm_class_name: RMSNorm
norm_eps: 1.0e-05
padded_vocab_size: 32000
padding_multiple: 64
parallel_residual: false
rope_base: 10000
rope_condense_ratio: 1
rotary_percentage: 1.0
scale_embeddings: false
shared_attention_norm: false
vocab_size: 32000
The same thing is happening with Mistral too.
I still don't have access to either Llama 2 or some of the Mistral models.
But when I tried with Phi 2 everything worked fine.
Here is a code snippet (replace repo_id with the one you want to use):
export repo_id=microsoft/phi-2
litgpt download --repo_id $repo_id --convert_checkpoint false
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256
Okay, then let me try the same with Mistral, but this time with litgpt download.
I see, but I did the same thing for Mistral v0.1. Here is the set of commands:
export repo_id=mistralai/Mistral-7B-Instruct-v0.1
litgpt download --repo_id $repo_id --convert_checkpoint false --access_token hf_...
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256
And here are the logs:
(venv) anindya@prem-ai-a100-fin-01:~/workspace/benchmarks$ litgpt download --repo_id $repo_id --convert_checkpoint false --access_token hf_...
(venv) anindya@prem-ai-a100-fin-01:~/workspace/benchmarks$ export repo_id=mistralai/Mistral-7B-Instruct-v0.1
litgpt download --repo_id $repo_id --convert_checkpoint false --access_token hf_...
litgpt convert to_litgpt --checkpoint_dir checkpoints/$repo_id
litgpt generate base --quantize bnb.nf4 --checkpoint_dir checkpoints/$repo_id --precision bf16-true --max_new_tokens 256
Setting HF_HUB_ENABLE_HF_TRANSFER=1
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 571/571 [00:00<00:00, 3.37MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:00<00:00, 682kB/s]
pytorch_model-00001-of-00002.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.94G/9.94G [01:01<00:00, 162MB/s]
pytorch_model-00002-of-00002.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.06G/5.06G [00:34<00:00, 148MB/s]
pytorch_model.bin.index.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 72.8MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.80M/1.80M [00:00<00:00, 3.94MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 493k/493k [00:00<00:00, 6.39MB/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.47k/1.47k [00:00<00:00, 10.1MB/s]
Processing checkpoints/mistralai/Mistral-7B-Instruct-v0.1/pytorch_model-00001-of-00002.bin
Loading 'model.embed_tokens.weight' into RAM
Loading 'model.layers.0.self_attn.o_proj.weight' into RAM
Loading 'model.layers.0.mlp.gate_proj.weight' into RAM
Loading 'model.layers.0.mlp.up_proj.weight' into RAM
Loading 'model.layers.0.mlp.down_proj.weight' into RAM
Loading 'model.layers.0.input_layernorm.weight' into RAM
Loading 'model.layers.0.post_attention_layernorm.weight' into RAM
Loading 'model.layers.1.self_attn.o_proj.weight' into RAM
Loading 'model.layers.1.mlp.gate_proj.weight' into RAM
Loading 'model.layers.1.mlp.up_proj.weight' into RAM
Loading 'model.layers.1.mlp.down_proj.weight' into RAM
Loading 'model.layers.1.input_layernorm.weight' into RAM
Loading 'model.layers.1.post_attention_layernorm.weight' into RAM
Loading 'model.layers.2.self_attn.o_proj.weight' into RAM
Loading 'model.layers.2.mlp.gate_proj.weight' into RAM
Loading 'model.layers.2.mlp.up_proj.weight' into RAM
Loading 'model.layers.2.mlp.down_proj.weight' into RAM
Loading 'model.layers.2.input_layernorm.weight' into RAM
Loading 'model.layers.2.post_attention_layernorm.weight' into RAM
Loading 'model.layers.3.self_attn.o_proj.weight' into RAM
Loading 'model.layers.3.mlp.gate_proj.weight' into RAM
Loading 'model.layers.3.mlp.up_proj.weight' into RAM
Loading 'model.layers.3.mlp.down_proj.weight' into RAM
Loading 'model.layers.3.input_layernorm.weight' into RAM
Loading 'model.layers.3.post_attention_layernorm.weight' into RAM
Loading 'model.layers.4.self_attn.o_proj.weight' into RAM
Loading 'model.layers.4.mlp.gate_proj.weight' into RAM
Loading 'model.layers.4.mlp.up_proj.weight' into RAM
Loading 'model.layers.4.mlp.down_proj.weight' into RAM
Loading 'model.layers.4.input_layernorm.weight' into RAM
Loading 'model.layers.4.post_attention_layernorm.weight' into RAM
Loading 'model.layers.5.self_attn.o_proj.weight' into RAM
Loading 'model.layers.5.mlp.gate_proj.weight' into RAM
Loading 'model.layers.5.mlp.up_proj.weight' into RAM
Loading 'model.layers.5.mlp.down_proj.weight' into RAM
Loading 'model.layers.5.input_layernorm.weight' into RAM
Loading 'model.layers.5.post_attention_layernorm.weight' into RAM
Loading 'model.layers.6.self_attn.o_proj.weight' into RAM
Loading 'model.layers.6.mlp.gate_proj.weight' into RAM
Loading 'model.layers.6.mlp.up_proj.weight' into RAM
Loading 'model.layers.6.mlp.down_proj.weight' into RAM
Loading 'model.layers.6.input_layernorm.weight' into RAM
Loading 'model.layers.6.post_attention_layernorm.weight' into RAM
Loading 'model.layers.7.self_attn.o_proj.weight' into RAM
Loading 'model.layers.7.mlp.gate_proj.weight' into RAM
Loading 'model.layers.7.mlp.up_proj.weight' into RAM
Loading 'model.layers.7.mlp.down_proj.weight' into RAM
Loading 'model.layers.7.input_layernorm.weight' into RAM
Loading 'model.layers.7.post_attention_layernorm.weight' into RAM
Loading 'model.layers.8.self_attn.o_proj.weight' into RAM
Loading 'model.layers.8.mlp.gate_proj.weight' into RAM
Loading 'model.layers.8.mlp.up_proj.weight' into RAM
Loading 'model.layers.8.mlp.down_proj.weight' into RAM
Loading 'model.layers.8.input_layernorm.weight' into RAM
Loading 'model.layers.8.post_attention_layernorm.weight' into RAM
Loading 'model.layers.9.self_attn.o_proj.weight' into RAM
Loading 'model.layers.9.mlp.gate_proj.weight' into RAM
Loading 'model.layers.9.mlp.up_proj.weight' into RAM
Loading 'model.layers.9.mlp.down_proj.weight' into RAM
Loading 'model.layers.9.input_layernorm.weight' into RAM
Loading 'model.layers.9.post_attention_layernorm.weight' into RAM
Loading 'model.layers.10.self_attn.o_proj.weight' into RAM
Loading 'model.layers.10.mlp.gate_proj.weight' into RAM
Loading 'model.layers.10.mlp.up_proj.weight' into RAM
Loading 'model.layers.10.mlp.down_proj.weight' into RAM
Loading 'model.layers.10.input_layernorm.weight' into RAM
Loading 'model.layers.10.post_attention_layernorm.weight' into RAM
Loading 'model.layers.11.self_attn.o_proj.weight' into RAM
Loading 'model.layers.11.mlp.gate_proj.weight' into RAM
Loading 'model.layers.11.mlp.up_proj.weight' into RAM
Loading 'model.layers.11.mlp.down_proj.weight' into RAM
Loading 'model.layers.11.input_layernorm.weight' into RAM
Loading 'model.layers.11.post_attention_layernorm.weight' into RAM
Loading 'model.layers.12.self_attn.o_proj.weight' into RAM
Loading 'model.layers.12.mlp.gate_proj.weight' into RAM
Loading 'model.layers.12.mlp.up_proj.weight' into RAM
Loading 'model.layers.12.mlp.down_proj.weight' into RAM
Loading 'model.layers.12.input_layernorm.weight' into RAM
Loading 'model.layers.12.post_attention_layernorm.weight' into RAM
Loading 'model.layers.13.self_attn.o_proj.weight' into RAM
Loading 'model.layers.13.mlp.gate_proj.weight' into RAM
Loading 'model.layers.13.mlp.up_proj.weight' into RAM
Loading 'model.layers.13.mlp.down_proj.weight' into RAM
Loading 'model.layers.13.input_layernorm.weight' into RAM
Loading 'model.layers.13.post_attention_layernorm.weight' into RAM
Loading 'model.layers.14.self_attn.o_proj.weight' into RAM
Loading 'model.layers.14.mlp.gate_proj.weight' into RAM
Loading 'model.layers.14.mlp.up_proj.weight' into RAM
Loading 'model.layers.14.mlp.down_proj.weight' into RAM
Loading 'model.layers.14.input_layernorm.weight' into RAM
Loading 'model.layers.14.post_attention_layernorm.weight' into RAM
Loading 'model.layers.15.self_attn.o_proj.weight' into RAM
Loading 'model.layers.15.mlp.gate_proj.weight' into RAM
Loading 'model.layers.15.mlp.up_proj.weight' into RAM
Loading 'model.layers.15.mlp.down_proj.weight' into RAM
Loading 'model.layers.15.input_layernorm.weight' into RAM
Loading 'model.layers.15.post_attention_layernorm.weight' into RAM
Loading 'model.layers.16.self_attn.o_proj.weight' into RAM
Loading 'model.layers.16.mlp.gate_proj.weight' into RAM
Loading 'model.layers.16.mlp.up_proj.weight' into RAM
Loading 'model.layers.16.mlp.down_proj.weight' into RAM
Loading 'model.layers.16.input_layernorm.weight' into RAM
Loading 'model.layers.16.post_attention_layernorm.weight' into RAM
Loading 'model.layers.17.self_attn.o_proj.weight' into RAM
Loading 'model.layers.17.mlp.gate_proj.weight' into RAM
Loading 'model.layers.17.mlp.up_proj.weight' into RAM
Loading 'model.layers.17.mlp.down_proj.weight' into RAM
Loading 'model.layers.17.input_layernorm.weight' into RAM
Loading 'model.layers.17.post_attention_layernorm.weight' into RAM
Loading 'model.layers.18.self_attn.o_proj.weight' into RAM
Loading 'model.layers.18.mlp.gate_proj.weight' into RAM
Loading 'model.layers.18.mlp.up_proj.weight' into RAM
Loading 'model.layers.18.mlp.down_proj.weight' into RAM
Loading 'model.layers.18.input_layernorm.weight' into RAM
Loading 'model.layers.18.post_attention_layernorm.weight' into RAM
Loading 'model.layers.19.self_attn.o_proj.weight' into RAM
Loading 'model.layers.19.mlp.gate_proj.weight' into RAM
Loading 'model.layers.19.mlp.up_proj.weight' into RAM
Loading 'model.layers.19.mlp.down_proj.weight' into RAM
Loading 'model.layers.19.input_layernorm.weight' into RAM
Loading 'model.layers.19.post_attention_layernorm.weight' into RAM
Loading 'model.layers.20.self_attn.o_proj.weight' into RAM
Loading 'model.layers.20.mlp.gate_proj.weight' into RAM
Loading 'model.layers.20.mlp.up_proj.weight' into RAM
Loading 'model.layers.20.mlp.down_proj.weight' into RAM
Loading 'model.layers.20.input_layernorm.weight' into RAM
Loading 'model.layers.20.post_attention_layernorm.weight' into RAM
Loading 'model.layers.21.self_attn.o_proj.weight' into RAM
Loading 'model.layers.21.mlp.gate_proj.weight' into RAM
Loading 'model.layers.21.mlp.up_proj.weight' into RAM
Loading 'model.layers.21.mlp.down_proj.weight' into RAM
Loading 'model.layers.21.input_layernorm.weight' into RAM
Loading 'model.layers.21.post_attention_layernorm.weight' into RAM
Loading 'model.layers.22.self_attn.o_proj.weight' into RAM
Loading 'layer 0 q' into RAM
Loading 'layer 0 k' into RAM
Loading 'layer 0 v' into RAM
Loading 'layer 1 q' into RAM
Loading 'layer 1 k' into RAM
Loading 'layer 1 v' into RAM
Loading 'layer 2 q' into RAM
Loading 'layer 2 k' into RAM
Loading 'layer 2 v' into RAM
Loading 'layer 3 q' into RAM
Loading 'layer 3 k' into RAM
Loading 'layer 3 v' into RAM
Loading 'layer 4 q' into RAM
Loading 'layer 4 k' into RAM
Loading 'layer 4 v' into RAM
Loading 'layer 5 q' into RAM
Loading 'layer 5 k' into RAM
Loading 'layer 5 v' into RAM
Loading 'layer 6 q' into RAM
Loading 'layer 6 k' into RAM
Loading 'layer 6 v' into RAM
Loading 'layer 7 q' into RAM
Loading 'layer 7 k' into RAM
Loading 'layer 7 v' into RAM
Loading 'layer 8 q' into RAM
Loading 'layer 8 k' into RAM
Loading 'layer 8 v' into RAM
Loading 'layer 9 q' into RAM
Loading 'layer 9 k' into RAM
Loading 'layer 9 v' into RAM
Loading 'layer 10 q' into RAM
Loading 'layer 10 k' into RAM
Loading 'layer 10 v' into RAM
Loading 'layer 11 q' into RAM
Loading 'layer 11 k' into RAM
Loading 'layer 11 v' into RAM
Loading 'layer 12 q' into RAM
Loading 'layer 12 k' into RAM
Loading 'layer 12 v' into RAM
Loading 'layer 13 q' into RAM
Loading 'layer 13 k' into RAM
Loading 'layer 13 v' into RAM
Loading 'layer 14 q' into RAM
Loading 'layer 14 k' into RAM
Loading 'layer 14 v' into RAM
Loading 'layer 15 q' into RAM
Loading 'layer 15 k' into RAM
Loading 'layer 15 v' into RAM
Loading 'layer 16 q' into RAM
Loading 'layer 16 k' into RAM
Loading 'layer 16 v' into RAM
Loading 'layer 17 q' into RAM
Loading 'layer 17 k' into RAM
Loading 'layer 17 v' into RAM
Loading 'layer 18 q' into RAM
Loading 'layer 18 k' into RAM
Loading 'layer 18 v' into RAM
Loading 'layer 19 q' into RAM
Loading 'layer 19 k' into RAM
Loading 'layer 19 v' into RAM
Loading 'layer 20 q' into RAM
Loading 'layer 20 k' into RAM
Loading 'layer 20 v' into RAM
Loading 'layer 21 q' into RAM
Loading 'layer 21 k' into RAM
Loading 'layer 21 v' into RAM
Loading 'layer 22 q' into RAM
Loading 'layer 22 k' into RAM
Loading 'layer 22 v' into RAM
Processing checkpoints/mistralai/Mistral-7B-Instruct-v0.1/pytorch_model-00002-of-00002.bin
Loading 'model.layers.22.mlp.gate_proj.weight' into RAM
Loading 'model.layers.22.mlp.up_proj.weight' into RAM
Loading 'model.layers.22.mlp.down_proj.weight' into RAM
Loading 'model.layers.22.input_layernorm.weight' into RAM
Loading 'model.layers.22.post_attention_layernorm.weight' into RAM
Loading 'model.layers.23.self_attn.o_proj.weight' into RAM
Loading 'model.layers.23.mlp.gate_proj.weight' into RAM
Loading 'model.layers.23.mlp.up_proj.weight' into RAM
Loading 'model.layers.23.mlp.down_proj.weight' into RAM
Loading 'model.layers.23.input_layernorm.weight' into RAM
Loading 'model.layers.23.post_attention_layernorm.weight' into RAM
Loading 'model.layers.24.self_attn.o_proj.weight' into RAM
Loading 'model.layers.24.mlp.gate_proj.weight' into RAM
Loading 'model.layers.24.mlp.up_proj.weight' into RAM
Loading 'model.layers.24.mlp.down_proj.weight' into RAM
Loading 'model.layers.24.input_layernorm.weight' into RAM
Loading 'model.layers.24.post_attention_layernorm.weight' into RAM
Loading 'model.layers.25.self_attn.o_proj.weight' into RAM
Loading 'model.layers.25.mlp.gate_proj.weight' into RAM
Loading 'model.layers.25.mlp.up_proj.weight' into RAM
Loading 'model.layers.25.mlp.down_proj.weight' into RAM
Loading 'model.layers.25.input_layernorm.weight' into RAM
Loading 'model.layers.25.post_attention_layernorm.weight' into RAM
Loading 'model.layers.26.self_attn.o_proj.weight' into RAM
Loading 'model.layers.26.mlp.gate_proj.weight' into RAM
Loading 'model.layers.26.mlp.up_proj.weight' into RAM
Loading 'model.layers.26.mlp.down_proj.weight' into RAM
Loading 'model.layers.26.input_layernorm.weight' into RAM
Loading 'model.layers.26.post_attention_layernorm.weight' into RAM
Loading 'model.layers.27.self_attn.o_proj.weight' into RAM
Loading 'model.layers.27.mlp.gate_proj.weight' into RAM
Loading 'model.layers.27.mlp.up_proj.weight' into RAM
Loading 'model.layers.27.mlp.down_proj.weight' into RAM
Loading 'model.layers.27.input_layernorm.weight' into RAM
Loading 'model.layers.27.post_attention_layernorm.weight' into RAM
Loading 'model.layers.28.self_attn.o_proj.weight' into RAM
Loading 'model.layers.28.mlp.gate_proj.weight' into RAM
Loading 'model.layers.28.mlp.up_proj.weight' into RAM
Loading 'model.layers.28.mlp.down_proj.weight' into RAM
Loading 'model.layers.28.input_layernorm.weight' into RAM
Loading 'model.layers.28.post_attention_layernorm.weight' into RAM
Loading 'model.layers.29.self_attn.o_proj.weight' into RAM
Loading 'model.layers.29.mlp.gate_proj.weight' into RAM
Loading 'model.layers.29.mlp.up_proj.weight' into RAM
Loading 'model.layers.29.mlp.down_proj.weight' into RAM
Loading 'model.layers.29.input_layernorm.weight' into RAM
Loading 'model.layers.29.post_attention_layernorm.weight' into RAM
Loading 'model.layers.30.self_attn.o_proj.weight' into RAM
Loading 'model.layers.30.mlp.gate_proj.weight' into RAM
Loading 'model.layers.30.mlp.up_proj.weight' into RAM
Loading 'model.layers.30.mlp.down_proj.weight' into RAM
Loading 'model.layers.30.input_layernorm.weight' into RAM
Loading 'model.layers.30.post_attention_layernorm.weight' into RAM
Loading 'model.layers.31.self_attn.o_proj.weight' into RAM
Loading 'model.layers.31.mlp.gate_proj.weight' into RAM
Loading 'model.layers.31.mlp.up_proj.weight' into RAM
Loading 'model.layers.31.mlp.down_proj.weight' into RAM
Loading 'model.layers.31.input_layernorm.weight' into RAM
Loading 'model.layers.31.post_attention_layernorm.weight' into RAM
Loading 'model.norm.weight' into RAM
Loading 'lm_head.weight' into RAM
Loading 'layer 23 q' into RAM
Loading 'layer 23 k' into RAM
Loading 'layer 23 v' into RAM
Loading 'layer 24 q' into RAM
Loading 'layer 24 k' into RAM
Loading 'layer 24 v' into RAM
Loading 'layer 25 q' into RAM
Loading 'layer 25 k' into RAM
Loading 'layer 25 v' into RAM
Loading 'layer 26 q' into RAM
Loading 'layer 26 k' into RAM
Loading 'layer 26 v' into RAM
Loading 'layer 27 q' into RAM
Loading 'layer 27 k' into RAM
Loading 'layer 27 v' into RAM
Loading 'layer 28 q' into RAM
Loading 'layer 28 k' into RAM
Loading 'layer 28 v' into RAM
Loading 'layer 29 q' into RAM
Loading 'layer 29 k' into RAM
Loading 'layer 29 v' into RAM
Loading 'layer 30 q' into RAM
Loading 'layer 30 k' into RAM
Loading 'layer 30 v' into RAM
Loading 'layer 31 q' into RAM
Loading 'layer 31 k' into RAM
Loading 'layer 31 v' into RAM
Saving converted checkpoint to checkpoints/mistralai/Mistral-7B-Instruct-v0.1
Loading model 'checkpoints/mistralai/Mistral-7B-Instruct-v0.1/lit_model.pth' with {'name': 'Mistral-7B-Instruct-v0.1', 'hf_config': {'name': 'Mistral-7B-Instruct-v0.1', 'org': 'mistralai'}, 'scale_embeddings': False, 'block_size': 4096, 'vocab_size': 32000, 'padding_multiple': 512, 'padded_vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'head_size': 128, 'n_embd': 4096, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'LLaMAMLP', 'gelu_approximate': 'none', 'intermediate_size': 14336, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 128}
Time to instantiate model: 0.26 seconds.
Traceback (most recent call last):
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/__main__.py", line 143, in main
fn(**kwargs)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/litgpt/generate/base.py", line 169, in main
model = fabric.setup_module(model)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 310, in setup_module
module = self._move_model_to_device(model=module, optimizers=[])
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 997, in _move_model_to_device
model = self.to_device(model)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 528, in to_device
self._strategy.module_to_device(obj)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/lightning/fabric/strategies/single_device.py", line 59, in module_to_device
module.to(self.root_device)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
return self._apply(convert)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
module._apply(fn)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
param_applied = fn(param)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 324, in to
return self._quantize(device)
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 289, in _quantize
w_4bit, quant_state = bnb.functional.quantize_4bit(
File "/home/anindya/workspace/benchmarks/bench_lightning/venv/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1234, in quantize_4bit
raise ValueError(f"Blockwise quantization only supports 16/32-bit floats, but got {A.dtype}")
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
@carmocca Could you look into it? Since I don't have access to the model, I cannot even reproduce the issue.
Folks, is there any outcome so far? I see the issue has been hanging for more than a month. I can confirm that I hit the same error, just with Llama3-8B. Curiously, I experimented on my home PC with an RTX A2000, and exactly the same step-by-step process let me finetune (or generate, or chat) with --quantize bnb.nf4; I just couldn't continue, as 6 GB of VRAM on my GPU is obviously not enough. But when I do the same in Lightning Studio (with any of the GPUs: T1, L1 or A10G), it always causes that error from bitsandbytes.
config.json in both cases is the same
{ "architectures": [ "LlamaForCausalLM" ], "attention_bias": false, "attention_dropout": 0.0, "bos_token_id": 128000, "eos_token_id": 128001, "hidden_act": "silu", "hidden_size": 4096, "initializer_range": 0.02, "intermediate_size": 14336, "max_position_embeddings": 8192, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "num_key_value_heads": 8, "pretraining_tp": 1, "rms_norm_eps": 1e-05, "rope_scaling": null, "rope_theta": 500000.0, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.40.0.dev0", "use_cache": true, "vocab_size": 128256 }
SOLVED! The issue is the bitsandbytes version: 0.43.x causes the error. I downgraded back to 0.42.0 and everything works fine.
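For anyone hitting the same thing, the downgrade is a single pip command (assuming a pip-managed environment):
pip install bitsandbytes==0.42.0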
Glad to hear that solves it. I remember we pinned the bitsandbytes version due to some issues, but I don't recall exactly what they were.
https://github.com/Lightning-AI/litgpt/blob/9538d6a8194b6204601dea7eb10bc24c69678494/pyproject.toml#L36
I just don't understand how this version of bnb was installed if we pinned 0.42.0 in pyproject.toml.
Some people manually upgrade bnb after installing LitGPT. E.g., this also happened to @t-vi. I am currently working on a patch that raises a warning in this case.
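A minimal sketch of that kind of guard (purely illustrative, not the actual patch; the function name is made up, and the pinned version is the one from the pyproject.toml link above):

import warnings
from importlib.metadata import version

PINNED_BNB = "0.42.0"  # version pinned in LitGPT's pyproject.toml

def warn_on_unpinned_bitsandbytes() -> None:
    # Warn if the user manually upgraded bitsandbytes past the pinned release.
    installed = version("bitsandbytes")
    if installed != PINNED_BNB:
        warnings.warn(
            f"bitsandbytes {installed} is installed, but LitGPT pins {PINNED_BNB}; "
            "bnb.nf4 quantization may fail with other versions."
        )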