QLoRA seems to be broken
Bug description
Either I'm doing something dumb or QLoRA is broken. I tried it with different models:
LoRA (fine)
gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/lora.yaml
{'access_token': None,
'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
'data': Alpaca2k(mask_prompt=False,
val_split_fraction=0.03847,
prompt_style=<litgpt.prompts.Alpaca object at 0x7fae9a9a2140>,
ignore_index=-100,
seed=42,
num_workers=4,
download_dir=PosixPath('data/alpaca2k')),
'devices': 1,
'eval': EvalArgs(interval=25,
max_new_tokens=100,
max_iters=100,
initial_validation=False,
final_validation=True),
'logger_name': 'csv',
'lora_alpha': 16,
'lora_dropout': 0.1,
'lora_head': True,
'lora_key': True,
'lora_mlp': True,
'lora_projection': True,
'lora_query': True,
'lora_r': 8,
'lora_value': True,
'num_nodes': 1,
'optimizer': {'class_path': 'torch.optim.AdamW',
'init_args': {'betas': [0.9, 0.95],
'lr': 0.0002,
'weight_decay': 0.0}},
'out_dir': PosixPath('out/finetune/lora-gemma-2b'),
'precision': 'bf16-true',
'quantize': None,
'seed': 1337,
'train': TrainArgs(save_interval=800,
log_interval=1,
global_batch_size=6,
micro_batch_size=2,
lr_warmup_steps=200,
lr_warmup_fraction=None,
epochs=2,
max_tokens=None,
max_steps=None,
max_seq_length=512,
tie_embeddings=None,
max_norm=None,
min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 11,870,208
Number of non-trainable parameters: 3,030,460,416
The longest sequence length in the train data is 512, the model's maximum sequence length is 512 and context length is 4096
Verifying settings ...
Missing logger folder: /teamspace/studios/this_studio/out/finetune/lora-gemma-2b/logs/csv
Epoch 1 | iter 1 step 0 | loss train: 115.482, val: n/a | iter time: 753.85 ms
Epoch 1 | iter 2 step 0 | loss train: 106.427, val: n/a | iter time: 381.31 ms
Epoch 1 | iter 3 step 1 | loss train: 101.139, val: n/a | iter time: 351.09 ms (step)
Epoch 1 | iter 4 step 1 | loss train: 95.109, val: n/a | iter time: 167.29 ms
Epoch 1 | iter 5 step 1 | loss train: 98.440, val: n/a | iter time: 121.49 ms
Epoch 1 | iter 6 step 2 | loss train: 104.927, val: n/a | iter time: 182.25 ms (step)
QLoRA from config file (not fine)
gemma_2 ~/litgpt litgpt finetune_lora --devices 1 --config config_hub/finetune/gemma-2b/qlora.yaml
{'access_token': None,
'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
'data': Alpaca2k(mask_prompt=False,
val_split_fraction=0.03847,
prompt_style=<litgpt.prompts.Alpaca object at 0x7f4ae444efb0>,
ignore_index=-100,
seed=42,
num_workers=4,
download_dir=PosixPath('data/alpaca2k')),
'devices': 1,
'eval': EvalArgs(interval=25,
max_new_tokens=100,
max_iters=100,
initial_validation=False,
final_validation=True),
'logger_name': 'csv',
'lora_alpha': 16,
'lora_dropout': 0.1,
'lora_head': True,
'lora_key': True,
'lora_mlp': True,
'lora_projection': True,
'lora_query': True,
'lora_r': 16,
'lora_value': True,
'num_nodes': 1,
'optimizer': {'class_path': 'torch.optim.AdamW',
'init_args': {'betas': [0.9, 0.95],
'lr': 0.0002,
'weight_decay': 0.0}},
'out_dir': PosixPath('out/finetune/qlora-gemma-2b'),
'precision': 'bf16-true',
'quantize': 'bnb.nf4',
'seed': 1337,
'train': TrainArgs(save_interval=800,
log_interval=1,
global_batch_size=6,
micro_batch_size=2,
lr_warmup_steps=200,
lr_warmup_fraction=None,
epochs=2,
max_tokens=None,
max_steps=None,
max_seq_length=512,
tie_embeddings=None,
max_norm=None,
min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 23,740,416
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
CLI(parser_data)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
return _run_component(component, init.get(subcommand))
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
return component(**cfg)
File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
return to_run(*args, **kwargs)
File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
load_checkpoint(fabric, model, checkpoint_path, strict=False)
File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
model.load_state_dict(state_dict, strict=strict)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
load(self, state_dict)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
module._load_from_state_dict(
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
return self.hook(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
quantize_fn(weight)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
if weight.data.dtype == torch.uint8:
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'
QLoRA without config file
gemma_2 ~/litgpt litgpt finetune_lora checkpoints/google/gemma-2b --devices 1 --quantize bnb.nf4 --precision bf16-true
{'access_token': None,
'checkpoint_dir': PosixPath('checkpoints/google/gemma-2b'),
'data': None,
'devices': 1,
'eval': EvalArgs(interval=100,
max_new_tokens=100,
max_iters=100,
initial_validation=False,
final_validation=True),
'logger_name': 'csv',
'lora_alpha': 16,
'lora_dropout': 0.05,
'lora_head': False,
'lora_key': False,
'lora_mlp': False,
'lora_projection': False,
'lora_query': True,
'lora_r': 8,
'lora_value': True,
'num_nodes': 1,
'optimizer': 'AdamW',
'out_dir': PosixPath('out/finetune/lora'),
'precision': 'bf16-true',
'quantize': 'bnb.nf4',
'seed': 1337,
'train': TrainArgs(save_interval=1000,
log_interval=1,
global_batch_size=16,
micro_batch_size=1,
lr_warmup_steps=100,
lr_warmup_fraction=None,
epochs=5,
max_tokens=None,
max_steps=None,
max_seq_length=None,
tie_embeddings=None,
max_norm=None,
min_lr=6e-05)}
Seed set to 1337
Number of trainable parameters: 921,600
Number of non-trainable parameters: 3,030,460,416
Traceback (most recent call last):
File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/teamspace/studios/this_studio/litgpt/litgpt/__main__.py", line 71, in main
CLI(parser_data)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
return _run_component(component, init.get(subcommand))
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
return component(**cfg)
File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 169, in setup
fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 845, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 931, in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 936, in _wrap_with_setup
return to_run(*args, **kwargs)
File "/teamspace/studios/this_studio/litgpt/litgpt/finetune/lora.py", line 215, in main
load_checkpoint(fabric, model, checkpoint_path, strict=False)
File "/teamspace/studios/this_studio/litgpt/litgpt/utils.py", line 362, in load_checkpoint
model.load_state_dict(state_dict, strict=strict)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 168, in load_state_dict
return self._original_module.load_state_dict(state_dict=state_dict, strict=strict, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
load(self, state_dict)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2127, in load
load(child, child_state_dict, child_prefix)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2121, in load
module._load_from_state_dict(
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 72, in __call__
return self.hook(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 166, in _quantize_on_load_hook
quantize_fn(weight)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/plugins/precision/bitsandbytes.py", line 320, in quantize_
if weight.data.dtype == torch.uint8:
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/utilities/load.py", line 166, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: '_NotYetLoadedTensor' object has no attribute 'data'
What operating system are you using?
Unknown
LitGPT Version
litgpt 0.4.5 (Gemma 2 branch)
Not related to the Gemma 2 branch; it also occurs on main.
It doesn't seem to be related to the bitsandbytes or Lightning Fabric versions (the issue also occurs with bnb 0.41.3 and lightning 0.2.2). Maybe something in LitGPT has changed.
It's not only QLoRA. I tried simply running generate/chat in a new studio with a fresh venv, code from master, and the pythia-1b model. The same error occurs whenever quantization is applied.
I'm not sure what has changed that could be causing this; we have bitsandbytes and lightning/fabric pinned.
It's caused by PyTorch-Lightning. Try:
pip install lightning==2.3.0.dev20240428
which is the version the repo used before.
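For context, the failing path in the traceback boils down to a load-time pre-hook reading weight.data while the state dict still contains lazy _NotYetLoadedTensor proxies. The sketch below is a simplified, hypothetical stand-in for the Lightning/bitsandbytes code, not the actual implementation:

import torch

class LazyTensorProxy:
    # Stand-in for lightning.fabric.utilities.load._NotYetLoadedTensor,
    # which does not expose .data and raises AttributeError when it is accessed.
    def __getattr__(self, name):
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

def quantize_on_load(weight):
    # Mirrors the dtype check in the bitsandbytes precision plugin's
    # _quantize_on_load_hook that triggers the error above.
    if weight.data.dtype == torch.uint8:
        return weight  # already quantized
    return weight.to(torch.uint8)  # placeholder for the real NF4 quantization

quantize_on_load(torch.zeros(2, 2))   # fine: a real tensor exposes .data
quantize_on_load(LazyTensorProxy())   # AttributeError, as in the traceback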
This kind of issue needs to be caught by tests.
Ohhh, so basically #1579. We can revert to an older version, but the question is whether there's something that needs to be updated in PyTorch-Lightning (in case this was an accidental change) or LitGPT (so that we can support newer PTL versions moving forward). Would appreciate your thoughts here @awaelchli
I added a quick PR (#1605) to add a test and revert the lightning version until we have more time to investigate.
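For what it's worth, a rough sketch of a regression test along those lines (hypothetical, not the test added in the PR) could shell out to the failing command and check that checkpoint loading no longer trips over lazy tensors. It assumes a small checkpoint such as pythia-1b is already downloaded; the flags mirror the command from the report above:

import subprocess

def test_qlora_checkpoint_load_smoke(tmp_path):
    cmd = [
        "litgpt", "finetune_lora", "checkpoints/EleutherAI/pythia-1b",  # placeholder model
        "--devices", "1",
        "--quantize", "bnb.nf4",
        "--precision", "bf16-true",
        "--train.max_steps", "1",
        "--out_dir", str(tmp_path),
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    assert "_NotYetLoadedTensor" not in proc.stderr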
It's not really fixed. Downgrading the version avoids the problem, but isn't it conceivable that at some point LitGPT will want to support newer versions of Lightning? What happens then?
I think in situations like this we should at least open a ticket on the library in question (Lightning in this case). The stack trace also hints at bitsandbytes being involved, so we'd need to collect the bnb version used as well. These are all essential steps that would help us resolve such issues efficiently.
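To make that easier in future reports, here is a minimal sketch (a hypothetical helper, not part of LitGPT) that prints the installed versions of the relevant packages via importlib.metadata:

from importlib.metadata import version

# Distribution names as published on PyPI
for pkg in ("torch", "lightning", "bitsandbytes", "litgpt"):
    print(pkg, version(pkg))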
Yes, I just realized this too and reopened the issue a few seconds before you posted. Let me prepare an issue for the PyTorch Lightning issue tracker.
See issue: https://github.com/Lightning-AI/pytorch-lightning/issues/20119
With the fix in https://github.com/Lightning-AI/pytorch-lightning/pull/20121, you can try updating the lightning package to the nightly produced next Sunday, or wait until the next regular release is out.
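Assuming the weekly dev builds keep being published to PyPI like the 2.3.0.dev20240428 version pinned above, pulling the latest pre-release should be enough once it's up (the exact version string isn't known yet):
pip install --upgrade --pre lightning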
Sounds great, thanks. I will make a reminder to test this on Sunday/Monday!