
IndexError: tuple index out of range

Open zyushun opened this issue 9 months ago • 0 comments

Hi Jiawei,

I am trying GaLore on TinyLlama-1B using the codebase https://github.com/jzhang38/TinyLlama on 4x A800-80GB GPUs. I encountered the following error:

[rank1]:     optimizer.step()
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/wrappers.py", line 74, in step
[rank1]:     output = self._strategy.optimizer_step(
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/strategies/strategy.py", line 207, in optimizer_step
[rank1]:     return self.precision.optimizer_step(optimizer, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/fsdp.py", line 142, in optimizer_step
[rank1]:     return super().optimizer_step(optimizer, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/lightning/fabric/plugins/precision/precision.py", line 124, in optimizer_step
[rank1]:     return optimizer.step(**kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank1]:     out = func(*args, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/adamw.py", line 96, in step
[rank1]:     grad = state["projector"].project(grad, state["step"])
[rank1]:   File "/mntcephfs/lab_data/zhangyushun/anaconda/tinyllama/lib/python3.10/site-packages/galore_torch/galore_projector.py", line 15, in project
[rank1]:     if full_rank_grad.shape[0] >= full_rank_grad.shape[1]:
[rank1]: IndexError: tuple index out of range
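
For reference, the failing comparison can be reproduced without GaLore or PyTorch. The sketch below applies the same `shape[0] >= shape[1]` check from the traceback to plain shape tuples; `compare_dims` and the sizes are hypothetical and only illustrate that the check requires at least two dimensions.

```python
def compare_dims(shape):
    # Same comparison as the line shown in the traceback
    # (galore_projector.py, line 15), applied to a plain shape tuple.
    if shape[0] >= shape[1]:
        return "dim0 >= dim1"
    return "dim0 < dim1"


# A 2-D shape has two entries, so the comparison works.
print(compare_dims((2048, 5632)))

# A 1-D shape has a single entry, so shape[1] does not exist.
try:
    compare_dims((2048 * 5632,))
except IndexError as err:
    print("IndexError:", err)
```

So the gradient that reaches `project` appears to have fewer than two dimensions, even though the weights of `nn.Linear` modules are normally 2-D.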

I set up GaLore as suggested in torchrun_main.py:

print('using galore')
galore_params = []
target_modules_list = [ "attn", "mlp"]
for module_name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue

    if not any(target_key in module_name for target_key in target_modules_list):
        continue
    
    print('enable GaLore for weights in module: ', module_name)
    galore_params.append(module.weight)

id_galore_params = [id(p) for p in galore_params]

# put parameters without "rank" into another group
regular_params = [p for p in model.parameters() if id(p) not in id_galore_params]
# then call galore_adamw
param_groups = [{'params': regular_params}, 
                {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'}]
    
optimizer = GaLoreAdamW(param_groups, lr=learning_rate)
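
One way to narrow this down is to check the dimensionality of the parameters at the point where the groups are built. The traceback passes through Lightning Fabric's FSDP precision plugin, and FSDP is known to flatten parameters, so it is an assumption (not confirmed here) that the weights are already 1-D when they are collected. The helper below is a hypothetical diagnostic sketch, not part of GaLore: it only relies on each parameter exposing `.ndim`, as `torch.Tensor` does, and uses a small stand-in class so it runs on its own.

```python
def split_by_ndim(named_params):
    """Separate 2-D parameters from everything else.

    `named_params` is any iterable of (name, param) pairs where
    `param.ndim` gives the number of dimensions.
    """
    two_d, other = [], []
    for name, p in named_params:
        (two_d if p.ndim == 2 else other).append((name, p))
    return two_d, other


class FakeParam:
    # Stand-in for a tensor, used only for this demonstration.
    def __init__(self, *shape):
        self.shape = shape
        self.ndim = len(shape)


params = [
    ("attn.proj.weight", FakeParam(64, 64)),  # regular 2-D weight
    ("mlp.fc.weight", FakeParam(64 * 256)),   # flattened to 1-D
]
two_d, other = split_by_ndim(params)
print("2-D:", [name for name, _ in two_d])
print("not 2-D:", [name for name, _ in other])
```

In the setup above, the same check could be applied to the `(module_name, module.weight)` pairs before appending to `galore_params`; any weight reported as not 2-D would trigger the IndexError in `project`.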

Any idea why this happens and how to fix it?

Thanks in advance!

zyushun, May 13 '24 03:05