
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.

ToDestiny opened this issue 2 years ago

I've been trying for some time now and always run into this error. Everything prior to this step worked. What am I doing wrong? RTX 3090 (24 GB), Windows 10, but running Ubuntu via WSL. Maybe that's the problem, but I don't want to install Ubuntu on a new partition.

```
python3 finetune/adapter_v2.py --data_dir data/alpaca --checkpoint_dir checkpoints/tiiuae/falcon-7b --out_dir out/adapter/alpaca
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
Global seed set to 1337
Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
Number of trainable parameters: 3839186
/usr/local/lib/python3.10/dist-packages/lightning/fabric/fabric.py:828: PossibleUserWarning: The model passed to Fabric.setup() has parameters on different devices. Since move_to_device=True, all parameters will be moved to the new device. If this is not desired, set Fabric.setup(..., move_to_device=False).
  rank_zero_warn(
iter 0: loss 2.7154, time: 2929.28ms
Traceback (most recent call last):
  File "/root/lit-parrot/finetune/adapter_v2.py", line 254, in <module>
    CLI(main)
  File "/usr/local/lib/python3.10/dist-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/usr/local/lib/python3.10/dist-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/root/lit-parrot/finetune/adapter_v2.py", line 90, in main
    train(fabric, model, optimizer, train_data, val_data, checkpoint_dir, out_dir)
  File "/root/lit-parrot/finetune/adapter_v2.py", line 126, in train
    logits = model(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/wrappers.py", line 115, in forward
    output = self._forward_module(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lit-parrot/lit_parrot/adapter.py", line 95, in forward
    x, *_ = block(x, (cos, sin), mask, max_seq_length)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lit-parrot/lit_parrot/adapter.py", line 140, in forward
    h, new_kv_cache, new_adapter_kv_cache = self.attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/lit-parrot/lit_parrot/adapter.py", line 241, in forward
    y = y + self.gating_factor * ay
RuntimeError: handle_0 INTERNAL ASSERT FAILED at "../c10/cuda/driver_api.cpp":15, please report a bug to PyTorch.
```

ToDestiny · Jun 13 '23

Do you get errors with other checkpoints?

If you have enough system RAM, you could try running one step on CPU. Even though it's very slow, it usually gives a better error message than when run on CUDA.
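For example, a minimal sketch of what that could look like, assuming you temporarily edit the Fabric construction in `finetune/adapter_v2.py` (the script's exact arguments may differ in your version):

```python
import lightning as L

# Sketch only: swap the script's Fabric setup for a CPU one so a single
# training step runs on the CPU, where errors are usually more readable.
fabric = L.Fabric(accelerator="cpu", devices=1, precision="32-true")
fabric.launch()
```

Once you have the clearer error message (or confirm the step passes on CPU), switch back to CUDA.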

That is, if there's an actual bug. It might just be an issue with your driver installation. Have you followed the installation steps described in the README?
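As a quick sanity check that PyTorch can see the GPU under WSL at all (nothing here is specific to this repo):

```python
import torch

# Plain PyTorch environment checks; prints the versions and visible device.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA runtime built against:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```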

carmocca · Jun 14 '23

Hi, I'm hitting the same problem. I'm using a 3070 with Ubuntu 22.04 on WSL2. I think it might be a WSL2 bug, because I'm running a different model (YOLOv7) and see the same error.
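Not a fix, but since the assert points at `c10/cuda/driver_api.cpp` (which, as far as I can tell, loads the CUDA driver library at runtime), one way to check whether `libcuda.so.1` even resolves inside the WSL2 environment is a small probe like the diagnostic sketch below:

```python
import ctypes

# Diagnostic sketch: try to load the CUDA driver library the same way a
# runtime dlopen would, and report whether the cuInit symbol is present.
try:
    lib = ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loaded; cuInit present:", hasattr(lib, "cuInit"))
except OSError as exc:
    print("could not load libcuda.so.1:", exc)
```

On WSL2 the driver library typically lives under /usr/lib/wsl/lib, so if the load fails, that directory and the Windows-side NVIDIA driver version are worth checking.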

Shadow-Alex · Jun 18 '23

Same error here with a 3070 on Ubuntu 22.04 under WSL2.

apekris · Oct 09 '23