pytorch-lightning
openwebtext_trainer.py crashes after 6k iters
Hi all, when running openwebtext_trainer.py with default settings, the program crashes after 6k steps with the following message:
(.venv) slkv@slkv-pc:~/Projects/project_brain$ python ./src/pretrain/openwebtext_trainer.py
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Seed set to 1337
{'model_name': 'pythia-70m', 'name': 'openwebtext', 'save_interval': 1000, 'eval_interval': 1000, 'eval_iters': 100, 'log_interval': 1, 'learning_rate': 0.0006, 'batch_size': 125, 'micro_batch_size': 5, 'gradient_accumulation_steps': 25, 'max_iters': 600000, 'weight_decay': 0.1, 'beta1': 0.9, 'beta2': 0.95, 'decay_lr': True, 'warmup_iters': 2000, 'lr_decay_iters': 600000, 'min_lr': 6e-05}
Loading model with {'name': 'pythia-70m', 'hf_config': {'org': 'EleutherAI', 'name': 'pythia-70m'}, 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 128, 'padded_vocab_size': 50304, 'n_layer': 6, 'n_head': 8, 'n_embd': 512, 'rotary_percentage': 0.25, 'parallel_residual': True, 'bias': True, 'lm_head_bias': False, 'n_query_groups': 8, 'shared_attention_norm': False, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'gelu_approximate': 'none', 'intermediate_size': 2048, 'rope_condense_ratio': 1, 'rope_base': 10000, 'head_size': 64, 'rope_n_elem': 16}
Time to instantiate model: 0.00 seconds.
/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py:186: .fit(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/slkv/Projects/project_brain/out/openwebtext exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
--------------------------------
0 | module | GPT | 70.4 M
--------------------------------
70.4 M Trainable params
0 Non-trainable params
70.4 M Total params
281.706 Total estimated model params size (MB)
Epoch 0: | | 6000/? [09:40<00:00, 10.34it/s, v_num=0, train_loss=4.810, val_loss=5.050]
Traceback (most recent call last):
File "/home/slkv/Projects/project_brain/./src/pretrain/openwebtext_trainer.py", line 234, in <module>
CLI(main)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/jsonargparse/_cli.py", line 96, in CLI
return _run_component(components, cfg_init)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/jsonargparse/_cli.py", line 181, in _run_component
return component(**cfg)
File "/home/slkv/Projects/project_brain/./src/pretrain/openwebtext_trainer.py", line 191, in main
trainer.fit(model, train_dataloader, val_dataloader, ckpt_path="last")
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 988, in _run
results = self._run_stage()
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1034, in _run_stage
self.fit_loop.run()
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 204, in run
self.advance()
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 360, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 139, in run
self.on_advance_end(data_fetcher)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 287, in on_advance_end
self.val_loop.run()
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 135, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 410, in _evaluation_step
call._call_callback_hooks(trainer, hook_name, output, *hook_kwargs.values())
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 208, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning_utilities/core/rank_zero.py", line 43, in wrapped_fn
return fn(*args, **kwargs)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/throughput_monitor.py", line 193, in on_validation_batch_end
self._update(trainer, pl_module, batch, iter_num)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/pytorch/callbacks/throughput_monitor.py", line 146, in _update
throughput.update(
File "/home/slkv/Projects/project_brain/.venv/lib/python3.10/site-packages/lightning/fabric/utilities/throughput.py", line 143, in update
raise ValueError(f"Expected lengths ({lengths}) to be greater or equal than samples ({samples})")
ValueError: Expected lengths (2048) to be greater or equal than samples (2505)
The OpenWebText data was generated using prepare_openwebtext.py. Do you know what causes the error?
pretrain/openwebtext.py, however, works fine.
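For reference, the check that fires is the cumulative-counter invariant in `Throughput.update`: `samples` and `lengths` are running totals, and since every sample contains at least one token, `lengths` should never fall behind `samples`. Here is a minimal sketch that reproduces just that check (assuming the keyword-only `update` signature in the installed lightning version; the numbers are copied from the traceback above):

```python
# Minimal sketch reproducing only the consistency check from the traceback.
# Assumes the keyword-only signature of Throughput.update in the installed
# lightning version; the numbers below are copied from the error message.
from lightning.fabric.utilities.throughput import Throughput

throughput = Throughput()

# `samples` and `lengths` are cumulative counts. Every sample contains at
# least one token, so lengths must always be >= samples. Here the cumulative
# sample count (2505) has outrun the cumulative token count (2048):
throughput.update(time=1.0, batches=1, samples=2505, lengths=2048)
# -> ValueError: Expected lengths (2048) to be greater or equal than samples (2505)
```

If that invariant is what is being violated, it would suggest the token-length counter stopped accumulating (or was reset) while the sample counter kept growing during validation, rather than anything being wrong with the prepared data; that is a guess at the mechanism, not a confirmed diagnosis.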
cc @carmocca
I'll take a look. Thanks for the report!
Has this been fixed in any way? I'm also encountering this bug. I've tried adding padding/truncation for my context length, but the furthest it will go is 8000 iterations.
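Until the underlying bug is resolved, one blunt workaround to try (an untested sketch; `SafeThroughputMonitor` is a hypothetical name, not part of lightning) is to make the failing consistency check non-fatal:

```python
# Untested workaround sketch: SafeThroughputMonitor is a hypothetical
# subclass, not part of lightning. It turns the ValueError raised by the
# throughput bookkeeping into a skipped measurement instead of a crash.
from lightning.pytorch.callbacks import ThroughputMonitor


class SafeThroughputMonitor(ThroughputMonitor):
    def _update(self, *args, **kwargs):
        try:
            super()._update(*args, **kwargs)
        except ValueError:
            # Throughput logging loses one data point; training continues.
            pass
```

Instantiate it with the same `batch_size_fn`/`length_fn` arguments the script passes to the stock `ThroughputMonitor` and hand it to `Trainer(callbacks=[...])` in its place. Since the callback only does throughput logging, simply removing it from the callbacks list should also unblock training.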
Transferring this to lightning, since this file no longer exists, but there's still an underlying bug.
May I ask whether you have addressed it? Or is there any quick fix for pretraining on OpenWebText?