Unable to continue training from checkpoint.
I am trying to run some more training loops for a specific region, using this notebook.
I was not happy with the clustering results, so I wanted to run a few epochs only on my target area.
When I do so, with
!python trainer.py fit --trainer.max_epochs=100 \
--data.data_dir=data/chips \
--ckpt_path=data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
I get this error:
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
Total number of chips: 1102
/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:639: Checkpoint directory /home/brunosan/code/Clay/model/checkpoints exists and is not empty.
Restoring states from the checkpoint path at data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------
0 | model | CLAY | 127 M
-------------------------------
127 M Trainable params
0 Non-trainable params
127 M Total params
510.809 Total estimated model params size (MB)
Traceback (most recent call last):
File "/home/brunosan/code/Clay/model/trainer.py", line 77, in <module>
cli_main()
File "/home/brunosan/code/Clay/model/trainer.py", line 64, in cli_main
cli = LightningCLI(
^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 386, in __init__
self._run_subcommand(self.subcommand)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/cli.py", line 677, in _run_subcommand
fn(**fn_kwargs)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
call._call_and_handle_interrupt(
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
self._checkpoint_connector.restore_training_state()
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 296, in restore_training_state
self.restore_optimizers_and_schedulers()
File "/home/brunosan/miniforge3/envs/claymodel/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 362, in restore_optimizers_and_schedulers
raise KeyError(
KeyError: 'Trying to restore optimizer state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
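If I read the error right, the checkpoint only holds the model weights (it was presumably saved with ModelCheckpoint(save_weights_only=True)), so there is no optimizer/scheduler state for --ckpt_path to do a full resume from. One workaround I am considering is to load just the weights into the module and start a fresh fit instead of resuming. A minimal sketch below; the CLAYModule / ClayDataModule names and their arguments are placeholders for whatever trainer.py actually defines:

import torch
import lightning as L

# Placeholder imports: swap in the actual LightningModule / LightningDataModule
# classes defined in trainer.py.
from trainer import CLAYModule, ClayDataModule

# Load the weights-only checkpoint and restore just the state_dict.
ckpt = torch.load(
    "data/checkpoints/Clay_v0.1_epoch-24_val-loss-0.46.ckpt",
    map_location="cpu",
)
module = CLAYModule()  # pass the same hyperparameters used for pre-training
module.load_state_dict(ckpt["state_dict"])

# Start a fresh fit (new optimizer state) rather than resuming via --ckpt_path.
datamodule = ClayDataModule(data_dir="data/chips")
trainer = L.Trainer(max_epochs=100)
trainer.fit(model=module, datamodule=datamodule)

This would fine-tune from the released weights but reset the optimizer and epoch counter, which may be acceptable for a few epochs on the target region. Is that the intended way to continue training, or is there a checkpoint available that also includes the optimizer state?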