stanford_alpaca

Resuming from checkpoint

KurtFeynmanGodel opened this issue 1 year ago · 9 comments

My first run of the trainer could not save the model because the evaluate() call fails. I have removed that call and would now like to resume from the last checkpoint, but I cannot get that to work. Is there some mismatch between the model architecture and the checkpoint architecture? The change I made to resume from the checkpoint and the error I get are shown below.

**Change for checkpoint resumption**

```python
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)

transformers.logging.set_verbosity_info()

# trainer.train()
trainer.train("output/checkpoint-18000")  # resume from the last saved checkpoint
# trainer.evaluate()  # removed: this call fails on my setup

trainer.save_state()
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
```
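
For reference, this is the keyword form of the same call, a minimal sketch based on the `Trainer.train` signature in recent transformers releases (the checkpoint path is just my local example):

```python
# Pass the checkpoint explicitly via the resume_from_checkpoint keyword
# instead of a positional argument.
trainer.train(resume_from_checkpoint="output/checkpoint-18000")

# Or let the Trainer pick up the most recent checkpoint under training_args.output_dir.
trainer.train(resume_from_checkpoint=True)
```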

**Error stack trace**

```
Loading model from output/checkpoint-18000/.
Traceback (most recent call last):
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 246, in <module>
    train()
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 239, in train
    trainer.train("output/checkpoint-18000/")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1617, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2120, in _load_from_checkpoint
    load_result = load_sharded_checkpoint(model, resume_from_checkpoint, strict=is_sagemaker_mp_enabled())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 385, in load_sharded_checkpoint
    state_dict = torch.load(os.path.join(folder, shard_file), map_location="cpu")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 169, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 148, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122406 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122407 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122409 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 122408) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
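
Since the failure happens inside `torch.load` on a checkpoint shard, one sanity check (a sketch, not something from the run above) is to load each shard directly on CPU outside the Trainer to rule out a corrupted shard file; the `pytorch_model-*.bin` naming below is assumed and should match whatever shard files actually exist in the checkpoint directory:

```python
import os
import torch

# Sanity-check sketch: try to load each sharded state-dict file on CPU.
folder = "output/checkpoint-18000"
for name in sorted(os.listdir(folder)):
    if name.startswith("pytorch_model") and name.endswith(".bin"):
        state_dict = torch.load(os.path.join(folder, name), map_location="cpu")
        print(name, "loaded", len(state_dict), "tensors")
```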

KurtFeynmanGodel · Mar 16 '23 14:03