My first run of the trainer could not save the model because the evaluate() call failed, so I removed that call and now want to resume from the last checkpoint. However, I cannot get resumption to work. Is there some mismatch between the model architecture and the checkpoint architecture? The change I made to resume from the checkpoint and the error I get are shown below.
**Change for checkpoint resumption**
```python
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
transformers.logging.set_verbosity_info()
# trainer.train()
trainer.train("output/checkpoint-18000")
# trainer.evaluate()
trainer.save_state()
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
```
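Since the failure happens while deserializing the checkpoint shards, I also sanity-checked the contents of the checkpoint directory before calling train(). This is only a minimal sketch; the expected file names (`trainer_state.json`, `pytorch_model.bin`, `pytorch_model.bin.index.json`) are assumptions based on what the HF Trainer typically writes when checkpointing:

```python
import json
from pathlib import Path


def check_checkpoint_dir(path):
    """Return a list of problems found in a (assumed) HF Trainer checkpoint dir."""
    ckpt = Path(path)
    problems = []
    if not ckpt.is_dir():
        return [f"{ckpt} is not a directory"]
    # trainer_state.json is written alongside the weights during checkpointing
    if not (ckpt / "trainer_state.json").exists():
        problems.append("missing trainer_state.json")
    # weights are either a single file or an index file plus shards
    index = ckpt / "pytorch_model.bin.index.json"
    if index.exists():
        shards = set(json.loads(index.read_text())["weight_map"].values())
        for shard in sorted(shards):
            if not (ckpt / shard).exists():
                problems.append(f"missing shard {shard}")
    elif not (ckpt / "pytorch_model.bin").exists():
        problems.append("no pytorch_model.bin or shard index found")
    return problems
```

In my case the directory looked complete, so the files exist but one of the shards apparently cannot be deserialized.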
**Error stacktrace**
```
Loading model from output/checkpoint-18000/.
Traceback (most recent call last):
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 246, in <module>
    train()
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 239, in train
    trainer.train("output/checkpoint-18000/")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1617, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2120, in _load_from_checkpoint
    load_result = load_sharded_checkpoint(model, resume_from_checkpoint, strict=is_sagemaker_mp_enabled())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 385, in load_sharded_checkpoint
    state_dict = torch.load(os.path.join(folder, shard_file), map_location="cpu")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 169, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 148, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122406 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122407 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122409 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 122408) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```