stanford_alpaca

Resuming from checkpoint

Open KurtFeynmanGodel opened this issue 2 years ago • 9 comments

My first run of the trainer could not save the model because the evaluate() call fails. I have removed that method call and now would like to resume from the last checkpoint. However, I cannot seem to get that working. Is there some disparity between the model architecture and the checkpoint architecture? The change I made to accommodate checkpoint resumption and the error I get are shown below.

**Change for checkpoint resumption**

```python
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
transformers.logging.set_verbosity_info()
# trainer.train()
trainer.train("output/checkpoint-18000")
# trainer.evaluate()
trainer.save_state()
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
```
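
(For reference, the Transformers `Trainer.train` also accepts the checkpoint as the keyword argument `resume_from_checkpoint`, either a path or `True` to pick the latest checkpoint in `output_dir`; a minimal sketch, assuming the same checkpoint directory as above:)

```python
# Minimal sketch: pass the checkpoint explicitly via the keyword argument.
trainer.train(resume_from_checkpoint="output/checkpoint-18000")
```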

**Error stack trace**

```
Loading model from output/checkpoint-18000/.
Traceback (most recent call last):
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 246, in <module>
    train()
  File "/home/ubuntu/alpaca/stanford_alpaca/train.py", line 239, in train
    trainer.train("output/checkpoint-18000/")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1617, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2120, in _load_from_checkpoint
    load_result = load_sharded_checkpoint(model, resume_from_checkpoint, strict=is_sagemaker_mp_enabled())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 385, in load_sharded_checkpoint
    state_dict = torch.load(os.path.join(folder, shard_file), map_location="cpu")
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
    result = unpickler.load()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 169, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 148, in _rebuild_tensor
    return t.set_(storage._untyped_storage, storage_offset, size, stride)
RuntimeError: Trying to resize storage that is not resizable
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122406 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122407 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 122409 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 122408) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

KurtFeynmanGodel avatar Mar 16 '23 14:03 KurtFeynmanGodel

"RuntimeError: Trying to resize storage that is not resizable" . I met same error . but do not know the solution

duongkstn avatar Mar 24 '23 04:03 duongkstn

Any luck on this issue? I'm facing the same problem.

bernaljg avatar Apr 10 '23 18:04 bernaljg

The same problem.

lw3259111 avatar Apr 11 '23 01:04 lw3259111

Same

DachengLi1 avatar Apr 11 '23 21:04 DachengLi1

I think the problem is with the PyTorch version. I reverted to torch 1.13.1, reran the code to save the model, and now it works.
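
If it helps, a quick way to confirm which versions are actually active in the environment before retrying (a minimal sketch; it only assumes torch and transformers are installed):

```python
# Minimal environment check before resuming: print the library versions
# that this thread identifies as relevant.
import torch
import transformers

print("torch:", torch.__version__)            # e.g. 1.13.1+cu116
print("cuda build:", torch.version.cuda)      # CUDA version the torch wheel was built against
print("transformers:", transformers.__version__)
```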

bernaljg avatar Apr 11 '23 21:04 bernaljg

@bernaljg Which versions of PyTorch, transformers, deepspeed, nvcc, and CUDA are you using? My PyTorch version is 1.13.1 with CUDA 11.7, transformers 4.28.0.dev0.

lw3259111 avatar Apr 12 '23 02:04 lw3259111

same problem

xiaolingzang avatar Apr 12 '23 14:04 xiaolingzang

> @bernaljg Which versions of PyTorch, transformers, deepspeed, nvcc, and CUDA are you using? My PyTorch version is 1.13.1 with CUDA 11.7, transformers 4.28.0.dev0.

torch -> 1.13.1+cu116, transformers -> 4.27.4

bernaljg avatar Apr 12 '23 14:04 bernaljg

https://github.com/lm-sys/FastChat/issues/351#issuecomment-1519060027

sahalshajim avatar Apr 23 '23 12:04 sahalshajim