llm2vec
Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json)
The problem occurs during MNTP fine-tuning (i.e., when running run_mntp.py).
When resuming training from the checkpoint directory (i.e., by setting overwrite_output_dir to true), the following error occurs:
Traceback (most recent call last):
  File "/N/scratch/user/llm2vec/experiments/run_mntp.py", line 1032, in <module>
    main()
  File "/N/scratch/user/llm2vec/experiments/run_mntp.py", line 980, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/N/u/user/Quartz/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1910, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/N/u/user/Quartz/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2625, in _load_from_checkpoint
    load_result = load_sharded_checkpoint(
  File "/N/u/user/Quartz/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 491, in load_sharded_checkpoint
    raise ValueError(f"Can't find a checkpoint index ({' or '.join(filenames)}) in {folder}.")
ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in output/mntp/XXX/checkpoint-230000.
The error complains that the index file was not found. I have found a related issue, but no solution has been provided.
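To help narrow this down, here is a minimal sketch for checking which weight files the checkpoint folder actually contains. The file names in CANDIDATE_FILES are assumptions based on the usual transformers and peft naming conventions (the two index files are the ones named in the error message), and report_checkpoint_files is a hypothetical helper, not part of llm2vec or transformers.

```python
import os
import sys

# Assumed file names, following common transformers/peft conventions.
# The two *.index.json names are the ones listed in the error message.
CANDIDATE_FILES = [
    "pytorch_model.bin",
    "model.safetensors",
    "pytorch_model.bin.index.json",
    "model.safetensors.index.json",
    "adapter_model.bin",
    "adapter_model.safetensors",
]


def report_checkpoint_files(folder):
    """Map each candidate file name to whether it exists in `folder`."""
    return {
        name: os.path.isfile(os.path.join(folder, name))
        for name in CANDIDATE_FILES
    }


if __name__ == "__main__":
    # Pass the checkpoint folder as the first argument; defaults to the current directory.
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, present in report_checkpoint_files(folder).items():
        status = "found  " if present else "missing"
        print(f"{status}  {name}")
```

Running it against the checkpoint folder from the error (for example, python check_ckpt.py output/mntp/XXX/checkpoint-230000, where check_ckpt.py is whatever name the script is saved under) should show whether the folder holds full model weights, only adapter weights, or neither. That would at least indicate whether the index file is genuinely missing or was never expected to be written.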
Thanks.