llm2vec

Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json)

Open Mem2019 opened this issue 10 months ago • 0 comments

The problem occurs during MNTP fine-tuning (i.e., when running run_mntp.py).

When resuming training from the checkpoint directory (i.e., by setting overwrite_output_dir to true), the following error occurs:

Traceback (most recent call last):
  File "/N/scratch/user/llm2vec/experiments/run_mntp.py", line 1032, in <module>
    main()
  File "/N/scratch/user/llm2vec/experiments/run_mntp.py", line 980, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/N/u/user/Quartz/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 1910, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/N/u/user/Quartz/miniconda3/lib/python3.10/site-packages/transformers/trainer.py", line 2625, in _load_from_checkpoint
    load_result = load_sharded_checkpoint(
  File "/N/u/user/Quartz/miniconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 491, in load_sharded_checkpoint
    raise ValueError(f"Can't find a checkpoint index ({' or '.join(filenames)}) in {folder}.")
ValueError: Can't find a checkpoint index (pytorch_model.bin.index.json or model.safetensors.index.json) in output/mntp/XXX/checkpoint-230000.

The error complains that the index file was not found. I have found a related issue, but no solution has been provided.
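
For anyone hitting the same error, a quick way to narrow it down is to check which weight files the checkpoint directory actually contains. The sketch below is only a diagnostic under assumptions: the filename list is the standard set used by transformers and peft, `inspect_checkpoint` is a hypothetical helper name, and the issue does not say what is inside checkpoint-230000. If only the adapter_model.* files turn out to be present, that would suggest the checkpoint holds PEFT adapter weights rather than full model weights, which might be why the Trainer falls through to the sharded loader. That is a guess, not a confirmed cause.

```python
import os
import sys

# Standard weight and index filenames written by transformers and peft.
# Assumption: one of these is what the Trainer looks for when resuming.
CANDIDATE_FILES = [
    "pytorch_model.bin",
    "model.safetensors",
    "pytorch_model.bin.index.json",
    "model.safetensors.index.json",
    "adapter_model.bin",
    "adapter_model.safetensors",
]


def inspect_checkpoint(folder):
    """Return {filename: exists} for each candidate file in `folder`."""
    return {
        name: os.path.isfile(os.path.join(folder, name))
        for name in CANDIDATE_FILES
    }


if __name__ == "__main__":
    # Pass the checkpoint directory as the first argument,
    # e.g. the checkpoint-230000 folder from the traceback.
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, present in inspect_checkpoint(folder).items():
        status = "found  " if present else "missing"
        print(f"{status}  {name}")
```

Running this against the checkpoint folder prints one line per candidate file, so it is easy to see at a glance whether the directory has full weights, sharded weights with an index, or adapter weights only.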

Thanks.

Mem2019, Feb 20 '25 18:02