Can't resume from checkpoint for multi-node fine-tuning
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Resuming from a checkpoint should work for multi-node fine-tuning: with `resume_from_checkpoint` set, training should continue on both nodes instead of failing on the worker.
Current behaviour
To resume the previous fine-tuning I reuse the same configuration on both nodes; the only change is adding the checkpoint path to fine-tune-config.yaml:

```yaml
resume_from_checkpoint: test/model/checkpoint-45
```
I get this error on the worker:
```
[2023-11-21 14:25:52,690] [INFO] [axolotl.train.train:54] [PID:65] [RANK:0] loading model and (optionally) peft_config...
[2023-11-21 14:26:03,872] [INFO] [axolotl.load_model:410] [PID:65] [RANK:0] GPU memory usage after model load: 1.967GB (+0.105GB cache, +0.610GB misc)
[2023-11-21 14:26:03,876] [INFO] [axolotl.load_model:427] [PID:65] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-11-21 14:26:03,880] [INFO] [axolotl.load_model:438] [PID:65] [RANK:0] converting modules to torch.float16 for flash attention
[2023-11-21 14:26:03,883] [INFO] [axolotl.load_lora:547] [PID:65] [RANK:0] found linear modules: ['o_proj', 'down_proj', 'k_proj', 'up_proj', 'v_proj', 'gate_proj', 'q_proj']
trainable params: 50,851,840 || all params: 3,477,325,440 || trainable%: 1.4623836876194136
[2023-11-21 14:26:04,610] [INFO] [axolotl.load_model:474] [PID:65] [RANK:0] GPU memory usage after adapters: 2.178GB (+0.771GB cache, +0.610GB misc)
[2023-11-21 14:26:05,014] [INFO] [axolotl.train.train:82] [PID:65] [RANK:0] Pre-saving adapter config to test/model
[2023-11-21 14:26:05,017] [INFO] [axolotl.train.train:106] [PID:65] [RANK:0] Starting trainer...
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Traceback (most recent call last):
  File "/axolotl/scripts/finetune.py", line 54, in <module>
    fire.Fire(do_cli)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/tools/axolotl/scripts/finetune.py", line 50, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/tools/axolotl/src/axolotl/train.py", line 116, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2064, in _load_from_checkpoint
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at test/model/checkpoint-45
```
and this error on the master:
```
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
```
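For context, the `ValueError` on the worker is raised from transformers' `Trainer._load_from_checkpoint` (see the traceback), which rejects a checkpoint directory that contains no model or adapter weight files. The sketch below is only an illustration of that kind of check, not the actual transformers code; `looks_like_valid_checkpoint` is a hypothetical helper and the file names are just the usual defaults. It shows why `test/model/checkpoint-45` on the worker, which only holds `rng_state_1.pth`, is rejected:

```python
import os

# Illustration only: before resuming, the trainer looks for model weights or PEFT
# adapter weights inside the checkpoint directory. A directory that only holds
# rng_state_1.pth matches none of these, hence "Can't find a valid checkpoint at ...".
CANDIDATE_WEIGHT_FILES = [
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
    "model.safetensors",
    "adapter_model.bin",
    "adapter_model.safetensors",
]

def looks_like_valid_checkpoint(checkpoint_dir: str) -> bool:
    return any(
        os.path.isfile(os.path.join(checkpoint_dir, name))
        for name in CANDIDATE_WEIGHT_FILES
    )

print(looks_like_valid_checkpoint("test/model/checkpoint-45"))
# master (node-1): True  -- adapter_model.bin is present
# worker (node-2): False -- only rng_state_1.pth exists
```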
Steps to reproduce
I have two nodes and want to fine-tune across both of them:

- node-1: master (1 GPU)
- node-2: worker (1 GPU)
- NOTE: there is no shared storage between the nodes

Plain multi-node fine-tuning works perfectly; the problem only appears when resuming from a checkpoint, because the worker cannot find a valid checkpoint at the selected path.

I first run fine-tuning successfully on both nodes, using the accelerate configs and the fine-tune-config.yaml shown in the Config yaml section below (the initial run does not set `resume_from_checkpoint`).
On the master (node-1) every checkpoint is saved with the model inside it, but on the worker (node-2) the checkpoints are empty:
- Master node-1
```
node-1:~# ls test/model/
README.md  adapter_config.json  adapter_model.bin  checkpoint-45  checkpoint-50  checkpoint-55  checkpoint-60  config.json  special_tokens_map.json  tokenizer.model  tokenizer_config.json
node-1:~# ls test/model/checkpoint-45/
README.md  adapter_config.json  adapter_model.bin  optimizer.pt  rng_state_0.pth  scheduler.pt  trainer_state.json  training_args.bin
```
- Worker node-2
```
node-2:~# ls test/model/
README.md            adapter_model.bin  checkpoint-15  checkpoint-25  checkpoint-35  checkpoint-45  checkpoint-50  checkpoint-60  special_tokens_map.json  tokenizer_config.json
adapter_config.json  checkpoint-10      checkpoint-20  checkpoint-30  checkpoint-40  checkpoint-5   checkpoint-55  config.json    tokenizer.model
```

Inside a checkpoint there is only the RNG state file:

```
node-2:~# ls test/model/checkpoint-45/
rng_state_1.pth
```
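This pattern (full checkpoint on the master, a lone `rng_state_1.pth` on the worker) is what later breaks the resume. A small, hypothetical pre-resume check, not part of axolotl, that can be run on each node to confirm whether a checkpoint directory is actually resumable there:

```python
import os
import sys

# Hypothetical helper: list a checkpoint directory and report whether it holds the
# pieces a resume needs (trainer state plus model or adapter weights), rather than
# just a per-rank RNG state file.
checkpoint_dir = sys.argv[1] if len(sys.argv) > 1 else "test/model/checkpoint-45"
files = sorted(os.listdir(checkpoint_dir)) if os.path.isdir(checkpoint_dir) else []

has_weights = any(
    name.startswith(("adapter_model", "pytorch_model", "model.safetensors"))
    for name in files
)
has_state = "trainer_state.json" in files

print(f"{checkpoint_dir}: {files}")
print("resumable on this node:", has_weights and has_state)
```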
Resume from checkpoint on multi-node

To resume, I launch again with the same configuration on both nodes; the only change is adding the checkpoint path to fine-tune-config.yaml:

```yaml
resume_from_checkpoint: test/model/checkpoint-45
```

Launching with this config produces the errors shown under Current behaviour above.
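Since there is no shared storage, `test/model/checkpoint-45` on the worker never receives the full checkpoint. One untested workaround sketch, assuming node-2 can reach node-1 over SSH (with `MASTER-IP` as the placeholder from the accelerate configs), would be to copy the master's checkpoint to the same path on the worker before resuming:

```python
import subprocess

# Untested sketch: pull the complete checkpoint from the master node so the worker's
# output_dir contains more than its own rng_state_1.pth before resuming.
# Assumes passwordless SSH from node-2 to node-1; user and MASTER-IP are placeholders.
CHECKPOINT = "test/model/checkpoint-45"

subprocess.run(
    ["rsync", "-a", f"root@MASTER-IP:{CHECKPOINT}/", f"{CHECKPOINT}/"],
    check=True,
)
```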
Config yaml
node-1: accelerate-config.yaml

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

node-2: accelerate-config.yaml (identical except for machine_rank)

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
fine-tune-config.yaml

```yaml
base_model: openlm-research/open_llama_3b_v2
base_model_config: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: test/data.json
    type: sharegpt
dataset_prepared_path: test/prepared-dataset
val_set_size: 0.02
output_dir: test/model

adapter: qlora
lora_model_dir:

sequence_len: 128
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 60
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoint: true
resume_from_checkpoint: test/model/checkpoint-45
local_rank:
logging_steps: 1
xformers_attention:
flash_attention:

warmup_steps: 2
eval_steps: 10
eval_table_size:
save_steps: 5
debug:
deepspeed:
weight_decay: 0.0
fsdp: null
fsdp_config: null
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: null
```
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
python3.10
axolotl branch-commit
a045db02146751548fec57a5d3f31382ce4e5959
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Please update your axolotl version, as this was fixed after the commit that you are using; #795 fixed this.
The new version still has this issue.
Marking as stale. Please report if this issue still persists.