Can't resume from checkpoint for multi-node fine-tuning
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Resuming from a checkpoint should work for multi-node fine-tuning: with `resume_from_checkpoint` set, training should continue on both nodes instead of failing on the worker.
Current behaviour
To resume the previous fine-tuning I reuse the same configuration on both nodes; the only change is adding the checkpoint path to fine-tune-config.yaml:

```yaml
resume_from_checkpoint: test/model/checkpoint-45
```
I get this error on the worker:
```
[2023-11-21 14:25:52,690] [INFO] [axolotl.train.train:54] [PID:65] [RANK:0] loading model and (optionally) peft_config...
[2023-11-21 14:26:03,872] [INFO] [axolotl.load_model:410] [PID:65] [RANK:0] GPU memory usage after model load: 1.967GB (+0.105GB cache, +0.610GB misc)
[2023-11-21 14:26:03,876] [INFO] [axolotl.load_model:427] [PID:65] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-11-21 14:26:03,880] [INFO] [axolotl.load_model:438] [PID:65] [RANK:0] converting modules to torch.float16 for flash attention
[2023-11-21 14:26:03,883] [INFO] [axolotl.load_lora:547] [PID:65] [RANK:0] found linear modules: ['o_proj', 'down_proj', 'k_proj', 'up_proj', 'v_proj', 'gate_proj', 'q_proj']
trainable params: 50,851,840 || all params: 3,477,325,440 || trainable%: 1.4623836876194136
[2023-11-21 14:26:04,610] [INFO] [axolotl.load_model:474] [PID:65] [RANK:0] GPU memory usage after adapters: 2.178GB (+0.771GB cache, +0.610GB misc)
[2023-11-21 14:26:05,014] [INFO] [axolotl.train.train:82] [PID:65] [RANK:0] Pre-saving adapter config to test/model
[2023-11-21 14:26:05,017] [INFO] [axolotl.train.train:106] [PID:65] [RANK:0] Starting trainer...
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Traceback (most recent call last):
  File "/axolotl/scripts/finetune.py", line 54, in <module>
    fire.Fire(do_cli)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/tools/axolotl/scripts/finetune.py", line 50, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/tools/axolotl/src/axolotl/train.py", line 116, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    self._load_from_checkpoint(resume_from_checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2064, in _load_from_checkpoint
    raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at test/model/checkpoint-45
```
and this error on the master:
```
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
```
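For context, the `ValueError` on the worker is raised from transformers' `Trainer._load_from_checkpoint` (see the traceback), which rejects a checkpoint directory that contains no model or adapter weight files. The sketch below is only an illustration of that kind of check, not the actual transformers code; `looks_like_valid_checkpoint` is a hypothetical helper and the file names are just the usual defaults. It shows why `test/model/checkpoint-45` on the worker, which only holds `rng_state_1.pth`, is rejected:

```python
import os

# Illustration only: before resuming, the trainer looks for model weights or PEFT
# adapter weights inside the checkpoint directory. A directory that only holds
# rng_state_1.pth matches none of these, hence "Can't find a valid checkpoint at ...".
CANDIDATE_WEIGHT_FILES = [
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
    "model.safetensors",
    "adapter_model.bin",
    "adapter_model.safetensors",
]

def looks_like_valid_checkpoint(checkpoint_dir: str) -> bool:
    return any(
        os.path.isfile(os.path.join(checkpoint_dir, name))
        for name in CANDIDATE_WEIGHT_FILES
    )

print(looks_like_valid_checkpoint("test/model/checkpoint-45"))
# master (node-1): True  -- adapter_model.bin is present
# worker (node-2): False -- only rng_state_1.pth exists
```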
Steps to reproduce
I have two nodes and want to fine-tune across both of them:

- node-1: master (1 GPU)
- node-2: worker (1 GPU)
- NOTE: there is no shared storage between the nodes

Plain multi-node fine-tuning works perfectly; the problem only appears when resuming from a checkpoint, because the worker cannot find a valid checkpoint at the selected path.

I first run fine-tuning successfully on both nodes, using the accelerate configs and the fine-tune-config.yaml shown in the Config yaml section below (the initial run does not set `resume_from_checkpoint`).
On the master (node-1) every checkpoint is saved with the model inside it, but on the worker (node-2) the checkpoints are empty:
- Master node-1
```
node-1:~# ls test/model/
README.md  adapter_config.json  adapter_model.bin  checkpoint-45  checkpoint-50  checkpoint-55  checkpoint-60  config.json  special_tokens_map.json  tokenizer.model  tokenizer_config.json
node-1:~# ls test/model/checkpoint-45/
README.md  adapter_config.json  adapter_model.bin  optimizer.pt  rng_state_0.pth  scheduler.pt  trainer_state.json  training_args.bin
```
- Worker node-2
```
node-2:~# ls test/model/
README.md            adapter_model.bin  checkpoint-15  checkpoint-25  checkpoint-35  checkpoint-45  checkpoint-50  checkpoint-60  special_tokens_map.json  tokenizer_config.json
adapter_config.json  checkpoint-10      checkpoint-20  checkpoint-30  checkpoint-40  checkpoint-5   checkpoint-55  config.json    tokenizer.model
```

Inside a checkpoint there is only the RNG state file:

```
node-2:~# ls test/model/checkpoint-45/
rng_state_1.pth
```
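This pattern (full checkpoint on the master, a lone `rng_state_1.pth` on the worker) is what later breaks the resume. A small, hypothetical pre-resume check, not part of axolotl, that can be run on each node to confirm whether a checkpoint directory is actually resumable there:

```python
import os
import sys

# Hypothetical helper: list a checkpoint directory and report whether it holds the
# pieces a resume needs (trainer state plus model or adapter weights), rather than
# just a per-rank RNG state file.
checkpoint_dir = sys.argv[1] if len(sys.argv) > 1 else "test/model/checkpoint-45"
files = sorted(os.listdir(checkpoint_dir)) if os.path.isdir(checkpoint_dir) else []

has_weights = any(
    name.startswith(("adapter_model", "pytorch_model", "model.safetensors"))
    for name in files
)
has_state = "trainer_state.json" in files

print(f"{checkpoint_dir}: {files}")
print("resumable on this node:", has_weights and has_state)
```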
Resume from checkpoint on multi-node

To resume, I launch again with the same configuration on both nodes; the only change is adding the checkpoint path to fine-tune-config.yaml:

```yaml
resume_from_checkpoint: test/model/checkpoint-45
```

Launching with this config produces the errors shown under Current behaviour above.
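Since there is no shared storage, `test/model/checkpoint-45` on the worker never receives the full checkpoint. One untested workaround sketch, assuming node-2 can reach node-1 over SSH (with `MASTER-IP` as the placeholder from the accelerate configs), would be to copy the master's checkpoint to the same path on the worker before resuming:

```python
import subprocess

# Untested sketch: pull the complete checkpoint from the master node so the worker's
# output_dir contains more than its own rng_state_1.pth before resuming.
# Assumes passwordless SSH from node-2 to node-1; user and MASTER-IP are placeholders.
CHECKPOINT = "test/model/checkpoint-45"

subprocess.run(
    ["rsync", "-a", f"root@MASTER-IP:{CHECKPOINT}/", f"{CHECKPOINT}/"],
    check=True,
)
```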
Config yaml
node-1: accelerate-config.yaml

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

node-2: accelerate-config.yaml (identical except for machine_rank)

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
fine-tune-config.yaml

```yaml
base_model: openlm-research/open_llama_3b_v2
base_model_config: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: test/data.json
    type: sharegpt
dataset_prepared_path: test/prepared-dataset
val_set_size: 0.02
output_dir: test/model

adapter: qlora
lora_model_dir:

sequence_len: 128
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 60
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoint: true
resume_from_checkpoint: test/model/checkpoint-45
local_rank:
logging_steps: 1
xformers_attention:
flash_attention:

warmup_steps: 2
eval_steps: 10
eval_table_size:
save_steps: 5
debug:
deepspeed:
weight_decay: 0.0
fsdp: null
fsdp_config: null
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: null
```
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
python3.10
axolotl branch-commit
a045db02146751548fec57a5d3f31382ce4e5959
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Please update your axolotl version, as this was fixed after the commit that you are using; #795 fixed this.
The new version still has this issue.
Marking as stale. Please report if this issue still persists.