DB-GPT-Hub
LoRA + DeepSpeed training error: sh scripts/lora/lora_ds.sh
Error logs:
[INFO] date:2023-08-14 21:09:52
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[2023-08-14 21:09:57,184] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/models/WizardCoder-15B-V1.0
[2023-08-14 21:09:59,612] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-14 21:09:59,612] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-14 21:09:59,612] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
WARNING:root:Process rank: 0, device: cuda:0, n_gpu: 1
WARNING:root:distributed training: True, 16-bits training: False
WARNING:root:Training parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
cache_dir=None,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=scripts/ds_config/zero3_auto.json,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
full_finetune=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=8,
gradient_checkpointing=True,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=adapter/runs/Aug14_21-09-59_vipdata-gpu-108-236.serving.ai.paas,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=0.3,
max_steps=10000,
metric_for_best_model=None,
model_max_length=2048,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_torch,
optim_args=None,
output_dir=adapter,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=4,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=adapter,
sample_generate=False,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=steps,
save_total_limit=5,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
train_on_source=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.03,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
device_map: {'': 0}
Loading Model from /models/Baichuan-13B-Chat...
/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/configuration_utils.py:483: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/modeling_utils.py:2193: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
File "/home/chopin/code/DB-GPT-Hub/train_lora.py", line 310, in <module>
train()
File "/home/chopin/code/DB-GPT-Hub/train_lora.py", line 261, in train
model, tokenizer = load_model_tokenizer(args=args)
File "/home/chopin/code/DB-GPT-Hub/train_lora.py", line 169, in load_model_tokenizer
model = AutoModelForCausalLM.from_pretrained(
File "/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
return model_class.from_pretrained(
File "/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2247, in from_pretrained
raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 881893) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
File "/home/chopin/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_lora.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-14_21:10:04
host : vipdata-gpu-108-236.serving.ai.paas
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 881893)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
finished
Script:
CUDA_VISIBLE_DEVICES=3,4,5 torchrun --nproc_per_node=3 train_lora.py \
--model_name_or_path /models/Baichuan-13B-Chat \
--dataset_name spider \
--output_dir adapter \
--lora_target_modules W_pack \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 500 \
--save_total_limit 5 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--optim "adamw_torch" \
--lr_scheduler_type "cosine" \
--model_max_length 2048 \
--logging_steps 1 \
--do_train \
--do_eval \
--trust_remote_code \
--gradient_checkpointing True \
--deepspeed "scripts/ds_config/zero3_auto.json"
Does scripts/lora/lora.sh have a problem?
The same error occurs when launching with --nproc_per_node=1 instead of 3; the log and script are otherwise identical.
Please help me identify where the issue lies. I ran into some parameter-related issues with the original script, so I modified these parameters: --trust_remote_code and --dataset_name spider.
I then modified the train_lora.py script and commented out the device_map argument in the following call, and the error went away:
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    # device_map=device_map,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    if args.q_lora
    else None,
    torch_dtype=compute_dtype,
    **config_kwargs,
)
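Commenting the argument out silences the error under ZeRO-3, but it also drops GPU placement for plain (non-DeepSpeed) runs. A more targeted guard is sketched below, assuming the is_deepspeed_zero3_enabled helper from transformers (present in the version shown in the traceback); args, device_map, compute_dtype, and config_kwargs are the variables already defined in load_model_tokenizer:

# Hedged sketch, not the repo's committed fix: pass device_map only when
# DeepSpeed ZeRO-3 is not active, since ZeRO-3 shards and places the
# parameters itself.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from transformers.deepspeed import is_deepspeed_zero3_enabled

model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    # None under ZeRO-3; the original {'': 0}-style map otherwise.
    device_map=None if is_deepspeed_zero3_enabled() else device_map,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    if args.q_lora
    else None,
    torch_dtype=compute_dtype,
    **config_kwargs,
)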