FastChat
ERROR:torch.distributed.elastic.multiprocessing.api:failed
command:
torchrun --nnodes=1 --nproc_per_node=1 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path /home/work/virtual-venv/fastchat-env/data/transformer_model_7b \
    --data_path playground/data/dummy.json \
    --fp16 True \
    --output_dir /home/work/virtual-venv/fastchat-env/data/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
and the result:
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
**ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 27657) of binary:** /home/work/virtual-venv/fastchat-env/bin/python
Traceback (most recent call last):
File "/home/work/virtual-venv/fastchat-env/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
fastchat/train/train_mem.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-23_19:04:14
host : iZj6c9gnyket5dhojrds1tZ
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 27657)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 27657
======================================================
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0
How can I solve this problem?
@yuanconghao I solved this error by increasing CPU memory.
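For reference, exit code -9 means the training process received SIGKILL, which on Linux usually comes from the kernel OOM killer when host RAM runs out while the checkpoint shards are loaded into CPU memory. A minimal sketch for confirming this and working around it with swap (the commands assume a standard Linux host with sudo access and are not FastChat-specific; the 64G size is only an example):

```bash
# Check whether the kernel OOM killer terminated the torchrun child process;
# look for "Out of memory: Killed process <pid>" entries near the failure time.
sudo dmesg -T | grep -iE "out of memory|oom-kill" | tail -n 20

# Watch host memory while the checkpoint shards load; the process is killed
# once "available" memory approaches zero.
free -h

# Stopgap: add swap so shard loading can spill to disk instead of being killed
# (slow, but it avoids the SIGKILL while you arrange more RAM).
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```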
How much CPU and memory capacity do you have? Mine is:
CPU: 8 cores
RAM: 32 GB
GPU: 1 × 16 GiB
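(For anyone else reporting here, the same numbers can be read off with standard tools; nothing below is FastChat-specific:)

```bash
nproc                                                   # CPU core count
free -h                                                 # total / available host RAM
nvidia-smi --query-gpu=name,memory.total --format=csv   # GPU model and VRAM per device
```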
@moseshu
16 GB is very little to train a model; I am not sure you can without some quantization. In any case, did you manage it? Should we still look into this issue?
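If you do want to retry on this hardware, FastChat also ships a LoRA training entry point (fastchat/train/train_lora.py) that can be combined with 4-bit quantization (QLoRA) to reduce both GPU and host memory pressure. The sketch below adapts the original command; the LoRA-specific flags (--lora_r, --lora_alpha, --lora_dropout, --q_lora) are assumptions based on that script's argument names and should be checked against your FastChat version, and whether a 7B model then fits in 8 cores / 32 GB RAM / 16 GiB VRAM is an assumption, not a tested result.

```bash
# Hedged sketch: QLoRA fine-tuning instead of full-parameter FSDP training.
# Requires peft and bitsandbytes to be installed; verify the LoRA flag names
# with `python fastchat/train/train_lora.py --help` before running.
torchrun --nnodes=1 --nproc_per_node=1 --master_port=20001 fastchat/train/train_lora.py \
    --model_name_or_path /home/work/virtual-venv/fastchat-env/data/transformer_model_7b \
    --data_path playground/data/dummy.json \
    --output_dir /home/work/virtual-venv/fastchat-env/data/vicuna-dummy-lora \
    --fp16 True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --q_lora True \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --save_strategy "steps" \
    --save_steps 300 \
    --logging_steps 1 \
    --model_max_length 1024 \
    --gradient_checkpointing True \
    --lazy_preprocess True
```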
Closing this issue for now.