
ERROR:torch.distributed.elastic.multiprocessing.api:failed

Open yuanconghao opened this issue 2 years ago • 2 comments

command:

torchrun --nnodes=1 --nproc_per_node=1 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path /home/work/virtual-venv/fastchat-env/data/transformer_model_7b \
    --data_path playground/data/dummy.json \
    --fp16 True \
    --output_dir /home/work/virtual-venv/fastchat-env/data/vicuna-dummy \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "tensorboard" \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

and the result:

Loading checkpoint shards:   0%|                                                                                                                             | 0/2 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 27657) of binary: /home/work/virtual-venv/fastchat-env/bin/python
Traceback (most recent call last):
  File "/home/work/virtual-venv/fastchat-env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/work/virtual-venv/fastchat-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
fastchat/train/train_mem.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-23_19:04:14
  host      : iZj6c9gnyket5dhojrds1tZ
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 27657)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 27657
======================================================

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0

How can I solve this problem?

yuanconghao · Apr 23 '23 11:04

@yuanconghao I solved this error by increasing CPU memory.
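
Exit code -9 means the child process received SIGKILL (as the traceback shows), which on Linux usually comes from the kernel OOM killer when host RAM runs out while the checkpoint shards are loaded. You can confirm it in the kernel log; a minimal check, assuming a Linux host where you can read it:

# Look for OOM-killer entries around the time the training process died
dmesg -T | grep -iE "out of memory|killed process"
# or, on systemd-based hosts
journalctl -k | grep -i oom

If the log shows the training PID being killed for memory, adding host RAM or swap should fix it.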

moseshu · Apr 24 '23 03:04

How much CPU and memory capacity do you have?

Mine is: 8-core CPU, 32 GB RAM, 1 GPU with 16 GiB of memory. @moseshu

yuanconghao · Apr 24 '23 07:04

16 GB is very little to train a model; I am not sure you can without some quantization. In any case, did you manage it? Should we still look into this issue?
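
For example, FastChat also ships a LoRA training script that can load the base model quantized (QLoRA), which is a much better fit for a single 16 GiB GPU. This is only a rough sketch: the script path, the --q_lora flag, the deepspeed config path, and the output directory below follow FastChat's training docs from around that time and may differ in your version.

# QLoRA sketch: train low-rank adapters on a quantized base model (much lower memory than full FSDP fine-tuning)
deepspeed fastchat/train/train_lora.py \
    --model_name_or_path /home/work/virtual-venv/fastchat-env/data/transformer_model_7b \
    --data_path playground/data/dummy.json \
    --output_dir /home/work/virtual-venv/fastchat-env/data/vicuna-dummy-qlora \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --q_lora True \
    --fp16 True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed playground/deepspeed_config_s2.json

Since LoRA only updates small adapter matrices, both the GPU memory and the host RAM needed for optimizer state drop sharply compared with full-parameter FSDP training.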

surak · Oct 21 '23 16:10

Closing this issue for now.

infwinston · Oct 21 '23 17:10