
Finetuning fails with an error

Open HaoBytes opened this issue 1 year ago • 6 comments

Hi everyone,

I tried to reproduce the Alpaca finetuning, but I ran into the following error. Could you please help me?

Running command git clone --quiet https://github.com/huggingface/transformers /tmp/4267942.1.nvidiagpu.q/pip-req-build-317x2j5l
ERROR: file:///mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
Traceback (most recent call last):
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 231, in <module>
    train()
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 194, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 112, in __init__
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/training_args.py", line 1211, in __post_init__
    raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 79094) of binary: /mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/python
Traceback (most recent call last):
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-07_17:00:37
  host      : node812.pri.csf3.alces.network
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 79094)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

My commands are:

pip install numpy
pip install git+https://github.com/huggingface/transformers
pip install -r requirements.txt
pip install -e .

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir ../7B \
    --model_size 7B \
    --output_dir ../llama_7B_hf
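(Note: `convert_llama_weights_to_hf.py` lives inside the transformers source tree, and `pip install git+...` does not leave a `src/` checkout behind, which would explain the later "No such file or directory" error. A hedged sketch, with the same paths assumed as above, of fetching the script via an explicit clone instead:)

```shell
# The conversion script ships with the transformers source tree, so clone
# the repo explicitly instead of relying on the pip-installed package layout.
git clone https://github.com/huggingface/transformers
python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir ../7B \
    --model_size 7B \
    --output_dir ../llama_7B_hf
```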

torchrun --nproc_per_node=2 --master_port=2023 train.py \
    --model_name_or_path ./llama_7B_hf/llama-7b \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./alpaca/finetuned_alpaca_7B \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

Thank you!

HaoBytes avatar Apr 07 '23 16:04 HaoBytes

Can you tell me what GPUs you are using? Try this: replace `--bf16 True` with `--fp16 True`, and hopefully it will work.
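(This suggestion boils down to checking the GPU's compute capability: bf16 and TF32 need Ampere, i.e. compute capability 8.x or newer, while the V100 is 7.0. A minimal sketch of that decision, using a hypothetical helper that is not part of the repo:)

```python
def pick_precision_flags(cc_major: int) -> dict:
    """Map a CUDA compute capability major version to trainer precision flags.

    bf16 and TF32 require Ampere (compute capability 8.x) or newer;
    older GPUs such as the V100 (7.0) should fall back to fp16.
    """
    ampere_or_newer = cc_major >= 8
    return {
        "bf16": ampere_or_newer,
        "tf32": ampere_or_newer,
        "fp16": not ampere_or_newer,
    }

# V100 has compute capability 7.0 -> use fp16, drop --bf16/--tf32
print(pick_precision_flags(7))  # {'bf16': False, 'tf32': False, 'fp16': True}
```

On a live machine, the major version can be read with `torch.cuda.get_device_capability()`.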

Ahtesham00 avatar Apr 07 '23 22:04 Ahtesham00

Hi, I'm using 2× 16 GB V100s. Thanks, I will try that.

HaoBytes avatar Apr 08 '23 11:04 HaoBytes

Hi Ahtesham00,

thank you for the help! I followed your suggestion, but it still raises an error:

 Running command git clone --quiet https://github.com/huggingface/transformers /tmp/4269055.1.nvidiagpu.q/pip-req-build-nya7h15v
ERROR: file:///mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
python: can't open file '/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/src/transformers/models/llama/convert_llama_weights_to_hf.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 231, in <module>
    train()
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 194, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 112, in __init__
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/training_args.py", line 1299, in __post_init__
    raise ValueError("--tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7")
ValueError: --tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 212588) of binary: /mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/python
Traceback (most recent call last):
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-09_13:39:18
  host      : node803.pri.csf3.alces.network
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 212588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

HaoBytes avatar Apr 09 '23 12:04 HaoBytes

Maybe the problem is your torch and CUDA versions. Follow the ValueError prompt.

xv994 avatar Apr 10 '23 02:04 xv994

@HarrywillDr V100 GPUs do not support TF32; remove that flag.

kir152 avatar Apr 11 '23 03:04 kir152

TF32 is supported on A100 (Ampere) GPUs, and you are using a V100.
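(Putting the thread together, a V100-friendly variant of the original command would drop both Ampere-only flags. This is a sketch that only avoids the two ValueErrors above; everything besides the precision flags is unchanged from the original command:)

```shell
# V100 (compute capability 7.0): use fp16 and drop the Ampere-only
# --bf16/--tf32 flags; all other flags stay as in the original command.
torchrun --nproc_per_node=2 --master_port=2023 train.py \
    --model_name_or_path ./llama_7B_hf/llama-7b \
    --data_path ./alpaca_data.json \
    --fp16 True \
    --output_dir ./alpaca/finetuned_alpaca_7B \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```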

Ahtesham00 avatar Apr 11 '23 18:04 Ahtesham00