stanford_alpaca
finetuning with error
Hi everyone,
I tried to reproduce the finetuning of Alpaca, but I ran into the following error. Could you please help me?
Running command git clone --quiet https://github.com/huggingface/transformers /tmp/4267942.1.nvidiagpu.q/pip-req-build-317x2j5l
ERROR: file:///mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
Traceback (most recent call last):
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 231, in <module>
train()
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 194, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 112, in __init__
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/training_args.py", line 1211, in __post_init__
raise ValueError(
ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 79094) of binary: /mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/python
Traceback (most recent call last):
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-07_17:00:37
host : node812.pri.csf3.alces.network
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 79094)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
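For context on the ValueError above: bf16 (and TF32) require an Ampere-or-newer GPU, i.e. CUDA compute capability 8.0 or higher, while the V100 is Volta (7.0). A minimal sketch of the check transformers performs, using a small hand-written table of well-known compute capabilities (transformers itself queries `torch.cuda.get_device_capability()` at runtime):

```python
# Sketch of the capability check behind the bf16/tf32 ValueError.
# The COMPUTE_CAPABILITY table is illustrative, not part of transformers.
COMPUTE_CAPABILITY = {
    "V100": (7, 0),  # Volta
    "A100": (8, 0),  # Ampere
    "H100": (9, 0),  # Hopper
}

def supports_bf16_and_tf32(gpu: str) -> bool:
    """bf16 and TF32 both need Ampere or newer (compute capability >= 8.0)."""
    major, _minor = COMPUTE_CAPABILITY[gpu]
    return major >= 8

print(supports_bf16_and_tf32("V100"))  # False -> hence the ValueError on V100
print(supports_bf16_and_tf32("A100"))  # True
```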
My commands are:
pip install numpy
pip install git+https://github.com/huggingface/transformers
pip install -r requirements.txt
pip install -e .
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir ../7B \
--model_size 7B \
--output_dir ../llama_7B_hf
torchrun --nproc_per_node=2 --master_port=2023 train.py \
--model_name_or_path ./llama_7B_hf/llama-7b \
--data_path ./alpaca_data.json \
--bf16 True \
--output_dir ./alpaca/finetuned_alpaca_7B \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True
Thank you!
Can you tell me what GPUs you are using? Try this: replace --bf16 True with --fp16 True and hopefully it will work.
Hi, I'm using 2x 16GB V100s. Thanks, I will try.
Hi Ahtesham00,
thank you for your help! I followed your suggestion, but it still raises an error:
Running command git clone --quiet https://github.com/huggingface/transformers /tmp/4269055.1.nvidiagpu.q/pip-req-build-nya7h15v
ERROR: file:///mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.
python: can't open file '/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/src/transformers/models/llama/convert_llama_weights_to_hf.py': [Errno 2] No such file or directory
Traceback (most recent call last):
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 231, in <module>
train()
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/alpaca/train.py", line 194, in train
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/hf_argparser.py", line 332, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "<string>", line 112, in __init__
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/transformers/training_args.py", line 1299, in __post_init__
raise ValueError("--tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7")
ValueError: --tf32 requires Ampere or a newer GPU arch, cuda>=11 and torch>=1.7
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 212588) of binary: /mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/python
Traceback (most recent call last):
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/iusers01/fatpou01/compsci01/m32815hl/.conda/envs/alpaca/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-09_13:39:18
host : node803.pri.csf3.alces.network
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 212588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Maybe the problem is your torch and CUDA versions. Follow the ValueError prompt.
@HarrywillDr V100 GPUs do not support TF32; remove that flag.
TF32 is supported by the A100, and you are using a V100.
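Putting the replies together: on V100, the launch command above needs --bf16 True replaced by --fp16 True, and --tf32 True removed entirely. A small sketch of that flag rewrite (the helper name is hypothetical, not part of the repo):

```python
def adjust_flags_for_pre_ampere(args):
    """Replace --bf16 with --fp16 and drop --tf32, which Volta (V100) lacks."""
    out = []
    skip_next = False
    for i, a in enumerate(args):
        if skip_next:
            skip_next = False
            continue
        if a == "--bf16":
            # Keep the original True/False value, but switch to fp16.
            out += ["--fp16", args[i + 1]]
            skip_next = True
        elif a == "--tf32":
            skip_next = True  # drop the flag and its value
        else:
            out.append(a)
    return out

print(adjust_flags_for_pre_ampere(
    ["--bf16", "True", "--tf32", "True", "--logging_steps", "1"]))
# ['--fp16', 'True', '--logging_steps', '1']
```

Applied to the torchrun invocation above, this yields --fp16 True with no --tf32 flag, which is what finally runs on pre-Ampere hardware.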