
encounter errors when I try to finetune the model

Open SleepEarlyLiveLong opened this issue 2 years ago • 2 comments

I encountered the following problem when fine-tuning the model following the guidance in README.md.

Here is the detailed error:

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# sh order.sh
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
  warnings.warn(
(the same FutureWarning is printed once by each of the four worker processes)

Downloading shards:   0%|          | 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/stanford_alpaca/train.py", line 222, in train()
  File "/data/stanford_alpaca/train.py", line 186, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2523, in from_pretrained
    resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 934, in get_checkpoint_shard_files
    cached_filename = cached_file(
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like decapoda-research/llama-7b-hf is not the path to a directory containing a file named pytorch_model-00001-of-00033.bin. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Downloading shards:   0%|          | 0/33 [00:00<?, ?it/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25946 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25948 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25949 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 25947) of binary: /data/miniconda3/envs/alpaca/bin/python
Traceback (most recent call last):
  File "/data/miniconda3/envs/alpaca/bin/torchrun", line 8, in sys.exit(main())
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-06-05_21:03:54
  host       : iZwz95ccn6prjs8ioz8bbdZ
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 25947)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
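For reference, the OSError above just means the sharded checkpoint for decapoda-research/llama-7b-hf could not be fetched from huggingface.co and was not in the local cache; the message itself points at running in offline mode. A minimal sketch of one possible workaround (not part of the original report; the local path below is made up for illustration) is to download the weights once on a machine with a working connection and then point --model_name_or_path at that local directory:

```python
# Sketch: pre-download the sharded checkpoint so train.py never has to
# reach huggingface.co at launch time. The target directory is illustrative.
from huggingface_hub import snapshot_download

local_dir = "/data/models/llama-7b-hf"  # hypothetical local path
snapshot_download(
    repo_id="decapoda-research/llama-7b-hf",
    local_dir=local_dir,
)
```

After that, passing --model_name_or_path /data/models/llama-7b-hf to train.py should let transformers load the shards from disk instead of the Hub.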

Here is the command (the contents of order.sh):

torchrun --nproc_per_node=4 --master_port=7788 train.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True
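As an aside, the FutureWarning in the log only concerns the --fsdp_transformer_layer_cls_to_wrap flag being deprecated; it is unrelated to the crash. If you want to silence it, something along these lines should work (a rough sketch; the file name is arbitrary, and the exact key expected inside fsdp_config can differ between transformers versions):

```python
# Sketch: write an FSDP config file and pass it via --fsdp_config instead of
# the deprecated --fsdp_transformer_layer_cls_to_wrap flag.
# The key name below matches transformers 4.29.x; newer releases may expect
# "transformer_layer_cls_to_wrap" instead.
import json

fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer"}
with open("fsdp_config.json", "w") as f:
    json.dump(fsdp_config, f, indent=2)
```

The launch command would then use --fsdp "full_shard auto_wrap" --fsdp_config fsdp_config.json.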

Here are some details of my machine:

[screenshot of machine specs]

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# conda list

packages in environment at /data/miniconda3/envs/alpaca:

Name                      Version        Build           Channel
_libgcc_mutex             0.1            main
_openmp_mutex             5.1            1_gnu
absl-py                   1.4.0          pypi_0          pypi
accelerate                0.19.0         pypi_0          pypi
aiohttp                   3.8.4          pypi_0          pypi
aiosignal                 1.3.1          pypi_0          pypi
appdirs                   1.4.4          pypi_0          pypi
async-timeout             4.0.2          pypi_0          pypi
attrs                     23.1.0         pypi_0          pypi
ca-certificates           2023.01.10     h06a4308_0
certifi                   2023.5.7       pypi_0          pypi
charset-normalizer        3.1.0          pypi_0          pypi
click                     8.1.3          pypi_0          pypi
cmake                     3.26.3         pypi_0          pypi
docker-pycreds            0.4.0          pypi_0          pypi
fairscale                 0.4.13         pypi_0          pypi
filelock                  3.12.0         pypi_0          pypi
fire                      0.5.0          pypi_0          pypi
frozenlist                1.3.3          pypi_0          pypi
fsspec                    2023.5.0       pypi_0          pypi
gitdb                     4.0.10         pypi_0          pypi
gitpython                 3.1.31         pypi_0          pypi
huggingface-hub           0.15.1         pypi_0          pypi
idna                      3.4            pypi_0          pypi
jinja2                    3.1.2          pypi_0          pypi
joblib                    1.2.0          pypi_0          pypi
ld_impl_linux-64          2.38           h1181459_1
libffi                    3.4.4          h6a678d5_0
libgcc-ng                 11.2.0         h1234567_1
libgomp                   11.2.0         h1234567_1
libstdcxx-ng              11.2.0         h1234567_1
lit                       16.0.5         pypi_0          pypi
llama                     0.0.0          dev_0
markupsafe                2.1.2          pypi_0          pypi
mpmath                    1.3.0          pypi_0          pypi
multidict                 6.0.4          pypi_0          pypi
ncurses                   6.4            h6a678d5_0
networkx                  3.1            pypi_0          pypi
nltk                      3.8.1          pypi_0          pypi
numpy                     1.24.3         pypi_0          pypi
nvidia-cublas-cu11        11.10.3.66     pypi_0          pypi
nvidia-cuda-cupti-cu11    11.7.101       pypi_0          pypi
nvidia-cuda-nvrtc-cu11    11.7.99        pypi_0          pypi
nvidia-cuda-runtime-cu11  11.7.99        pypi_0          pypi
nvidia-cudnn-cu11         8.5.0.96       pypi_0          pypi
nvidia-cufft-cu11         10.9.0.58      pypi_0          pypi
nvidia-curand-cu11        10.2.10.91     pypi_0          pypi
nvidia-cusolver-cu11      11.4.0.1       pypi_0          pypi
nvidia-cusparse-cu11      11.7.4.91      pypi_0          pypi
nvidia-nccl-cu11          2.14.3         pypi_0          pypi
nvidia-nvtx-cu11          11.7.91        pypi_0          pypi
openai                    0.27.7         pypi_0          pypi
openssl                   1.1.1t         h7f8727e_0
packaging                 23.1           pypi_0          pypi
pathtools                 0.1.2          pypi_0          pypi
pip                       23.0.1         py39h06a4308_0
protobuf                  4.23.2         pypi_0          pypi
psutil                    5.9.5          pypi_0          pypi
python                    3.9.16         h7a1cb2a_2
pyyaml                    6.0            pypi_0          pypi
readline                  8.2            h5eee18b_0
regex                     2023.5.5       pypi_0          pypi
requests                  2.31.0         pypi_0          pypi
rouge-score               0.1.2          pypi_0          pypi
sentencepiece             0.1.99         pypi_0          pypi
sentry-sdk                1.24.0         pypi_0          pypi
setproctitle              1.3.2          pypi_0          pypi
setuptools                67.8.0         py39h06a4308_0
six                       1.16.0         pypi_0          pypi
smmap                     5.0.0          pypi_0          pypi
sqlite                    3.41.2         h5eee18b_0
sympy                     1.12           pypi_0          pypi
termcolor                 2.3.0          pypi_0          pypi
tk                        8.6.12         h1ccaba5_0
tokenizers                0.13.3         pypi_0          pypi
torch                     2.0.1          pypi_0          pypi
tqdm                      4.65.0         pypi_0          pypi
transformers              4.29.2         pypi_0          pypi
triton                    2.0.0          pypi_0          pypi
typing-extensions         4.6.2          pypi_0          pypi
tzdata                    2023c          h04d1e81_0
urllib3                   1.26.16        pypi_0          pypi
wandb                     0.15.3         pypi_0          pypi
wheel                     0.38.4         py39h06a4308_0
xz                        5.4.2          h5eee18b_0
yarl                      1.9.2          pypi_0          pypi
zlib                      1.2.13         h5eee18b_0

What is the cause of this error, and how can I fix it? Thanks a lot!

SleepEarlyLiveLong · Jun 05 '23 13:06

I'm running into the same error.

wyzhhhh · Jun 07 '23 12:06

> I'm running into the same error.

I solved the problem by updating Python from 3.9 to 3.10.

SleepEarlyLiveLong · Jun 09 '23 03:06