Error in fine-tuning LLaMA
The error log is below. Is this caused by a network issue?
05/28/2023 17:18:09 - WARNING - datasets.builder - Found cached dataset json (/home/lmw22/.cache/huggingface/datasets/json/default-356c98baf89317c6/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f25e9dce430>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
05/28/2023 17:20:10 - WARNING - huggingface_hub.utils._http - 'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f25e9dce430>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f25e9dce700>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/gpt2/resolve/main/config.json
05/28/2023 17:22:12 - WARNING - huggingface_hub.utils._http - 'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /gpt2/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f25e9dce700>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/gpt2/resolve/main/config.json
Traceback (most recent call last):
  File "/new_home/lmw22/LMFlow/transformers/src/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/home/lmw22/.conda/envs/lmflow/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/lmw22/.conda/envs/lmflow/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/new_home/lmw22/LMFlow/examples/finetune.py", line 61, in
Thanks for your interest in LMFlow! Yes, it is caused by a network connection error when accessing Hugging Face. If you have another server that has downloaded gpt2 before, you can copy the corresponding model under ~/.cache/huggingface/hub/ to your new server. Alternatively, you may retry multiple times until it works. Hope that solves the issue.
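If gpt2 is already in the local cache, you can also tell transformers to skip the network check entirely, so the script fails fast (or succeeds) instead of hanging on the HEAD requests. A minimal sketch, assuming the standard cache location:

```python
import os

# Force offline mode before importing transformers: no HTTP calls to
# huggingface.co, only the local cache is consulted.
os.environ["HF_HUB_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer

# local_files_only=True raises immediately if the files are not cached
# under ~/.cache/huggingface/hub, instead of timing out on the network.
tokenizer = AutoTokenizer.from_pretrained("gpt2", local_files_only=True)
model = AutoModelForCausalLM.from_pretrained("gpt2", local_files_only=True)
```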
OK, thanks. I tried fine-tuning with the locally available pinkmanlove/llama-7b-hf, but there still seems to be a problem:

05/29/2023 20:22:49 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
05/29/2023 20:22:59 - WARNING - datasets.builder - Found cached dataset json (/home/lmw22/.cache/huggingface/datasets/json/default-356c98baf89317c6/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /pinkmanlove/llama-7b-hf/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6ca4258490>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/pinkmanlove/llama-7b-hf/resolve/main/tokenizer_config.json
05/29/2023 20:25:04 - WARNING - huggingface_hub.utils._http - 'HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /pinkmanlove/llama-7b-hf/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6ca4258490>, 'Connection to huggingface.co timed out. (connect timeout=10)'))' thrown while requesting HEAD https://huggingface.co/pinkmanlove/llama-7b-hf/resolve/main/tokenizer_config.json
[2023-05-29 20:25:20,435] [INFO] [partition_parameters.py:415:exit] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00, 4.43s/it]
05/29/2023 20:25:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/lmw22/.cache/huggingface/datasets/json/default-356c98baf89317c6/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-68471b592de9d64954461f59464cec24.arrow
05/29/2023 20:25:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/lmw22/.cache/huggingface/datasets/json/default-356c98baf89317c6/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-850203d3ebf344f1.arrow
Using /new_home/lmw22/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /new_home/lmw22/.cache/torch_extensions/py39_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.362785577774048 seconds
Using /new_home/lmw22/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Emitting ninja build file /new_home/lmw22/.cache/torch_extensions/py39_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1490616798400879 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-29 20:26:43,838] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 17569
[2023-05-29 20:26:43,883] [ERROR] [launch.py:324:sigkill_handler] ['/home/lmw22/.conda/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'pinkmanlove/llama-7b-hf', '--dataset_path', '/home/lmw22/LMFlow/data/alpaca/train', '--output_dir', '/home/lmw22/LMFlow/output_models/finetune', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9
It is highly probable that the problem is caused by OOM: exit code -9 means the process was killed by SIGKILL, which is typically the kernel's out-of-memory killer. According to this article, a 7B model theoretically requires ~120 GB of memory in total for full fine-tuning, so GPU memory plus CPU memory (RAM) should add up to at least 120 GB for it to work in practice. DeepSpeed ZeRO-3 does offloading and can trade CPU memory for GPU memory, but the total amount cannot be reduced.
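For a rough sense of where the ~120 GB figure comes from, here is a back-of-envelope calculation for mixed-precision training with Adam; the per-parameter byte counts are the usual assumptions for this setup, not measured values:

```python
params = 6.74e9  # LLaMA-7B parameter count, as reported in the log above

weights_half = params * 2  # model weights in bf16/fp16
grads_half   = params * 2  # gradients in bf16/fp16
master_fp32  = params * 4  # fp32 master copy of the weights
adam_m_fp32  = params * 4  # Adam first moment (momentum)
adam_v_fp32  = params * 4  # Adam second moment (variance)

total = weights_half + grads_half + master_fp32 + adam_m_fp32 + adam_v_fp32
print(f"{total / 1e9:.0f} GB")  # ~108 GB, before activations and buffers
```

Activations, communication buffers, and framework overhead account for the remainder of the ~120 GB estimate.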
If the issue is indeed caused by insufficient memory, you may try a smaller model or use a server with more CPU RAM. DeepSpeed also offers NVMe offloading to trade disk space for GPU memory, but it is reported to be slow under certain circumstances (see this link), so the previous two options may be better.
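For reference, NVMe offloading is enabled in the zero_optimization section of the DeepSpeed config (configs/ds_config_zero3.json in the command above). A minimal sketch of that section, written here as a Python dict; "/local_nvme" is a placeholder for a fast local SSD mount, not a value from this thread:

```python
# Mirrors the JSON structure of a ZeRO-3 DeepSpeed config with both
# optimizer states and parameters offloaded to NVMe instead of CPU RAM.
zero3_nvme = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    }
}
```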
Hope that answers your question 🙏
I am also having the connection error:

OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like bigscience/bloom-560m is not the path to a directory containing a file named config.json. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

I am following the offline mode instructions: https://huggingface.co/docs/transformers/installation#offline-mode
For example, if I downloaded all the files from https://huggingface.co/bigscience/bloom-560m/tree/main, where should I put them? (How do I match them to the local model path in LMFlow?) Should I create the path ~/.cache/huggingface/hub/ myself, since I do not see a ".cache" folder?
By the way, in another project I was working on, the bloom-560m model was downloaded as shown below, which is quite different from the layout at https://huggingface.co/bigscience/bloom-560m/tree/main. Could you please advise?
The connection error is caused by the internet connection. You can download the model to a local directory and point --model_name_or_path at that directory; transformers treats any path containing a config.json as a local model and will not contact the Hub, so you do not need to reproduce the ~/.cache/huggingface/hub layout.
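A minimal sketch of that workflow, run on a machine that does have network access and a reasonably recent huggingface_hub (the target directory is just an example):

```python
from huggingface_hub import snapshot_download

# Download the whole repo into a plain local directory, with the same
# file layout as https://huggingface.co/bigscience/bloom-560m/tree/main.
local_dir = snapshot_download(
    "bigscience/bloom-560m",
    local_dir="/home/lmw22/models/bloom-560m",  # example path, adjust freely
)
print(local_dir)
```

Afterwards, pass that directory to the finetune script, e.g. --model_name_or_path /home/lmw22/models/bloom-560m.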
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks