
How to finetune on a single GPU

Open sxsxsx opened this issue 1 year ago • 1 comment

cd finetune && deepspeed finetune_deepseekcoder.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --num_train_epochs 3 \
    --model_max_length 1024 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --warmup_steps 10 \
    --logging_steps 1 \
    --lr_scheduler_type "cosine" \
    --gradient_checkpointing True \
    --report_to "tensorboard" \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 True
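For a single-GPU machine this command should work as-is: when no hostfile is found, the DeepSpeed launcher falls back to local resources (as the log below confirms). To pin the launcher to exactly one device explicitly, a minimal sketch using the launcher's --num_gpus flag (abbreviated; the remaining training flags are the same as above):

# Sketch: pin the DeepSpeed launcher to one local GPU; pass the rest
# of the training flags exactly as in the full command above.
cd finetune && deepspeed --num_gpus 1 finetune_deepseekcoder.py \
    --model_name_or_path $MODEL_PATH \
    --data_path $DATA_PATH \
    --output_dir $OUTPUT_PATH \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 True

Note that a 6.7B model with per-device batch size 16 may not fit on a single GPU even under ZeRO-3; lowering the batch size or enabling ZeRO-3 CPU offload in the DeepSpeed config are common workarounds.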

[2023-12-19 16:10:57,887] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:06,596] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-12-19 16:11:06,596] [INFO] [runner.py:570:main] cmd = /home/admin/miniconda3/envs/deepseek/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_deepseekcoder.py --model_name_or_path deepseek-ai/deepseek-coder-6.7b-instruct --data_path ../data/nickroshEvol-Instruct-Code-80k-v1/EvolInstruct-Code-80k.json --output_dir ./outputs --num_train_epochs 3 --model_max_length 1024 --per_device_train_batch_size 16 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 100 --save_total_limit 100 --learning_rate 2e-5 --warmup_steps 10 --logging_steps 1 --lr_scheduler_type cosine --gradient_checkpointing True --report_to tensorboard --deepspeed configs/ds_config_zero3.json --bf16 True
[2023-12-19 16:11:12,734] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). [same warning as above]
  return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:16,782] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2023-12-19 16:11:16,782] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-12-19 16:11:16,782] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-12-19 16:11:16,782] [INFO] [launch.py:163:main] dist_world_size=1
[2023-12-19 16:11:16,782] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11070). [same warning as above]
  return torch._C._cuda_getDeviceCount() > 0
[2023-12-19 16:11:28,688] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-19 16:11:30,064] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-19 16:11:30,065] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/workspace/workdir/tevs_multi_idc_10g_20220825163730/lyq/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 193, in <module>
    train()
  File "/workspace/workdir/tevs_multi_idc_10g_20220825163730/lyq/DeepSeek-Coder/finetune/finetune_deepseekcoder.py", line 123, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 123, in __init__
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1493, in __post_init__
    and (self.device.type != "cuda")
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1941, in device
    return self._setup_devices
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/transformers/training_args.py", line 1867, in _setup_devices
    self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/accelerate/state.py", line 183, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 146, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/admin/miniconda3/envs/deepseek/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1279, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(backend_prefix_store, group_rank, group_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

sxsxsx · Dec 19 '23 16:12

It seems your environment has no usable GPU device. The UserWarning in your log says the NVIDIA driver is too old for the installed PyTorch build, so torch.cuda reports no devices and NCCL cannot initialize. Updating the driver, or installing a PyTorch build compiled against the CUDA version your driver supports (as the warning itself suggests), should fix it.
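To confirm, a quick sanity check (a sketch; assumes nvidia-smi is on PATH and the same conda env is active):

nvidia-smi
# prints the installed driver version and any visible GPUs

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"
# if is_available() prints False while nvidia-smi does list a GPU, the
# driver is too old for the CUDA version this PyTorch build was compiled
# against, exactly as the UserWarning in the log indicates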

yh-xu · Feb 04 '24 09:02