VisualGLM-6B
When fine-tuning with QLoRA on Linux (four 8 GB GPUs), all four ranks are killed ("Killing subprocess", exit code -9) and the run dies with out-of-memory. How can I fix this?
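For reference, a rough back-of-the-envelope estimate of the weight footprint alone (the ~6e9 parameter count is an assumption for illustration; activations, optimizer state, and the image encoder come on top of this):

```python
# Rough memory estimate for the language-model weights only.
# PARAMS is an assumed parameter count, not an exact figure.
PARAMS = 6e9
GIB = 2**30

fp16_gib = PARAMS * 2 / GIB    # 2 bytes per parameter in fp16
int4_gib = PARAMS * 0.5 / GIB  # 0.5 bytes per parameter when 4-bit quantized

print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

So even with 4-bit weights (~2.8 GiB), an 8 GB card leaves little headroom once activations and the fp16 parts are loaded; in fp16 the weights alone (~11.2 GiB) would not fit at all.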
`(newvisglm) zzz@zzz:~/yz/AllVscodes/VisualGLM-6B-main$ bash finetune/finetune_visualglm_qlora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:0,1,2,3 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
[2024-01-10 21:42:21,222] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/zzz/anaconda3/envs/newvisglm/lib/libcudart.so'), PosixPath('/home/zzz/anaconda3/envs/newvisglm/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function
errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /home/zzz/anaconda3/envs/newvisglm/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
[2024-01-10 21:43:44,628] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-10 21:43:44,628] [INFO] [runner.py:571:main] cmd = /home/zzz/anaconda3/envs/newvisglm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
[2024-01-10 21:43:46,326] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[... same bitsandbytes BUG REPORT / CUDA SETUP output as above, repeated verbatim by the launcher process — omitted ...]
[2024-01-10 21:43:48,984] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2024-01-10 21:43:48,984] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2024-01-10 21:43:48,984] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2024-01-10 21:43:48,984] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-01-10 21:43:48,984] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-01-10 21:43:48,984] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-01-10 21:43:48,984] [INFO] [launch.py:163:main] dist_world_size=4
[2024-01-10 21:43:48,984] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-01-10 21:43:50,865] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-10 21:43:50,916] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-10 21:43:50,938] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-10 21:43:50,944] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[... same bitsandbytes BUG REPORT / CUDA SETUP output as above, printed once per rank by all four worker processes — omitted ...]
[2024-01-10 21:44:24,179] [INFO] using world size: 4 and model-parallel size: 1
[2024-01-10 21:44:24,179] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2024-01-10 21:44:25,235] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-01-10 21:44:25,276] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=4. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
[2024-01-10 21:44:25,285] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,285] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,286] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,287] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,288] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,288] [INFO] [checkpointing.py:1045:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2024-01-10 21:44:25,288] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2024-01-10 21:44:25,289] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,295] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,296] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,422] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
[2024-01-10 21:44:44,798] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-01-10 21:44:45,735] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-01-10 21:44:46,718] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
[2024-01-10 21:48:30,236] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432905
[2024-01-10 21:48:31,608] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432906
[2024-01-10 21:48:32,635] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432907
[2024-01-10 21:48:32,636] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432908
[2024-01-10 21:48:33,659] [ERROR] [launch.py:321:sigkill_handler] ['/home/zzz/anaconda3/envs/newvisglm/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=3', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = -9
(newvisglm) zzz@zzz:~/yz/AllVscodes/VisualGLM-6B-main$ `
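Since the ranks die with exit code -9 (SIGKILL) right after the 4-bit replacement step, one hedged sketch of launch-flag changes that typically lower memory pressure (all flag names come from the command above; the values are guesses, untested on this 8 GB Pascal setup):

```shell
# Hypothetical tweaks to the flags in finetune/finetune_visualglm_qlora.sh
# (config fragment, not a verified fix):
#   --zero-stage 2                     # shard optimizer state + gradients across ranks
#   --eval-batch-size 1                # was 8; evaluation can spike past training usage
#   --gradient-accumulation-steps 16   # raise to keep the effective batch size
# All other flags unchanged.
```

Note that return code -9 usually means the Linux OOM killer terminated the process for exhausting host RAM, not a CUDA out-of-memory error, so checking `dmesg` output and free system memory during model loading is also worth a look.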