GraphGPT
out of memory
When running stage1.sh, the following error is raised:
module._apply(fn)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
module._apply(fn)
[Previous line repeated 2 more times]
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
param_applied = fn(param)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 1; 23.65 GiB total capacity; 23.08 GiB already allocated; 58.06 MiB free; 23.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8396 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 8397) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
main()
I saw the answer in a previous issue, but it is not clear to me where the --gpu parameter should be added. I tried adding it in stage1.sh, but the script reported that this argument is not one it accepts. Could you explain in detail how to add the option that makes the model load across multiple GPUs? Thank you.
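For context, my current understanding of "loading the model over multiple GPUs" outside of GraphGPT's own flags is the Hugging Face device_map mechanism; whether this is what the earlier issue meant is an assumption on my part. A minimal sketch of that style of loading:

# Minimal sketch (my assumption, not a GraphGPT flag): sharding the base
# checkpoint across all visible GPUs with Hugging Face transformers (+ accelerate).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./vicuna-7b-v1.5-16k"  # same checkpoint as model_path in stage1.sh

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # matches --bf16 True in the script below
    device_map="auto",           # requires `accelerate`; spreads layers over the GPUs
)
print(model.hf_device_map)       # shows which layers ended up on which GPU

I am not sure whether this kind of sharded loading can be combined with the torch.distributed.run launch below, since each of the two spawned processes appears to load its own full copy of the 7B model, which may be what fills the ~24 GiB cards.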
The contents of stage1.sh are as follows:
# fill in the following paths to run the first stage of GraphGPT
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/train_instruct_graphmatch.json
graph_data_path=./graph_data/graph_data_all.pt
pretra_gnn=clip_gt_arxiv
output_model=./stage_1
wandb offline
python3 -m torch.distributed.run --nnodes=1 --nproc_per_node=2 --master_port=20001 \
graphgpt/train/train_mem.py \
--model_name_or_path ${model_path} \
--version v1 \
--data_path ${instruct_ds} \
--graph_content ./arxiv_ti_ab.json \
--graph_data_path ${graph_data_path} \
--graph_tower ${pretra_gnn} \
--tune_graph_mlp_adapter True \
--graph_select_layer -2 \
--use_graph_start_end \
--bf16 True \
--output_dir ${output_model} \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 256 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
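Separately, the OOM message itself points at PYTORCH_CUDA_ALLOC_CONF / max_split_size_mb. My understanding of how that option is set is sketched below (the value 128 is just an arbitrary example I picked, not something from the repo); it could equivalently be exported in stage1.sh before the launch line, since torch.distributed.run passes environment variables on to the worker processes:

# Minimal sketch (assumption about placement): the allocator option has to be set
# before CUDA is first initialized, e.g. at the very top of the training entry point
# or exported in the shell before launching.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported only after the environment variable is in place
print(torch.cuda.is_available())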