GraphGPT

out of memory

xxrrnn opened this issue on Apr 01, 2024 • 3 comments

While running stage1.sh, the following error appears:

    module._apply(fn)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 1; 23.65 GiB total capacity; 23.08 GiB already allocated; 58.06 MiB free; 23.09 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 8396 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 8397) of binary: /root/miniconda3/envs/graphgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/graphgpt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
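
From the error message I understand that the allocator option can be set through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching. Is something like the following what is meant? (This is only my own sketch; the 128 MiB split size is an arbitrary placeholder I picked, not a value from the documentation or this repo.)

# Placeholder sketch (my own guess): enable the allocator option mentioned in the error.
# The 128 MiB split size is an arbitrary value I chose, not a recommendation.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128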

I saw the answer in an earlier issue, but it is not clear to me where the --gpu argument should be added. I tried adding it in stage1.sh, but I got an error saying the script does not accept that argument. Could you explain in detail how to add the flag that tells the code to load the model across multiple GPUs? Thanks. The contents of stage1.sh are as follows:

# to fill in the following path to run the first stage of our GraphGPT!
model_path=./vicuna-7b-v1.5-16k
instruct_ds=./data/train_instruct_graphmatch.json
graph_data_path=./graph_data/graph_data_all.pt
pretra_gnn=clip_gt_arxiv
output_model=./stage_1
wandb offline
python3 -m  torch.distributed.run  --nnodes=1 --nproc_per_node=2 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 256 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
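
To clarify what I have been trying: my current guess is that selecting GPUs is done with CUDA_VISIBLE_DEVICES, added above the python3 -m torch.distributed.run line, rather than with an argument of train_mem.py. Is the sketch below the right place to put it, or is --gpu a separate parameter somewhere else? (The device ids 0,1 are only an example on my part.)

# My guess at how to specify the GPUs: expose only the devices that
# torch.distributed.run should use. The ids 0,1 are illustrative;
# --nproc_per_node in the launch command above must equal the number of ids listed here.
export CUDA_VISIBLE_DEVICES=0,1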

xxrrnn • Apr 01 '24 16:04