uuser0748
Previously I also hit OOM on 4x V100 32G; after switching to a single A100 80G, training ran fine. Now on 4x V100 32G I get the error in the title. With two machines (8x V100 32G in total) launched through scripts/multinode_run.sh, I still get the same error. Is this caused by insufficient GPU memory? No other logs are printed.

```
[INFO|modeling_utils.py:2263] 2023-05-29 11:08:21,313 >> Offline mode: forcing local_files_only=True
[INFO|modeling_utils.py:2531] 2023-05-29 11:08:21,313 >> loading weights file /workspace/BELLE-7B-2M/pytorch_model.bin
[INFO|configuration_utils.py:575] 2023-05-29 11:09:55,190...
```
Multi-GPU launch command:

```
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node 2 train.py \
    --model_name_or_path /workspace/BELLE-7B-2M \
    --deepspeed configs/deepspeed_config_stage3.json \
    --train_file /workspace/BELLE-main-3/data/convert_all_0525.json \
    --validation_file /workspace/BELLE-main-3/data/convert_all_0525.json \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs...
```
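
As a rough sanity check on the GPU memory question, here is a back-of-the-envelope sketch of per-GPU memory for model states under ZeRO stage 3. Everything in it is an assumption rather than a measurement: the parameter count of 7e9 is guessed from the model name, the 16 bytes per parameter is the usual mixed-precision Adam accounting (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance), and the helper name `zero3_model_state_gib` is hypothetical. It ignores activations, communication buffers, and fragmentation, and it says nothing about host RAM used while each process loads `pytorch_model.bin`.

```python
def zero3_model_state_gib(num_params: float, num_gpus: int,
                          bytes_per_param: int = 16) -> float:
    """Estimate per-GPU memory (GiB) for weights, gradients, and Adam states.

    Assumes ZeRO stage 3 partitions all three evenly across GPUs and that
    nothing is offloaded to CPU. bytes_per_param=16 is the assumed
    mixed-precision Adam cost (2 + 2 + 12 bytes).
    """
    return num_params * bytes_per_param / num_gpus / 2**30


# Assumed parameter count for a "7B" model; the real count may differ.
ASSUMED_PARAMS = 7e9

for gpus in (2, 4, 8):
    estimate = zero3_model_state_gib(ASSUMED_PARAMS, gpus)
    print(f"{gpus} GPUs: ~{estimate:.1f} GiB per GPU for model states alone")
```

Under these assumptions, 4 GPUs already need roughly 26 GiB each for model states before any activations, which would leave little headroom on a 32G card, while 2 GPUs would not fit without offloading. If the estimate for 8 GPUs looks comfortable but the run still fails, the cause may lie elsewhere, for example in host memory during checkpoint loading.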