xmyu28
I hit the same problem when training. Have you solved it yet?
> Hi @vishaal27, thank you for the great question. Yes, it is easy to do this with LLaVA.
>
> Here is a simple example that you may start with, ...
I've encountered the same error when using DeepSpeed ZeRO-3. Which script should I use for debugging?
Hi, can fine-tuning run on 4x A100 80GB?
With batch_size 2 and gradient_accumulation_steps 8 it still OOMs, which is strange.
```
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: finetune.py FAILED
Failures:
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_22:14:56
  host      : gpu08.cluster.com
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 2666684)
  error_file:
  traceback : Signal...
```
I tried reducing the batch size, but it still errors out.
```
========================================================
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING]
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] *****************************************
[2024-05-26 19:30:30,675] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system...
```
srun reports that an OOM was detected. But is it normal that it still OOMs on 4 GPUs even after halving the batch size? Thanks for the reply.

```shell
torchrun $DISTRIBUTED_ARGS finetune.py \
    --model_name_or_path $MODEL \
    --llm_type $LLM_TYPE \
    --data_path $DATA \
    --eval_data_path $EVAL_DATA \
    --remove_unused_columns false \
    --label_names "labels" \
    --prediction_loss_only false \
    --bf16 true ...
```
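For a sanity check on whether halving the batch size can even help: model states (params, grads, optimizer) are fixed per rank under ZeRO-3 regardless of batch size, and only activations shrink with it. A back-of-the-envelope sketch (the 8e9 parameter count and bf16/fp32 byte sizes below are illustrative assumptions, not numbers from this thread):

```python
# Rough ZeRO-3 per-GPU memory estimate for model states only.
# Activations and CUDA allocator overhead are extra and often dominate
# at long sequence lengths. All concrete numbers are assumptions.

def zero3_model_states_gb(n_params: float, n_gpus: int,
                          offload_optimizer: bool = False) -> float:
    """Estimated GiB per rank for params + grads + Adam states.

    ZeRO-3 shards bf16 params (2 B/param), bf16 grads (2 B/param), and
    fp32 optimizer states (master copy + momentum + variance = 12 B/param)
    across all ranks; offloading moves the 12 B/param to host RAM.
    """
    per_param = 2 + 2 + (0 if offload_optimizer else 12)  # bytes per param
    return n_params * per_param / n_gpus / 2**30

# Example: an assumed 8e9-parameter model on 4 GPUs
print(round(zero3_model_states_gb(8e9, 4), 1))        # ~29.8 GiB on GPU
print(round(zero3_model_states_gb(8e9, 4, True), 1))  # ~7.5 GiB, rest on CPU
```

Note that exit code -9 usually means the host OS killed the process, so with optimizer offload the pressure may be on system RAM rather than GPU memory.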
Hi, I switched the optimizer to CPU and now get this error: @Cuiunbo @qyc-98

```
Exception ignored in:
Traceback (most recent call last):
  File "/home/xmyu/anaconda3/envs/MiniCPM-V/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
```
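For context, the `AttributeError` in `__del__` is typically a secondary symptom: it fires when the `cpu_adam` extension failed to build or load earlier, so `ds_opt_adam` was never assigned. CPU offload of the optimizer under ZeRO-3 is usually enabled with a config fragment like the following (a hedged sketch of a standard DeepSpeed JSON config, not the exact file used in this thread):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```

If the extension build is the problem, check the earlier part of the log for a compiler error from the `cpu_adam` JIT build rather than debugging the destructor message itself.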