binderwang
I ran into the relaunching problem as well. In my case the cause seemed to be the client port 3001; after changing it back to the original 3101, it worked. Hope this helps somebody.
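In case it helps, here is a sketch of pinning the port explicitly on the launch command instead of editing the script. This assumes the conflicting port is the one the `deepspeed` launcher uses for rendezvous, which may not match every setup:

```bash
# Sketch: force the deepspeed launcher onto port 3101 (the port that worked for me)
# instead of letting it fall back to the conflicting one. --master_port is the
# launcher's rendezvous port; adjust if your conflict is on a different port.
deepspeed --master_port 3101 --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b
```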
In the bash script, I just modified the command line like this:

```bash
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
    --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage...
```
And the error is:

```
Beginning of Epoch 1/1, Total Micro Batches 2860
Traceback (most recent call last):
  File "main.py", line 341, in <module>
    main()
  File "main.py", line 310, in main
    ...
```
It is weird that when I only add `--gradient_checkpointing` to the default script, I can successfully run these steps on a V100-32G, using only half of the GPU...
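For comparison, this is roughly what I mean by "the default script plus `--gradient_checkpointing`" (a sketch only; the flags not shown, and the `--zero_stage` value, are just whatever the repo's default run script passes, not settings I am recommending):

```bash
# Sketch of the working variant: the default single-GPU launch with only
# --gradient_checkpointing added. All other arguments stay at the default
# script's values; --zero_stage 0 is a placeholder for that default.
deepspeed --num_gpus 1 main.py \
    --model_name_or_path facebook/opt-1.3b \
    --gradient_checkpointing \
    --zero_stage 0
```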