Results: 4 comments of hikerell

> @ucas010, yes, a 1.3B model with the Adam optimizer needs at least 1.3 * 14GB ≈ 18GB of GPU memory. Your error message suggests that your GPU has ~14GB. Can you try multiple GPUs so that...

Me too.

GPU: 1x A100 40G

`cat training.log`:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB (GPU 0; 39.56 GiB total capacity; 38.49 GiB already allocated; 96.56 MiB...
```

Maybe I have resolved the error by reducing the batch size. I modified the step-1 script [training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh](https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh), adding `--per_device_train_batch_size 8` and `--per_device_eval_batch_size 8`:

```shell
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b...
```
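For reference, a minimal sketch of what the adjusted single-GPU invocation could look like. Only the two batch-size flags are the change described above; the other arguments (output directory, log redirect, gradient accumulation) are illustrative assumptions and should be taken from the actual run_1.3b.sh in the repo:

```shell
# Sketch of a reduced-batch-size step-1 SFT run on a single GPU.
# The --per_device_*_batch_size flags are the modification described above;
# the remaining flags/paths are assumptions, not the script's exact defaults.
OUTPUT=./output   # assumed output directory
mkdir -p $OUTPUT

deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_accumulation_steps 8 \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```

Lowering the per-device batch size reduces activation memory on the GPU; if the effective batch size matters for training quality, the reduction can be compensated with a larger `--gradient_accumulation_steps`.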