Default configuration running with V100-32G causes OOM
When using the official default configuration on a single V100-32G, I hit an OOM across the whole pipeline. Following other issues mentioned above, I changed the batch_size from 16 to 8 and it works. So I just wonder whether this is expected, because I did not see any instruction about it.
@binderwang, can you please share the command line and stack trace of the OOM? Thanks!
Assuming you are using the default settings for the single-gpu deployment type, I think this is expected with only 32GB of memory. Lowering the batch size will decrease memory requirements.
In the bash file, I just modified the command line as follows:
deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
   --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage $ZERO_STAGE \
   --only_optimize_lora --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
And I set zero_stage to 2. With the command above I can run Step 1 successfully on a single V100-32G, but if I change nothing in the default command line, it causes an OOM error.
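For context, the only other change was setting the variables that the command references; a minimal sketch of that part of the bash file (the output path here is just a placeholder, not the script's actual value) would be:

# Sketch only: ZERO_STAGE and OUTPUT as referenced by the command above.
ZERO_STAGE=2
OUTPUT=./output
mkdir -p $OUTPUT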
And the error is as follows:
Beginning of Epoch 1/1, Total Micro Batches 2860
Traceback (most recent call last):
File "main.py", line 341, in
@binderwang this is expected behavior. The scripts we released are intended for A6000/A100 GPUs. If you are running on a GPU with less memory, you will be required to modify the scripts (e.g., lower the batch size) so that the model can train with less available memory.
It is weird that when I only add --gradient_checkpointing to the default script, I can successfully run these steps on a V100-32G, and it uses only half of the GPU memory.
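For reference, a rough sketch of that run (the released defaults kept as-is and only the checkpointing flag appended; the train batch size of 16 is the default mentioned above, and the eval batch size is assumed to match it):

# Sketch only: default arguments plus --gradient_checkpointing.
deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 16 --per_device_eval_batch_size 16 \
   --gradient_checkpointing --zero_stage $ZERO_STAGE \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log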
@binderwang the options that will have the greatest impact on memory requirements right now are --gradient_checkpointing, --only_optimize_lora, --zero_stage {0|1|2|3}, and --per_device_*_batch_size.
The examples we provide were tuned for specific systems (primarily A6000 / A100 GPUs) and we encourage users to modify the examples to fit their systems. We are also working on an autotuning feature which will automatically detect optimal settings for the parameters listed above. Stay tuned for those updates!
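In the meantime, a rough sketch of a single-GPU invocation for a 32GB card that combines the options listed above (values taken from this thread, not from the released script verbatim) could look like:

# Sketch only: memory-reducing options combined for a V100-32G.
deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
   --gradient_accumulation_steps 2 --gradient_checkpointing \
   --lora_dim 128 --only_optimize_lora --zero_stage 2 \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log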