[BUG] Huge memory leak when using `register_full_backward_hook`
Describe the bug
When trying to use `register_full_backward_hook` in Megatron-DeepSpeed, I get a huge memory leak.
I'm reporting it here because the leak disappears as soon as I turn off DeepSpeed.
To Reproduce
I tried to create a small independent example that uses DeepSpeed directly, but I couldn't make it leak.
So let's work with Megatron-DeepSpeed instead. We can use either the BigScience fork or your original repo - it leaks in both (since the problem is triggered by DeepSpeed).
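For reference, this is roughly the shape of the standalone attempt (the toy model, the config values, and the file name `repro.py` are just placeholders); it does not leak for me:

```python
# repro.py -- rough sketch of my standalone attempt (toy model and config values
# are placeholders); launch on a single GPU with: deepspeed --num_gpus 1 repro.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},
    "fp16": {"enabled": True},
}

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

# the same kind of no-op hook as in the Megatron patch below
def backward_hook(module, grad_input, grad_output):
    pass

model.apply(lambda m: m.register_full_backward_hook(backward_hook))

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(50):
    x = torch.randn(1, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    if step % 10 == 0:
        print(f"step {step}: {torch.cuda.memory_allocated() / 2**20:.0f} MB allocated")
```

Back to the actual repro: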
```bash
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
```
Now apply this patch:
```diff
diff --git a/megatron/mpu/cross_entropy.py b/megatron/mpu/cross_entropy.py
index 8c790cd..a0b40b1 100644
--- a/megatron/mpu/cross_entropy.py
+++ b/megatron/mpu/cross_entropy.py
@@ -107,4 +107,4 @@ class _VocabParallelCrossEntropy(torch.autograd.Function):
 
 def vocab_parallel_cross_entropy(vocab_parallel_logits, target):
     """Helper function for the cross entropy."""
-    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target)
+    return _VocabParallelCrossEntropy.apply(vocab_parallel_logits.clone(), target)
diff --git a/megatron/training.py b/megatron/training.py
index e3a168c..9389029 100644
--- a/megatron/training.py
+++ b/megatron/training.py
@@ -692,6 +692,13 @@ def train(forward_step_func, model, optimizer, lr_scheduler,
     # Write args to tensorboard
     write_args_to_tensorboard()
 
+    def backward_hook(module, input, output): pass
+    def _register_backward_hook(module):
+        module.register_full_backward_hook(backward_hook)
+        #module.register_backward_hook(backward_hook)
+    model[0].apply(_register_backward_hook)
+
+
     # Turn on training mode which enables dropout.
     for model_module in model:
         model_module.train()
```
The cross_entropy change works around an issue in Megatron-LM that is unrelated to this report, but it is required in order to be able to use backward hooks at all.
As you can see, the training.py change just adds a no-op backward hook, a very trivial change.
If I use the new `register_full_backward_hook`, I get a huge leak when running train(). If I use the deprecated `register_backward_hook`, all is good.
If I turn off DeepSpeed, the leak goes away as well.
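To be explicit about which calls I'm toggling between (plain PyTorch, nothing Megatron-specific here):

```python
import torch

# the hook body is identical in both cases -- a no-op
def backward_hook(module, grad_input, grad_output):
    pass

layer = torch.nn.Linear(4, 4)

# the newer API -- this is the variant that leaks in my Megatron-DeepSpeed runs
handle = layer.register_full_backward_hook(backward_hook)

# the deprecated API -- this variant does not leak
# handle = layer.register_backward_hook(backward_hook)

layer(torch.randn(2, 4)).sum().backward()
handle.remove()
```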
I experimented with removing various config options and disabling ZeRO stage 1; it didn't make a difference, so the problem is somewhere in the engine itself.
I started researching the cause of the leak in general and found this discussion: https://discuss.pytorch.org/t/register-full-backward-hook-causes-memory-leak/122904, which suggests that somewhere in backward a graph is built that contains a self-reference loop, so the tensors never get released.
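To check that theory, a small diagnostic like this (my own helper, not part of the patch) can be dropped into the training loop: if allocated memory keeps climbing between steps but falls back only after an explicit gc.collect(), the tensors are indeed kept alive by Python-level reference cycles rather than by live references.

```python
import gc
import torch

def check_for_cycles(step, device=0):
    """Print CUDA memory allocated before/after a forced GC pass.

    A large drop after gc.collect() means the tensors were only reachable
    through reference cycles, matching the explanation in the forum thread.
    """
    before = torch.cuda.memory_allocated(device)
    gc.collect()
    after = torch.cuda.memory_allocated(device)
    print(f"step {step}: {before / 2**20:.1f} MB -> {after / 2**20:.1f} MB after gc.collect()")
```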
Using the above patch you should be able to reproduce the leak within 10 iterations on a tiny model. I'm not sure how you test Megatron-DeepSpeed; you can, for example, use our test suite from https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_training.py.
Or you can use the script below, but you will need to create a bit of data and grab the vocab files from https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints:
```bash
CHECKPOINT_PATH=checkpoints/gpt2
VOCAB_FILE=data/gpt2-vocab.json
MERGE_FILE=data/gpt2-merges.txt
#DATA_PATH=data/meg-gpt2_text_document
DATA_PATH=data/meg-gpt2_oscar-combined_text_document
TENSORBOARD_PATH=output_dir/tensorboard
N_GPUS=2
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=16
TP_SIZE=2
PP_SIZE=1
SEQ_LEN=1024
SAVE_INTERVAL=50
# --train-samples 10_000 \
# --exit-interval $EXIT_INTERVAL \
GPT_ARGS=" \
--num-layers 2 \
--hidden-size 64 \
--num-attention-heads 2 \
--ffn-hidden-size 256 \
--seq-length $SEQ_LEN \
--max-position-embeddings $SEQ_LEN \
--micro-batch-size $MICRO_BATCH_SIZE \
--rampup-batch-size 2 2 1_000 \
--global-batch-size $GLOBAL_BATCH_SIZE \
--train-samples 100 \
--optimizer adam \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-8 \
--lr 1e-4 \
--lr-warmup-samples 5 \
--clip-grad 1.0 \
--weight-decay 1e-1 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--fp16 \
--partition-activations \
--seed 42 \
"
# --tokenizer-type PretrainedFromHF \
# --tokenizer-name-or-path t5-small \
# --train-iters 500 \
OUTPUT_ARGS=" \
--exit-interval 100 \
--log-interval 10 \
--save-interval $SAVE_INTERVAL \
--eval-interval 100 \
--eval-iters 10 \
--checkpoint-activations \
"
DATA_ARGS=" \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH \
--tensorboard-dir $TENSORBOARD_PATH \
--tensorboard-queue-size 5 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
"
ZERO_STAGE=1
config_json="./ds_config.json"
# Deepspeed figures out GAS dynamically from dynamic GBS via set_train_batch_size()
cat <<EOT > $config_json
{
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "train_batch_size": $GLOBAL_BATCH_SIZE,
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": $ZERO_STAGE
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 12
  },
  "steps_per_print": 2000,
  "wall_clock_breakdown": false
}
EOT
DEEPSPEED_ARGS=" \
--deepspeed \
--deepspeed_config ${config_json} \
--zero-stage ${ZERO_STAGE} \
--deepspeed-activation-checkpointing \
"
ALL_ARGS="$GPT_ARGS $OUTPUT_ARGS $DATA_ARGS $DEEPSPEED_ARGS"
# if you can't stand pt-1.9 launcher noise
export LOGLEVEL=WARNING
#PYTHONPATH=~/github/00optimize/deepspeed-big-science:/hf/Megatron-DeepSpeed-master
#PYTHONPATH=/hf/Megatron-DeepSpeed-master
LAUNCHER="deepspeed --num_gpus $N_GPUS --master_port 6777"
export CMD=" \
env USE_TF=0 \
$LAUNCHER pretrain_gpt.py \
--tensor-model-parallel-size $TP_SIZE \
--pipeline-model-parallel-size $PP_SIZE \
--distributed-backend nccl \
$ALL_ARGS \
"
echo $CMD
#rm -rf $CHECKPOINT_PATH
$CMD
```
I'm testing with pytorch-1.10 and deepspeed@master.
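For completeness, the exact versions on my side can be dumped with:

```python
import torch, deepspeed
print("torch", torch.__version__, "cuda", torch.version.cuda)
print("deepspeed", deepspeed.__version__)
```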
Thank you!
@jeffra, @tjruwase
Did you solve it?