
Failed to reproduce the offload example with huggingface transformers

Open arminzhu opened this issue 8 months ago • 10 comments

I failed to reproduce the ZeRO-Offload example from the DeepSpeed tutorials with Hugging Face Transformers. The main problem is that training needs GPU memory of at least 3x the fp16 parameter size, whereas I expected roughly 1x the parameters plus buffers. I have turned down the bucket sizes ("allgather_bucket_size" and "reduce_bucket_size"), turned off "overlap_comm", and turned on "gradient_checkpointing". Besides, my batch size is 1 and seq_len is 20. Why is the memory consumption so large? @tjruwase
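For reference, here is a rough back-of-the-envelope estimate of the model size defined in trainer.py below. This is only a sketch based on the standard GPT-2 block layout; the exact count reported by Transformers will differ slightly.

```python
# Rough parameter-count estimate for the GPT2Config used in trainer.py below.
# Sketch only: approximates the standard GPT-2 block layout (attention + MLP +
# layer norms); the exact number reported by Transformers will differ slightly.
vocab_size, n_positions, n_embd, n_layer = 50257, 1024, 4096, 32

embeddings = (vocab_size + n_positions) * n_embd        # token + position embeddings (lm_head is tied)
per_block = 12 * n_embd ** 2 + 13 * n_embd              # qkv/proj + 4x MLP weights, biases, layer norms
total = embeddings + n_layer * per_block + 2 * n_embd   # + final layer norm

fp16_gib = 2 * total / 2**30
print(f"~{total / 1e9:.1f}B parameters, ~{fp16_gib:.1f} GiB in fp16, ~{3 * fp16_gib:.1f} GiB at 3x")
```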

trainer.py

```python

import torch
from transformers import GPT2Tokenizer, GPT2Config, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq
from datasets import load_dataset

if __name__ == '__main__':

    # Initializing a TrainingArguments
    training_args = TrainingArguments(
        output_dir='./',
        do_train=True,
        do_eval=False,
        do_predict=False,
        eval_strategy='no',
        gradient_accumulation_steps=1,
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        num_train_epochs=20,
        save_strategy="no",
        logging_dir='./logs',
        deepspeed='ds_config.json',
        fp16=True,
        bf16=False,
        learning_rate=3e-05,
        adam_beta1=0.8,
        adam_beta2=0.999,
        weight_decay=3e-07,
        warmup_steps=10
    )

    # Initializing a GPT2 model
    configuration = GPT2Config(vocab_size=50257,
                               n_positions=1024,
                               n_embd=4096,
                               n_layer=32,
                               n_head=32,
                               use_cache=False)
    model = GPT2LMHeadModel(configuration)
    # Initializing a GPT2 tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.pad_token = tokenizer.eos_token

    def tokenize_function(examples):
        inputs = tokenizer(examples['text'], return_tensors='pt', truncation=True, padding='max_length', max_length=20)
        temp = inputs['input_ids']
        inputs['input_ids'] = temp.clone()
        inputs['labels'] = temp.clone()
        return inputs

    # Load dataset
    dataset = load_dataset("json", data_files="text_short.json")
    # Tokenize the dataset
    tokenized_datasets = dataset.map(tokenize_function, batched=True, batch_size=4, remove_columns=dataset['train'].column_names)

    model.half()

    # Initialize Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets['train'],
        data_collator=DataCollatorForSeq2Seq(tokenizer),
    )
    trainer.train()

```

ds_config.json

```json
{

"tensorboard": {
  "enabled": true,
  "output_path": "./logs",
  "job_name": "tensorboard_log"
},

"csv_monitor": {
  "enabled": true,
  "output_path": "./logs",
  "job_name": "csv_log"
},

"bf16": {
      "enabled": "auto"
  },

"fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },

  "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": 3e-5,
          "betas": [0.8, 0.999],
          "eps": 1e-8,
          "weight_decay": 3e-7
      }
  },

  "scheduler": {
      "type": "WarmupLR",
      "params": {
          "warmup_min_lr": 0,
          "warmup_max_lr": 3e-5,
          "warmup_num_steps": 10
      }
  },

  "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      },
      "round_robin_gradients": true,
      "allgather_partitions": true,
      "allgather_bucket_size": 2e4,
      "overlap_comm": false,
      "reduce_scatter": true,
      "reduce_bucket_size": 2e4,
      "contiguous_gradients": true
  },
  "gradient_accumulation_steps": 1,
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_checkpointing": true,
  "wall_clock_breakdown": false

}
```
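For reference, DeepSpeed also ships a ZeRO-2 memory estimator that prints the expected model-state footprint (parameters, gradients, optimizer states) for a given model and GPU count. A minimal sketch, assuming the helper is available in the installed DeepSpeed version; note it covers only model states, not activations or communication buckets:

```python
# Sketch: estimate the expected ZeRO-2 model-state memory for the model above.
# Assumes the estimator helper exists in the installed DeepSpeed version; it
# accounts for parameters, gradients and optimizer states only.
from transformers import GPT2Config, GPT2LMHeadModel
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live

config = GPT2Config(vocab_size=50257, n_positions=1024, n_embd=4096, n_layer=32, n_head=32)
model = GPT2LMHeadModel(config)  # instantiated on CPU in fp32

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
```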

arminzhu avatar Mar 28 '25 06:03 arminzhu

@LaosGAmin, can you share your log or stack trace?

tjruwase avatar Mar 28 '25 17:03 tjruwase

> @LaosGAmin, can you share your log or stack trace?

Thank you very much for your reply! You are my last hope. How can I get that? Do you mean the printed info?

arminzhu avatar Mar 29 '25 01:03 arminzhu

@tjruwase Thank you very much for your reply! You are my last hope. How can I get them? Do you mean the printed info?

arminzhu avatar Mar 30 '25 10:03 arminzhu

@LaosGAmin, yes the error message will contain the stack trace. Also, you can share the full output log.

tjruwase avatar Mar 30 '25 18:03 tjruwase

@tjruwase In fact, it ran very well and there were no errors. It's just that the GPU memory consumption is not as low as expected. A few days ago, I commented on https://github.com/deepspeedai/DeepSpeed/commit/7b5b06602d5941cf7ea6170062d3f81c9002d788. I modified the async data copy (from host to device) on my local machine, and the memory consumption dropped to about 2x the parameters. But I don't know why tensor.copy_() allocates additional memory in DeepSpeed. Besides, where does the remaining ~1x-parameter memory consumption come from? Is it caused by offloading gradients? And this is the info:

[image: attached screenshot of the run output]
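One way to narrow down the copy_() question is to sample torch.cuda.memory_allocated() around an isolated host-to-device copy. A minimal standalone probe (a sketch, not DeepSpeed's actual offload code path):

```python
# Minimal probe: does a host-to-device copy_() into a pre-allocated tensor
# allocate extra device memory? Standalone sketch, not DeepSpeed's offload path.
import torch

dst = torch.empty(256 * 1024 * 1024, dtype=torch.float16, device='cuda')  # 512 MiB target on GPU
src = torch.empty(dst.shape, dtype=torch.float16, pin_memory=True)        # pinned host source

torch.cuda.synchronize()
before = torch.cuda.memory_allocated()
dst.copy_(src, non_blocking=True)   # async H2D copy
torch.cuda.synchronize()
after = torch.cuda.memory_allocated()

print(f"extra device memory allocated by copy_: {(after - before) / 2**20:.1f} MiB")
```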

arminzhu avatar Mar 31 '25 05:03 arminzhu

@LaosGAmin, sorry, I previously misunderstood your question. Now I understand that you are trying to account for the memory consumption of your run. Can you share how you are currently measuring memory usage? It sounds like you are saying the memory usage is 3x what is expected? Can you please give more details on the expected and observed memory usage?

tjruwase avatar Mar 31 '25 12:03 tjruwase

@tjruwase The memory consumption is observed with nvidia-smi, and torch.cuda.max_memory_reserved() gives a similar value. The expected memory consumption should be similar to the DeepSpeed ZeRO-Offload tutorial: when training a 10B-parameter model, 32 GiB of GPU memory should be enough. But on my machine, it needs 64 GiB of GPU memory. And I tried many different model sizes; it always needs at least 3x the fp16 parameter memory. When I modified the async data copy (from host to device) on my local machine, the memory consumption dropped to about 2x the fp16 parameters. So I think it would be caused by gradients. Are the gradients not deleted? Because when I turned off "overlap_comm", nothing changed. Of course, there is another possibility: a second copy of the parameters is stored in GPU memory.

arminzhu avatar Apr 01 '25 06:04 arminzhu

@tjruwase The memory consumption is observed with nvidia-smi, and torch.cuda.max_memory_reserved() gives a similar value. The expected memory consumption should be similar to the DeepSpeed ZeRO-Offload tutorial: when training a 10B-parameter model, 32 GiB of GPU memory should be enough. But on my machine, it needs 64 GiB of GPU memory. And I tried many different model sizes; it always needs at least 3x the fp16 parameter memory. When I modified the async data copy (from host to device) on my local machine, the memory consumption dropped to about 2x the fp16 parameters. So I think it may be caused by gradients. Are the gradients not deleted in time? And when I turned off "overlap_comm", nothing changed. Of course, there is another possibility: a second copy of the parameters is stored in GPU memory.

arminzhu avatar Apr 02 '25 10:04 arminzhu

@LaosGAmin, both nvidia-smi and torch.cuda.max_memory_reserved() report more than the current GPU memory consumption. A more precise API is torch.cuda.memory_allocated: https://pytorch.org/docs/stable/generated/torch.cuda.memory_allocated.html#torch.cuda.memory_allocated.

You can also instrument your code with the DeepSpeed utility see_memory_usage(): https://github.com/deepspeedai/DeepSpeed/blob/79ff16272274e9f71dc631716cf20224190b5d11/deepspeed/runtime/utils.py#L771
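For example, a minimal way to report memory at interesting points in the run (a sketch; where you call it, e.g. from a Trainer callback, is up to you):

```python
# Sketch: report GPU memory at interesting points (e.g. after forward/backward/step).
# torch.cuda.memory_allocated counts only memory held by live tensors, while
# max_memory_reserved also includes the caching allocator's reserved-but-free blocks.
import torch
from deepspeed.runtime.utils import see_memory_usage

def report(tag):
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.max_memory_reserved() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, max_reserved={reserved:.2f} GiB")
    see_memory_usage(tag, force=True)  # DeepSpeed's own CPU/GPU memory report

# e.g. report("after forward"); report("after backward"); report("after step")
```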

tjruwase avatar Apr 02 '25 12:04 tjruwase

@tjruwase Looking at the memory usage doesn't help. If the free memory shown by nvidia-smi is not enough, it just OOMs. And I really can't reproduce the result from the tutorial when I use ZeRO-Offload with Hugging Face Transformers.

arminzhu avatar Apr 02 '25 13:04 arminzhu