
Gradient of the loss w.r.t sharded parameters

Open LalchandPandia opened this issue 8 months ago • 2 comments

Describe the bug
The gradient of the loss comes back as None during inference. I am fine-tuning Llama 2 using Accelerate + DeepSpeed ZeRO-3. During evaluation, which runs after every checkpoint step, I need to calculate the gradient of the loss w.r.t. a certain transformer value (V) layer. As I understand it, the value matrix is sharded, and when I try to read the gradient I get an error saying that grad is set to None. Is there a cleaner way to do this using Accelerate APIs?
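For context, under ZeRO stage 3 the gradients themselves are reduced and partitioned during the backward pass, so reading param.grad on the module is expected to return None. Below is a minimal sketch of one possible way to read a full value-projection gradient, assuming DeepSpeed's deepspeed.utils.safe_get_full_grad helper applies to this setup and using a hypothetical Llama-style module path (model.layers[0].self_attn.v_proj); model and accelerator are the objects from the repro script further down, so this is only an illustration, not a confirmed fix:

```python
from deepspeed.utils import safe_get_full_grad

# Run on every rank, after accelerator.backward(loss) and before the engine/optimizer step.
unwrapped = accelerator.unwrap_model(model)              # strips the DeepSpeedEngine wrapper
v_proj = unwrapped.model.layers[0].self_attn.v_proj      # hypothetical layer index/path
full_grad = safe_get_full_grad(v_proj.weight)            # gathers the partitioned gradient
if full_grad is not None:
    print("v_proj weight grad norm:", full_grad.norm().item())
```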

To Reproduce
Steps to reproduce the behavior:

  1. Minimal script:

```python
import torch
import deepspeed
from accelerate import Accelerator
from accelerate.state import AcceleratorState
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_gradients(model, input_ids, targets):
    valid_positions = (targets != -100).nonzero(as_tuple=True)[0]
    input_slice = slice(0, valid_positions[0].item())
    end_input_slice = valid_positions[-1].item()

    embeddings = model.get_input_embeddings()
    with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
        embedding_weights = embeddings.weight
        embedding_size = embedding_weights.shape[0]

    one_hot = torch.zeros(
        input_ids[input_slice].shape[0],
        embedding_size,
        device=model.device,
        dtype=embeddings.weight.dtype
    )
    one_hot.scatter_(
        1,
        input_ids[input_slice].unsqueeze(1),
        torch.ones(one_hot.shape[0], 1, device=model.device, dtype=embeddings.weight.dtype)
    )
    one_hot.requires_grad_()

    with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
        input_embeds = one_hot @ embeddings.weight
        input_embeds.requires_grad_()
        input_embeds.retain_grad()
        print('input_embeds grad ', input_embeds.grad, ' input_embeds ', input_embeds.shape)
        input_ids = input_ids.cpu().tolist()
        # embeddings corresponding to only the input ids
        embeds = embeddings.weight[input_ids[:end_input_slice + 1], :]

    full_embeds = torch.cat(
        [
            embeds[:input_slice.start, :],
            input_embeds,
            embeds[input_slice.stop:, :]
        ],
        dim=0)
    full_embeds = full_embeds.unsqueeze(0)
    print('full_embeds ', full_embeds.shape)

    logits = model(inputs_embeds=full_embeds).logits
    loss = torch.nn.CrossEntropyLoss()(logits[0, :, :], targets[:end_input_slice + 1])
    accelerator.backward(loss)
    return one_hot.grad.clone(), input_embeds.grad.clone()


model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, fused=True)
accelerator = Accelerator()

# this line is only necessary because we don't prepare a dataset
AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'] = 8
model, optimizer = accelerator.prepare(model, optimizer)
model.train()

input = torch.tensor([1, 894, 29901, 5122, 10753, 304, 14294, 670, 6567, 9098,
                      491, 14051, 10549, 963, 29889, 8449, 19309, 7101, 674, 7738,
                      278, 1556, 12871, 29973, 13, 22550, 29901, 15589, 5112, 1516]).to(model.device)

target = torch.tensor([-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                       -100, -100, -100, -100, -100, 22550, 29901, 15589, 5112, 1516]).to(model.device)

onehot_grad, inputembed_grad = token_gradients(model, input, target)
```

  2. What packages are required and their versions: see the environment info below.
  3. How to run the script: I pass the following config, present in stage3_no_offloading_accelerate.conf:

```json
{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

My launch command:

```
accelerate launch --mixed_precision bf16 --num_machines 1 --num_processes $NUM_GPUS \
    --use_deepspeed --deepspeed_config_file stage3_no_offloading_accelerate.conf script.py
```
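As a quick sanity check (only a sketch, reusing the same AcceleratorState access as in the script above), the resolved DeepSpeed config can be read back through the Accelerate plugin once Accelerator() has been constructed, to confirm that ZeRO stage 3 is actually in effect:

```python
from accelerate.state import AcceleratorState

# after Accelerator() has been constructed with the DeepSpeed plugin
ds_config = AcceleratorState().deepspeed_plugin.deepspeed_config
assert ds_config["zero_optimization"]["stage"] == 3, ds_config["zero_optimization"]
```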

Expected behavior
I should get the gradient of the loss w.r.t. the value vector.
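If the gradient w.r.t. the value activations (the output of the V projection) would also serve, rather than the gradient of the sharded weight itself, a tensor hook sidesteps parameter partitioning entirely. A rough sketch under that assumption (the layer index and Llama-style module path are hypothetical; model and accelerator are the objects from the script above):

```python
captured = {}

def save_grad(name):
    def hook(grad):
        captured[name] = grad.detach()
    return hook

unwrapped = accelerator.unwrap_model(model)
v_proj = unwrapped.model.layers[0].self_attn.v_proj   # hypothetical layer index/path

def forward_hook(module, inputs, output):
    # attach a gradient hook to the value activations produced in this forward pass
    if output.requires_grad:
        output.register_hook(save_grad("v_layer0"))

handle = v_proj.register_forward_hook(forward_hook)

# ... run the forward pass and accelerator.backward(loss) as in token_gradients ...
# captured["v_layer0"] then holds d(loss)/d(value activations) for this micro-batch
handle.remove()
```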

ds_report output

Accelerate version: 0.31.0
Platform: Linux-5.15.0-126-generic-x86_64-with-glibc2.35
accelerate bash location: /net/scratch/lcpandia/python_3_11/bin/accelerate
Python version: 3.11.9
Numpy version: 1.26.3
PyTorch version (GPU?): 2.4.0+cu118 (True)
PyTorch XPU available: False
PyTorch NPU available: False
PyTorch MLU available: False
System RAM: 503.56 GB
GPU type: NVIDIA A100 80GB PCIe
Accelerate default config:
Not found
deepspeed 0.15.0

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types: 4× NVIDIA A100 80GB PCIe
  • (if applicable) DeepSpeed-MII version: 0.15.0
  • (if applicable) Hugging Face Transformers/Accelerate/etc. versions
  • Python version
  • Any other relevant info about your setup

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.

LalchandPandia · Apr 23 '25

@LalchandPandia - could you update the title to reflect your issue?

loadams · Apr 24 '25

@loadams I have changed the title

LalchandPandia · Apr 30 '25

@LalchandPandia
Same question. I want to record the grad_norm for each parameter matrix during training. When I access the parameter gradient via param.grad, I get None. Even if I pre-gather the parameter using with GatheredParameters([param], modifier_rank=0): print(param.grad), the output is still None. Any suggestions? Have you fixed it?

weeknan · Jul 22 '25
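For the per-parameter grad-norm use case in the last comment, one possible sketch along the same lines as above, again assuming deepspeed.utils.safe_get_full_grad is available and is called on every rank between backward() and the engine step (GatheredParameters gathers parameter values, not their gradients, which would be consistent with the None seen even after gathering):

```python
from deepspeed.utils import safe_get_full_grad

def log_grad_norms(model):
    # call on all ranks, after backward() and before the optimizer/engine step
    for name, param in model.named_parameters():
        full_grad = safe_get_full_grad(param)   # gathers the partitioned ZeRO-3 gradient
        if full_grad is not None:
            print(f"{name}: {full_grad.norm().item():.4f}")
```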