Gradient of the loss w.r.t. sharded parameters
Describe the bug
I am getting the gradient of the loss as None during inference. I am fine-tuning Llama 2 using Accelerate + DeepSpeed ZeRO-3. During evaluation, which runs after every checkpoint step, I need to compute the gradient of the loss w.r.t. a certain transformer value (V) layer. As per my understanding, the value matrix is sharded, and when I try to get the gradient I find that grad is set to None. Is there a cleaner way to do this using the Accelerate APIs?
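One possible way to read such a gradient is sketched below. It assumes DeepSpeed's `deepspeed.utils.safe_get_full_grad` utility and a Llama-style `self_attn.v_proj` attribute path (both assumptions on my part, not something confirmed in this issue), so treat it as a suggestion to adapt rather than a verified fix:

```python
# Hedged sketch (assumed approach): read the full, unsharded gradient of a
# value-projection weight under ZeRO-3 after backward. The `v_proj` attribute path
# assumes a Llama-style model and may need adjusting depending on how
# accelerator.prepare() wraps the model.
from deepspeed.utils import safe_get_full_grad

def v_proj_grad(model, layer_idx=0):
    # Unwrap the DeepSpeed engine if present.
    base = model.module if hasattr(model, "module") else model
    v_weight = base.model.layers[layer_idx].self_attn.v_proj.weight
    # Assembles the full gradient even though v_weight itself is sharded and
    # v_weight.grad is None under ZeRO-3.
    return safe_get_full_grad(v_weight)

# Usage: call after accelerator.backward(loss) and before optimizer.step().
# v_grad = v_proj_grad(model, layer_idx=0)
```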
To Reproduce
Steps to reproduce the behavior:
- script.py:

```python
import torch
import deepspeed
from accelerate import Accelerator
from accelerate.state import AcceleratorState
from transformers import AutoModelForCausalLM, AutoTokenizer


def token_gradients(model, input_ids, targets):
    valid_positions = (targets != -100).nonzero(as_tuple=True)[0]
    input_slice = slice(0, valid_positions[0].item())
    end_input_slice = valid_positions[-1].item()

    embeddings = model.get_input_embeddings()
    with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
        embedding_weights = embeddings.weight
        embedding_size = embedding_weights.shape[0]

    # one-hot matrix over the vocabulary for the input positions
    one_hot = torch.zeros(
        input_ids[input_slice].shape[0],
        embedding_size,
        device=model.device,
        dtype=embeddings.weight.dtype
    )
    one_hot.scatter_(
        1,
        input_ids[input_slice].unsqueeze(1),
        torch.ones(one_hot.shape[0], 1, device=model.device, dtype=embeddings.weight.dtype)
    )
    one_hot.requires_grad_()

    with deepspeed.zero.GatheredParameters(embeddings.weight, modifier_rank=None):
        input_embeds = (one_hot @ embeddings.weight)
        input_embeds.requires_grad_()
        input_embeds.retain_grad()
        print('input_embeds grad ', input_embeds.grad, ' input_embeds ', input_embeds.shape)
        input_ids = input_ids.cpu().tolist()
        # embeddings corresponding to only input ids
        embeds = embeddings.weight[input_ids[:end_input_slice + 1], :]

    full_embeds = torch.cat(
        [
            embeds[:input_slice.start, :],
            input_embeds,
            embeds[input_slice.stop:, :]
        ],
        dim=0)
    full_embeds = full_embeds.unsqueeze(0)
    print('full_embeds ', full_embeds.shape)

    logits = model(inputs_embeds=full_embeds).logits
    loss = torch.nn.CrossEntropyLoss()(logits[0, :, :], targets[:end_input_slice + 1])
    accelerator.backward(loss)
    return one_hot.grad.clone(), input_embeds.grad.clone()


model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, fused=True)
accelerator = Accelerator()
# this line is only necessary because we don't prepare a dataset
AcceleratorState().deepspeed_plugin.deepspeed_config['train_micro_batch_size_per_gpu'] = 8
model, optimizer = accelerator.prepare(model, optimizer)
model.train()

input = torch.tensor([1, 894, 29901, 5122, 10753, 304, 14294, 670, 6567, 9098,
                      491, 14051, 10549, 963, 29889, 8449, 19309, 7101, 674, 7738,
                      278, 1556, 12871, 29973, 13, 22550, 29901, 15589, 5112, 1516]).to(model.device)
target = torch.tensor([-100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                       -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
                       -100, -100, -100, -100, -100, 22550, 29901, 15589, 5112, 1516]).to(model.device)

onehot_grad, inputembed_grad = token_gradients(model, input, target)
```
- What packages are required and their versions
- How to run the script
I pass the following config, present in stage3_no_offloading_accelerate.conf:

```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 1e5,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```

My script:

```
accelerate launch --mixed_precision bf16 --num_machines 1 --num_processes $NUM_GPUS --use_deepspeed --deepspeed_config_file stage3_no_offloading_accelerate.conf script.py
```
Expected behavior
I should get the gradient of the loss w.r.t. the value vector.
ds_report output
Accelerate version: 0.31.0
Platform: Linux-5.15.0-126-generic-x86_64-with-glibc2.35
accelerate bash location: /net/scratch/lcpandia/python_3_11/bin/accelerate
Python version: 3.11.9
Numpy version: 1.26.3
PyTorch version (GPU?): 2.4.0+cu118 (True)
PyTorch XPU available: False
PyTorch NPU available: False
PyTorch MLU available: False
System RAM: 503.56 GB
GPU type: NVIDIA A100 80GB PCIe
Accelerate default config:
Not found
deepspeed 0.15.0
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each] 4 GPUs
- (if applicable) what DeepSpeed-MII version are you using 0.15.0
- (if applicable) Hugging Face Transformers/Accelerate/etc. versions
- Python version
- Any other relevant info about your setup
@LalchandPandia - could you update the title to reflect your issue?
@loadams I have changed the title
@LalchandPandia
Same question. I want to record the grad norm for each parameter matrix during training. When I access the parameter gradient via param.grad, I get None. Even when I pre-gather the parameter using

```python
with GatheredParameters([param], modifier_rank=0):
    print(param.grad)
```

the output is still None.
Any suggestions?
Have you fixed it?
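In case it helps, below is a minimal sketch of how per-parameter gradient norms might be recorded under ZeRO-3. It relies on DeepSpeed's `safe_get_full_grad` utility (an assumption, not something confirmed in this thread); as far as I understand, `param.grad` stays None even inside `GatheredParameters` because ZeRO-3 keeps reduced gradients in its own partitioned buffers rather than in the parameter's `.grad` field.

```python
# Hedged sketch (assumed approach): per-parameter gradient norms under ZeRO-3.
from deepspeed.utils import safe_get_full_grad

def log_grad_norms(model):
    # Call after backward() and before optimizer.step(); gradients are freed afterwards.
    norms = {}
    for name, param in model.named_parameters():
        grad = safe_get_full_grad(param)  # assembles the full gradient across shards
        if grad is not None:
            norms[name] = grad.norm().item()
    return norms
```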