Failed to run fine-tuning (freezing some layers) of hf model with pippy
I'm trying to fine-tune a RoBERTa language model while freezing the first few encoder layers. The code is very similar to the run_mlm.py example from https://github.com/pytorch/PiPPy/tree/main/examples/hf/language-modeling, but I get an error in the gradient-synchronization step after the first forward pass. See details below:
Env
Ubuntu 20.04.4 LTS
Python 3.8.10
transformers 4.32.0
torch 2.0.1+cu117
The differences are:
- Using native PyTorch training instead of `pippy.hf.PiPPyTrainer` or `accelerate`
- Manually choosing the split points with `annotate_split_points`, so that every stage gets trainable parameters (see the code below)
- Other minor modifications needed to run the complete `pippy` variant: manually reading the rank/world-size info that `torchrun` provides via environment variables, instantiating the optimizer and LR scheduler myself, removing the explicit `backward()` call, and others (a rough sketch follows after the splitting code below)
```python
# In torchpippy 0.1.1 these helpers live in pippy.IR
from pippy.IR import PipeSplitWrapper, annotate_split_points

# Keep only the last four encoder layers and the LM head trainable
layers = (
    "roberta.encoder.layer.8",
    "roberta.encoder.layer.9",
    "roberta.encoder.layer.10",
    "roberta.encoder.layer.11",
    "lm_head",
)
# `model` is the RobertaForMaskedLM built earlier in the script
for name, param in model.named_parameters():
    if name.startswith(layers):
        param.requires_grad = True
    else:
        param.requires_grad = False

# Split so that every pipeline stage contains at least one trainable layer
annotate_split_points(
    model,
    {
        "roberta.encoder.layer.9": PipeSplitWrapper.SplitPoint.BEGINNING,
        "roberta.encoder.layer.10": PipeSplitWrapper.SplitPoint.BEGINNING,
        "roberta.encoder.layer.11": PipeSplitWrapper.SplitPoint.BEGINNING,
    },
)
```
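For context, here is a rough sketch of the "minor modifications" mentioned above: reading the process info that `torchrun` exports via environment variables and building the optimizer/scheduler only over the trainable parameters. This is a simplified illustration (the learning rate and scheduler are placeholders, and the exact wiring into the PiPPy driver is omitted), not the literal code from my script:

```python
import os
import torch

# torchrun exports these for every spawned process
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])

# Optimizer and LR scheduler over the unfrozen parameters only
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=5e-5)
lr_scheduler = torch.optim.lr_scheduler.LinearLR(optimizer)
```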
Problem [see error message below]
Unfortunately, after freezing the layers I hit a problem during gradient synchronization (in `_sync_replicated_params`). If I remove the freezing, the script launches successfully and training proceeds.
```
Traceback (most recent call last):
  File "lm_no_trainer_pippy_ddp.py", line 805, in <module>
    run_pippy(run_master, args)
  File "/home/ubuntu/envs/pippy/lib/python3.8/site-packages/torchpippy-0.1.1-py3.8.egg/pippy/utils.py", line 155, in run_pippy
    run_worker(args.rank, run_func, args, *extra_args)
  File "/home/ubuntu/envs/pippy/lib/python3.8/site-packages/torchpippy-0.1.1-py3.8.egg/pippy/utils.py", line 270, in run_worker
    run_func(my_pp_ranks, args, *extra_args)
  File "lm_no_trainer_pippy_ddp.py", line 718, in run_master
    outputs = pipe_driver(**batch)
  File "/home/ubuntu/envs/pippy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/envs/pippy/lib/python3.8/site-packages/torchpippy-0.1.1-py3.8.egg/pippy/PipelineDriver.py", line 2185, in forward
    self._sync_replicated_params()
  File "/home/ubuntu/envs/pippy/lib/python3.8/site-packages/torchpippy-0.1.1-py3.8.egg/pippy/PipelineDriver.py", line 1602, in _sync_replicated_params
    synced_value = torch.sum(torch.stack(grad_values), dim=0)
TypeError: expected Tensor as element 0 in argument 0, but got NoneType
```
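My guess is that the frozen parameters simply never receive a `.grad`, so `None` values end up in `grad_values` when the driver tries to stack them. A minimal plain-PyTorch sketch (no PiPPy involved) of the behaviour I assume is the cause:

```python
import torch

lin = torch.nn.Linear(4, 4)
lin.weight.requires_grad = False   # "freeze" the weight, keep the bias trainable

loss = lin(torch.randn(2, 4)).sum()
loss.backward()

print(lin.bias.grad)     # a tensor -- trainable params get gradients
print(lin.weight.grad)   # None -- frozen params never do
```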
So, if someone can explain what I'm doing wrong, or show an example of how to do fine-tuning correctly, I would be very grateful.
Unfortunately, I didn't find any examples with frozen layers in the repository, so I think it would be useful to add such examples too.