Adapter support for GPTNeoX
I have implemented an adapter for GPTNeoX following the instructions in the documentation. It passed all tests, but during training of the language adapter, the prediction head was trained as well. Do you by chance have an idea why this is happening? Should I open a PR?
Hey @ajesujoba, this sounds great, would be awesome to have GPTNeoX support integrated into the library, so feel free to do a PR!
Regarding your question on language adapter training, could you add some more context on what you observed and which behavior you expected (ideally with some code snippet). Thank you!
Thanks for your response @calpt. I have made a PR. I was trying to train a German language adapter with the implemented GPTNeoX model using the script below:
LANG="de"
python run_clm.py \
--model_name_or_path EleutherAI/pythia-70m \
--train_file $DATADIR/train.txt \
--validation_file $DATADIR/dev.txt \
--output_dir $OUTDIR/$LANG \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 1 \
--learning_rate 5e-5 \
--max_steps 500 \
--num_train_epochs 25 \
--save_steps 10000000 \
--overwrite_output_dir \
--train_adapter \
--adapter_config pfeiffer+inv \
--evaluation_strategy steps \
--eval_steps 1000000 \
--load_best_model_at_end \
--save_total_limit 1
The script ran successfully, but instead of training just the adapters, it trained both the adapter modules and the CLM head. The total number of trainable parameters was therefore 26087360 instead of just 331712:
[INFO|trainer.py:1650] 2023-03-21 18:13:53,209 >> ***** Running training *****
[INFO|trainer.py:1651] 2023-03-21 18:13:53,209 >> Num examples = 386
[INFO|trainer.py:1652] 2023-03-21 18:13:53,209 >> Num Epochs = 20
[INFO|trainer.py:1653] 2023-03-21 18:13:53,209 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1654] 2023-03-21 18:13:53,209 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1655] 2023-03-21 18:13:53,209 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1656] 2023-03-21 18:13:53,209 >> Total optimization steps = 500
[INFO|trainer.py:1657] 2023-03-21 18:13:53,210 >> Number of trainable parameters = 26087360
I was able to manually freeze the CLM head using model.embed_out.requires_grad_(False) in run_clm.py, but this is not the expected behavior. Kindly let me know if I need to provide more context.
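For reference, this is roughly what I added in run_clm.py after the adapter setup (a simplified sketch of my local change; embed_out is the GPT-NeoX output projection layer):

# Workaround in run_clm.py: explicitly freeze the untied GPT-NeoX output
# projection so that only the adapter weights stay trainable.
model.embed_out.requires_grad_(False)

# Sanity check: recount the trainable parameters.
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters after freezing embed_out: {num_trainable}")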
Thanks for providing the additional context (and of course thanks for opening the PR!).
After looking into it a bit deeper, the cause of this behavior seems to be that GPT-NeoX does not tie the weights of the input and output projection layers. By default, adapter-transformers only freezes the weights of the base model, excluding the weights of any prediction head (as you usually want to fine-tune that along with the adapter). Thus, for LM heads, freezing the output projection relies on the fact that most models supported so far share these weights with the input projection (which is part of the base model and therefore frozen).
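(Just to illustrate the difference, a quick check along these lines shows that Pythia/GPT-NeoX keeps the two matrices separate, whereas a tied model like gpt2 shares them:)

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
# embed_in (input embeddings) and embed_out (LM head) are separate parameters
# for GPT-NeoX, so this prints False; for a weight-tied model such as gpt2 it prints True.
print(model.get_input_embeddings().weight is model.get_output_embeddings().weight)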
To ensure the expected behavior also for GPT-NeoX, we'd probably need to freeze the output projection manually somewhere in the code. Maybe adding it to the freeze_model() method in the model mixin via self.get_output_embeddings() would work.
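Roughly, I was thinking of something along these lines (just a sketch, not a tested change; the existing freeze_model() implementation and attribute names may differ):

# Sketch of an extended freeze_model() in the model mixin (names assumed):
def freeze_model(self, freeze=True):
    """Freezes all weights of the model, including an untied output projection."""
    for param in self.base_model.parameters():
        param.requires_grad = not freeze
    # For models like GPT-NeoX, the LM head is not tied to the input embeddings
    # and is therefore not covered by the loop above, so freeze it explicitly.
    output_embeddings = self.get_output_embeddings()
    if output_embeddings is not None:
        output_embeddings.requires_grad_(not freeze)
    self.model_frozen = freeze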
Hi @calpt, thanks for your feedback. I thought as much; I also noticed that they did not tie the weights of the input and output projection layers. Yes, I agree that freezing the prediction head somewhere else, such as within freeze_model(), would be the best option. I guess this would be done on your end, right?
You can directly integrate a fix for this into your PR with the model integration if you like. Otherwise, I could also add it independently.
Checking again, it appears it is not feasible to fix this within freeze_model(): self within model_mixin.py is the base_model without any prediction head (because the embeddings are not tied), though I may be wrong. So I guess a possible place to fix this would be within training.py. If it is fine with you, please add it independently so that I don't break a lot of things.
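(Roughly what I had in mind for training.py — purely an illustrative sketch; exactly where adapter training is activated there and the variable names used are assumptions on my side:)

# Illustrative sketch: right after adapter training is activated, explicitly
# freeze an untied LM head so it is not updated together with the adapter weights.
model.train_adapter(adapter_name)  # existing call that freezes the base model
output_embeddings = model.get_output_embeddings()
if output_embeddings is not None and not getattr(model.config, "tie_word_embeddings", True):
    output_embeddings.requires_grad_(False)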