Adapter-transformers & DeepSpeed: how to get fp32 weights reconstruction?
Environment info
- adapter-transformers version: 2.0.1
- DeepSpeed version: 0.4.0
Details
Hugging Face allows the use of DeepSpeed to accelerate the training of a model on one or more GPUs (read DeepSpeed Integration).
DeepSpeed can be used from a notebook or from the command line (a *.py script). For example, here is an example from Hugging Face: transformers + deepspeed CLI.
Instead of using the example notebooks or scripts from Hugging Face transformers, it is possible to use the updated scripts from adapter-transformers.
Great... but there is a problem when using mixed precision training with float16 under DeepSpeed: the resulting *.bin files (model and adapter) are not saved in fp32.
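One way to confirm in which precision a checkpoint was written is to load it with torch and inspect the tensor dtypes. Below is a minimal sketch; the tiny state dict saved here is only a stand-in for a real pytorch_model.bin or pytorch_adapter.bin:

```python
import tempfile

import torch

def checkpoint_dtypes(path):
    """Return the set of dtypes used by the tensors in a saved state dict."""
    state_dict = torch.load(path, map_location="cpu")
    return {str(t.dtype) for t in state_dict.values()}

# Stand-in for a real *.bin file: save a small fp16 state dict and inspect it.
with tempfile.NamedTemporaryFile(suffix=".bin") as f:
    torch.save({"layer.weight": torch.zeros(8, 4, dtype=torch.float16)}, f.name)
    print(checkpoint_dtypes(f.name))  # → {'torch.float16'}
```

Running this on the files produced by a mixed-precision run should show torch.float16 rather than torch.float32.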
In the case of a DeepSpeed training run with a Hugging Face transformers script, it is possible, thanks to the zero_to_fp32.py script (check the links at the bottom of this message), to get an fp32 reconstruction of the pytorch_model.bin weights.
However, how can this be done after a DeepSpeed training run with an adapter-transformers script (with mixed precision training using float16), for the pytorch_adapter.bin and pytorch_model_head.bin files?
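For reference, for a plain transformers run the reconstruction step looks roughly like this (a sketch; the checkpoint directory name is hypothetical and depends on your run, and the exact arguments may differ between DeepSpeed versions):

```shell
# DeepSpeed copies zero_to_fp32.py into the checkpoint directory it writes.
cd output_dir/checkpoint-1000                # hypothetical checkpoint directory
python zero_to_fp32.py . pytorch_model.bin   # reconstruct fp32 weights from the ZeRO shards
```

The open question is what the equivalent of this step is for the adapter and head files.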
Note: I ran the run_mlm.py script updated by adapter-transformers under DeepSpeed (with mixed precision training using float16), with a change at line 490 of the Trainer so that the full model is saved at the end of training (new code: do_save_full_model=adapter_args.train_adapter). As expected, the pytorch_adapter.bin size was half of its fp32 value. Then I ran zero_to_fp32.py as explained above and in the listed links, loaded my saved model with the following code, and checked whether the embedding and layer weights were unchanged (as expected): but they had changed.
from transformers import BertForMaskedLM, AutoTokenizer
model = BertForMaskedLM.from_pretrained(str(path_to_awesome_name_you_picked))
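To make the "weights have changed" check explicit, the two state dicts can be diffed tensor by tensor. This is a minimal sketch using plain torch tensors; the key name and values are illustrative, and in practice the inputs would come from model.state_dict() before and after the round trip:

```python
import torch

def changed_keys(sd_before, sd_after, atol=1e-6):
    """Return the keys whose tensors differ between two state dicts."""
    return [
        key
        for key, before in sd_before.items()
        if not torch.allclose(before.float(), sd_after[key].float(), atol=atol)
    ]

# Illustrative stand-ins for the pretrained and reconstructed checkpoints.
sd_before = {"bert.embeddings.word_embeddings.weight": torch.ones(4, 8)}
sd_after = {"bert.embeddings.word_embeddings.weight": torch.full((4, 8), 1.5)}
print(changed_keys(sd_before, sd_after))  # → ['bert.embeddings.word_embeddings.weight']
```

An empty list would mean the frozen weights survived the save/reconstruct round trip intact.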
Furthermore, I checked the contents of the adapter folder, which were as follows:
adapter_config.json -- 632 B
head_config.json -- 231 B
pytorch_adapter.bin -- 232 MB
pytorch_model_head.bin -- 232 MB
More than 230 MB for each bin file (the trained model was a BERT base)? I expected a smaller size with fp16 weights... strange. For a training run without DeepSpeed, the sizes of these bin files are as follows:
pytorch_adapter.bin -- 29.6 MB
pytorch_model_head.bin -- 94 MB
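A rough size estimate (parameter count × bytes per parameter) suggests what the 232 MB files may actually contain. The parameter counts below are assumptions for illustration, not measured values:

```python
def approx_size_mb(n_params, bytes_per_param):
    """Rough checkpoint size in MB: parameter count times bytes per parameter."""
    return n_params * bytes_per_param / 1e6

BERT_BASE_PARAMS = 110_000_000  # ~110M, the commonly cited figure for BERT base
ADAPTER_PARAMS = 7_400_000      # assumption, back-computed from the 29.6 MB fp32 file

print(approx_size_mb(ADAPTER_PARAMS, 4))    # 29.6  -> matches the fp32 adapter file
print(approx_size_mb(ADAPTER_PARAMS, 2))    # 14.8  -> what an fp16 adapter-only file would weigh
print(approx_size_mb(BERT_BASE_PARAMS, 2))  # 220.0 -> a full BERT base in fp16, close to 232 MB
```

So one plausible reading is that each 232 MB file holds something close to the full model in fp16, rather than only the adapter or head weights.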
Links to read about fp32 weights reconstruction: