Issue of mixed-precision training

Bilibilee opened this issue 4 months ago • 0 comments

In the scripts script/MLLMSD_7b.sh and script/SmartEdit_7b.sh, you specify --bf16 True, yet the corresponding DeepSpeed configuration in scripts/zero_mixed.json seems to be missing the entry "bf16": {"enabled": "auto"}. As a result, the --bf16 True flag does not appear to take effect. I would like to confirm whether this is a mistake or intentional.
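
As far as I understand the HuggingFace Trainer + DeepSpeed integration, the flag is only propagated when the config file contains a matching "bf16" section. A minimal check (my own sketch, assuming the config path used in the scripts; this code is not from the repo):

```python
import json

# Load the DeepSpeed config referenced by the training scripts.
with open("scripts/zero_mixed.json") as f:
    ds_config = json.load(f)

# With the HF Trainer integration, --bf16 True fills in the "auto" values of
# an existing "bf16" section; if the section is absent, DeepSpeed falls back
# to its default (bf16 disabled), so the flag would effectively be ignored.
if "bf16" not in ds_config:
    print('no "bf16" section found -> --bf16 True is likely not forwarded')
else:
    print("bf16 section:", ds_config["bf16"])
```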

Additionally, when training the MLLMSD 7b model, the logs report the following data types, which shows that some parts of the model are kept in float32 while others use bfloat16 (a generic way to reproduce such a report is sketched after the list).

1. model.vision_tower.dtype: torch.float32
2. model.mm_projector.dtype: torch.float32
3.1. model.model.model(LLaMA).embed_tokens.dtype: torch.float32
3.2. model.model.model(LLaMA).dtype: torch.bfloat16 torch.bfloat16
3.3. model.lm_head.dtype: torch.float32
4.1. model.sd_query_tokens.dtype: torch.float32
4.2. model.sd_qformer.dtype: torch.float32
5.1. model.vae.dtype: torch.bfloat16
5.2. model.unet.dtype: torch.float32

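For reference, a per-module dtype report like the one above can be produced with a small helper such as the following (my own sketch, not code from the SmartEdit repository; it deliberately avoids hard-coding attribute paths such as model.vision_tower or model.sd_qformer):

```python
from collections import Counter

import torch


def summarize_dtypes(model: torch.nn.Module) -> None:
    """Print the parameter-dtype breakdown of each top-level submodule."""
    for name, child in model.named_children():
        counts = Counter(p.dtype for p in child.parameters())
        if counts:  # skip parameter-less modules
            print(f"{name}: " + ", ".join(f"{n}x {dt}" for dt, n in counts.items()))
```
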
I can accept that the LLM uses torch.bfloat16. However, I am curious why the VAE, which has a relatively small number of parameters, is also set to torch.bfloat16. Is there a specific reason for this choice?

Bilibilee · Oct 01 '24 06:10