h2o-llmstudio
                        [BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 silently when using CPU for upload
🐛 Bug
Native bfloat16 model fine-tuned with bfloat16 gets pushed to HuggingFace as float16
To Reproduce
- Choose an HF model like Llama-3 whose weights are natively bfloat16
- Fine-tune it using a dtype of bfloat16
- Export it to HuggingFace
- Note that config.json lists the fine-tuned model's weights as float16, not bfloat16 as you'd expect (see the check sketched below)
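A quick way to confirm the exported dtype is to read torch_dtype from the pushed repo's config.json, e.g. with transformers (the repo id below is a placeholder):

```python
from transformers import AutoConfig

# Placeholder repo id; use the repo the experiment was pushed to.
config = AutoConfig.from_pretrained("your-org/llama-3-finetune")

# For a bfloat16 export this should report torch.bfloat16, not torch.float16.
print(config.torch_dtype)
```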
 
Could you please share a config to reproduce the issue on the default dataset?
A quick check showed bfloat16 for me when uploading a fine-tune of danube2 to HuggingFace.
A known limitation is uploading on CPU: in that case the weights are automatically converted to float16, as PyTorch bfloat16 isn't usually well supported on CPU.
Ah, that's exactly it then; I've been using the CPU to upload. I'll try using a GPU.
Thanks, I'll change the title of the issue to reflect that the conversion happens silently. We probably want to raise a warning.
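For illustration, a warning on the CPU upload path could look roughly like this (a hypothetical helper, not the current LLM Studio code):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def cast_for_cpu_upload(model: torch.nn.Module, device: str) -> torch.nn.Module:
    """Hypothetical helper that makes the bfloat16 -> float16 cast explicit."""
    if device == "cpu" and next(model.parameters()).dtype == torch.bfloat16:
        logger.warning(
            "Uploading on CPU: casting bfloat16 weights to float16. "
            "Select a GPU (or cpu_shard) to preserve bfloat16."
        )
        model = model.to(torch.float16)
    return model
```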
Actually @pascal-pfeiffer, I've found that unfortunately I don't have enough memory on any single GPU of an 8xA100 80GB cluster to push Llama-3 70B to HF using bfloat16. I get the OOM error below. Any ideas for a workaround, or a way this could be done multi-GPU?
```
INFO:     127.0.0.1:56582 - "POST / HTTP/1.1" 200 OK
2024-05-09 17:59:46,609 - INFO: Initializing client True
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:47,245 - INFO: Stop token ids: [tensor([  27,   91, 9125,   91,   29])]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:48,122 - INFO: Stop token ids: [tensor([  27,   91, 9125,   91,   29], device='cuda:0')]
2024-05-09 17:59:48,137 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-05-09 17:59:48,137 - INFO: Setting pretraining_tp of model config to 1.
2024-05-09 17:59:48,159 - INFO: Using bfloat16 for backbone
2024-05-09 17:59:48,159 - INFO: Using Flash Attention 2.
2024-05-09 17:59:48,379 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/handlers.py", line 337, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/experiment.py", line 1829, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/hugging_face_utils.py", line 108, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/chat.py", line 219, in load_cfg_model_tokenizer
    model = cfg.architecture.model_class(cfg)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/models/text_causal_language_modeling_model.py", line 32, in __init__
    self.backbone, self.backbone_config = create_nlp_backbone(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/utils/modeling_utils.py", line 804, in create_nlp_backbone
    backbone = model_class.from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 437, in from_config
    return model_class._from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1401, in _from_config
    model = cls(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1135, in __init__
    self.model = LlamaModel(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in <listcomp>
```
Right, for very large models that don't fit on a single GPU, we added a workaround that loads the full weights to CPU first and then shards across your GPUs before uploading. Can you try uploading the weights with cpu_shard in the device selection?
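For reference, the general pattern for loading a model that is too large for one GPU while keeping bfloat16 is to shard it across devices; a rough sketch using plain transformers/accelerate (not the exact cpu_shard code path in LLM Studio; the paths and repo id are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path; in LLM Studio this would be the experiment's exported checkpoint.
checkpoint = "path/to/finetuned-llama-3-70b"

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # keep the native bfloat16 weights
    device_map="auto",           # shard layers across all visible GPUs (requires accelerate)
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Placeholder repo id for the upload target.
model.push_to_hub("your-org/llama-3-70b-finetune")
tokenizer.push_to_hub("your-org/llama-3-70b-finetune")
```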
And actually, I just tried removing our forced cast to float32 and back to float16 when using CPU. It might no longer be needed with recent dependency upgrades.
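A quick standalone check of whether the installed PyTorch build handles bfloat16 on CPU (not LLM Studio code):

```python
import torch

# Recent PyTorch builds support basic bfloat16 ops on CPU; older ones may not.
x = torch.randn(4, 4, dtype=torch.bfloat16)
try:
    (x @ x).sum()
    print("bfloat16 matmul works on CPU with this PyTorch build")
except RuntimeError as err:
    print(f"bfloat16 not usable on CPU here: {err}")
```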
We should at least improve the description here to reflect everything that is done under the hood.
Ah, I didn't realize that's what cpu_shard did. It sounds like it will support bfloat16 then?
Yes, cpu_shard supports bfloat16.
Confirmed I can export to HF with bfloat16 when using the cpu_shard setting.