[BUG] HuggingFace export does not preserve bfloat16 weights but converts to float16 silently when using CPU for upload

Open tmostak opened this issue 1 year ago • 9 comments

🐛 Bug

Native bfloat16 model fine-tuned with bfloat16 gets pushed to HuggingFace as float16

To Reproduce

  1. Choose a HF model like Llama-3 with weights natively as bfloat16
  2. Fine-tune it using dtype of bfloat16
  3. Export it to HuggingFace
  4. Note that the config.json of the fine-tuned model specifies the weights as float16, not bfloat16 as you'd expect

tmostak avatar May 09 '24 17:05 tmostak
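
Step 4 of the reproduction can be checked without opening the file by hand. The snippet below is a minimal sketch, assuming the exported `config.json` records the dtype under the `torch_dtype` key (the key name Transformers used at the time of this issue). The helper name `exported_dtype` and the stand-in config contents are illustrative, not part of H2O LLM Studio.

```python
import json
import tempfile
from pathlib import Path


def exported_dtype(config_path):
    """Return the `torch_dtype` recorded in a Hugging Face config.json, or None."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("torch_dtype")


# Demo with a stand-in config.json; point this at the real exported file instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"model_type": "llama", "torch_dtype": "float16"}),
        encoding="utf-8",
    )
    demo_dtype = exported_dtype(demo_path)

if demo_dtype != "bfloat16":
    print(f"Export recorded {demo_dtype!r}, not 'bfloat16'")
```

Running this against the `config.json` downloaded from the pushed repository shows which dtype the export recorded.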

Could you please share a config that reproduces the issue on the default dataset? A quick check showed bfloat16 for me when uploading a fine-tune of danube2 to Hugging Face: [screenshot]

A known limitation is uploading with CPU as the device. In that case the weights are automatically converted to float16, as PyTorch bfloat16 isn't usually supported on CPU.

pascal-pfeiffer avatar May 09 '24 17:05 pascal-pfeiffer

Ah that's it exactly then, I've been using CPU to upload. Will try using GPU.

tmostak avatar May 09 '24 17:05 tmostak

Thanks, I'll change the topic of the issue to reflect that the conversion is done silently. We probably want to raise a warning.

pascal-pfeiffer avatar May 09 '24 17:05 pascal-pfeiffer
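
The warning proposed above could look roughly like the sketch below. This is not the LLM Studio implementation. `resolve_export_dtype` is a hypothetical helper, and the dtype and device values are plain strings chosen only to mirror the behavior described in this thread (a plain `cpu` upload falls back to float16, other device choices keep the trained dtype).

```python
import warnings


def resolve_export_dtype(trained_dtype, device):
    """Hypothetical helper: choose the export dtype, warning on any downcast.

    Assumption taken from this thread: a plain "cpu" upload converts
    bfloat16 to float16, while other device choices keep the trained dtype.
    """
    if device == "cpu" and trained_dtype == "bfloat16":
        warnings.warn(
            "Uploading via CPU converts bfloat16 weights to float16; "
            "choose a GPU or cpu_shard to keep bfloat16.",
            UserWarning,
            stacklevel=2,
        )
        return "float16"
    return trained_dtype
```

With a check like this in the export path, the conversion would still happen but would no longer be silent.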

Actually @pascal-pfeiffer, I've found that unfortunately no single GPU on my 8x A100 80GB cluster has enough memory to push Llama-3 70B to HF using bfloat16. I get the following OOM error. Any ideas for a workaround, or a way this could be done multi-GPU?

INFO: 127.0.0.1:56582 - "POST / HTTP/1.1" 200 OK
2024-05-09 17:59:46,609 - INFO: Initializing client True
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:47,245 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29])]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-09 17:59:48,122 - INFO: Stop token ids: [tensor([ 27, 91, 9125, 91, 29], device='cuda:0')]
2024-05-09 17:59:48,137 - WARNING: PAD token id not matching between config and tokenizer. Overwriting with tokenizer id.
2024-05-09 17:59:48,137 - INFO: Setting pretraining_tp of model config to 1.
2024-05-09 17:59:48,159 - INFO: Using bfloat16 for backbone
2024-05-09 17:59:48,159 - INFO: Using Flash Attention 2.
2024-05-09 17:59:48,379 - ERROR: Unknown exception
Traceback (most recent call last):
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/handlers.py", line 337, in handle
    await experiment_push_to_huggingface_dialog(q)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/experiment.py", line 1829, in experiment_push_to_huggingface_dialog
    publish_model_to_hugging_face(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/hugging_face_utils.py", line 108, in publish_model_to_hugging_face
    cfg, model, tokenizer = load_cfg_model_tokenizer(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/app_utils/sections/chat.py", line 219, in load_cfg_model_tokenizer
    model = cfg.architecture.model_class(cfg)
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/models/text_causal_language_modeling_model.py", line 32, in __init__
    self.backbone, self.backbone_config = create_nlp_backbone(
  File "/home/ubuntu/h2o_llm_2024_05_04/./llm_studio/src/utils/modeling_utils.py", line 804, in create_nlp_backbone
    backbone = model_class.from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 437, in from_config
    return model_class._from_config(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1401, in _from_config
    model = cls(config, **kwargs)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1135, in __init__
    self.model = LlamaModel(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in __init__
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 927, in <listcomp>
    [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 702, in __init__
    self.mlp = LlamaMLP(config)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 219, in __init__
    self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 98, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
  File "/home/ubuntu/miniconda3/envs/h2o_llm_2024_05_04/lib/python3.10/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 433.31 MiB is free. Including non-PyTorch memory, this process has 78.71 GiB memory in use. Of the allocated memory 78.21 GiB is allocated by PyTorch, and 28.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

tmostak avatar May 09 '24 18:05 tmostak
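
A back-of-the-envelope calculation shows why the OOM above is expected on a single GPU. The sketch assumes a nominal 70e9 parameters for the "70B" model (the exact count is not stated in this thread) and uses the 79.15 GiB capacity reported in the error message. It counts weights only and ignores activations and framework overhead.

```python
import math

GIB = 1024 ** 3

n_params = 70e9            # assumed nominal parameter count for a "70B" model
bytes_per_param = 2        # bfloat16 and float16 both take 2 bytes per value
gpu_capacity_gib = 79.15   # total capacity reported in the OOM message

weights_gib = n_params * bytes_per_param / GIB
min_gpus = math.ceil(weights_gib / gpu_capacity_gib)

print(f"weights alone: {weights_gib:.1f} GiB")
print(f"GPUs needed for the weights alone: {min_gpus}")
```

Under these assumptions the weights alone come to roughly 130 GiB, so they cannot fit on one 80 GB card in any 16-bit dtype and have to be spread over at least two devices.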

Right, for very large models that don't fit on a single GPU, we added a workaround that loads the full weights to CPU first and then shards across your GPUs before uploading. Can you try uploading the weights with cpu_shard in the device selection?

pascal-pfeiffer avatar May 09 '24 18:05 pascal-pfeiffer
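
To illustrate the idea of sharding across GPUs, here is a toy placement routine in plain Python. It is only a sketch of the general technique, not how `cpu_shard` is implemented. The function `shard_layers`, the 80 equally sized layers, the 1.6 GiB per-layer size, and the 70 GiB per-GPU budget are all assumptions made for the example.

```python
def shard_layers(layer_sizes_gib, n_gpus, budget_gib):
    """Greedily assign consecutive layers to GPUs without exceeding budget_gib.

    Returns a dict mapping layer index -> GPU index.
    Raises MemoryError if the layers do not fit on n_gpus devices.
    """
    placement = {}
    gpu, used = 0, 0.0
    for idx, size in enumerate(layer_sizes_gib):
        if size > budget_gib:
            raise MemoryError(f"layer {idx} alone exceeds the per-GPU budget")
        if used + size > budget_gib:
            # Current GPU is full: move on to the next one.
            gpu, used = gpu + 1, 0.0
            if gpu >= n_gpus:
                raise MemoryError("model does not fit on the available GPUs")
        placement[idx] = gpu
        used += size
    return placement


# Assumed toy numbers: 80 layers of 1.6 GiB each, 8 GPUs, 70 GiB usable per GPU.
placement = shard_layers([1.6] * 80, n_gpus=8, budget_gib=70.0)
gpus_used = sorted(set(placement.values()))
print(f"GPUs used: {gpus_used}")
```

Keeping consecutive layers on the same device means activations cross a GPU boundary only where one shard ends and the next begins.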

And actually, I just tried removing our forced cast to float32 and back to float16 when using CPU. It might no longer be needed with recent dependency upgrades.

We should at least improve the description here to reflect everything that is done under the hood. [screenshot]

pascal-pfeiffer avatar May 09 '24 18:05 pascal-pfeiffer

Ah I didn't realize that's what cpu_shard did. It sounds like it will support bfloat16 then?

tmostak avatar May 09 '24 21:05 tmostak

Yes, cpu_shard supports bfloat16.

pascal-pfeiffer avatar May 10 '24 06:05 pascal-pfeiffer

Confirmed I can export to HF with bfloat16 when using the cpu_shard setting.

tmostak avatar May 12 '24 22:05 tmostak