llama-recipes
Questions about dtype of model weights.
System Info
PyTorch version: 2.0.1
transformers version: 4.31.0
OS: Ubuntu 16.04.7 LTS (x86_64)
Python version: 3.10.11 (main, Apr 20 2023, 19:02:41) [GCC 11.2.0] (64-bit runtime)
Is CUDA available: True
Information
- [ ] The official example scripts
- [X] My own modified scripts
🐛 Describe the bug
When running the following code:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained(
    "/disk/Llama-2-7b-hf",
    device_map='auto',
)
print(model.config.torch_dtype)
```

the printed result is `torch.float16`, but the model weights are actually loaded as `torch.float32`, which can be verified with the following code:

```python
for name, param in model.named_parameters():
    print(name, param.dtype)
```
The GPU memory usage (about 28 GB in total) also indicates that those dtypes are float32. Is this normal and expected behavior, i.e., is Llama 2 loaded in float32 by default? I believe it's a bug, since `torch.float16` is the value of the `torch_dtype` field in the Llama 2 config file.
The above issue leads me to another question: should we use float32 or float16 Llama 2 for fine-tuning and inference? Which is the standard practice?
To load a Llama model in float16, we need to manually set `torch_dtype=torch.float16` in the `from_pretrained` method. But it's not always set that way in the code of this official repo, and `torch_dtype` is neglected. For example, the code does not set `torch_dtype`, so a float32 model will be loaded rather than float16. Is this a bug, or do we indeed want this model to be float32?
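For concreteness, here is a minimal sketch of what I mean (same local checkpoint path as above; only the `torch_dtype` argument is added):

```python
import torch
from transformers import LlamaForCausalLM

# Without torch_dtype, from_pretrained upcasts the checkpoint to float32.
# Passing torch_dtype explicitly keeps the weights in half precision.
model = LlamaForCausalLM.from_pretrained(
    "/disk/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
print(next(model.parameters()).dtype)  # torch.float16
```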
Error logs
see bug description.
Expected behavior
see bug description.
@ParadoxZW thanks for highlighting this, I assume the issue with `print(model.config.torch_dtype)` might be some legacy code; I will follow up on that front.
> The above issue leads me to another question: should we use float32 or float16 Llama 2 for fine-tuning and inference? Which is the standard practice?
It would depend on your specific case: dataset, setting, hardware, etc. But overall our experience, especially on the FSDP side, has been more successful with BF16, as the scaling-factor calculation in fp16 for LLMs can sometimes be challenging.
> To load a Llama model in float16, we need to manually set `torch_dtype=torch.float16` in the `from_pretrained` method. But it's not always set that way in the code of this official repo, and `torch_dtype` is neglected.
In case you are running on a single GPU and using quantization, it moves your model automatically to FP16. In the case of FSDP, we would lean toward BF16 with `pure_bf16`. We will add the option for FP16 loading as well.
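As a rough sketch of the BF16 path (this illustrates the general `transformers` call, not the exact recipe code; the model id is just an example):

```python
import torch
from transformers import LlamaForCausalLM

# Load the weights directly in bfloat16 (assuming the GPUs support BF16,
# e.g. A100/H100); this matches a pure_bf16-style FSDP setup.
model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)
```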
Thanks for your reply. But I'm not sure if I've made my point clear. And, sorry, I still have some questions.
> I assume the issue with `print(model.config.torch_dtype)` might be some legacy code
It's not just an issue with `model.config.torch_dtype`. The core problem is that, using the default setting `LlamaForCausalLM.from_pretrained("./Llama-2-7b-hf", device_map='auto')`, we actually get an fp32 model. But the config file says `"torch_dtype": "float16"`. So which one is correct? What dtype should we expect in the first place?
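As far as I understand (this is my reading of the general `transformers` behavior, not of this repo's code), `from_pretrained` defaults to `torch.float32` unless `torch_dtype` is passed, and `torch_dtype="auto"` makes it follow the value recorded in the config instead:

```python
from transformers import LlamaForCausalLM

# "auto" tells transformers to use the torch_dtype stored in config.json
# (float16 for this checkpoint) instead of the float32 default.
model = LlamaForCausalLM.from_pretrained(
    "./Llama-2-7b-hf",
    torch_dtype="auto",
    device_map="auto",
)
print(next(model.parameters()).dtype)  # torch.float16 for this config
```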
> But overall our experience, especially on the FSDP side, has been more successful with BF16, as the scaling-factor calculation in fp16 for LLMs can sometimes be challenging.
I've noticed that you haven't mentioned fp32. Does that mean we will not use fp32?
> In case you are running on a single GPU and using quantization, it moves your model automatically to FP16. In the case of FSDP, we would lean toward BF16 with `pure_bf16`.
If I'm running on multiple GPUs and not using quantization, what dtype should be used?
> In case you are running on a single GPU and using quantization, it moves your model automatically to FP16. In the case of FSDP, we would lean toward BF16 with `pure_bf16`. We will add the option for FP16 loading as well.
By using the `--quantization` option in inference.py or chat_completion.py, each parameter in the model should be int8, i.e., 8 bits (1 byte), rather than FP16, since it triggers the `load_in_8bit` option in `transformers.LlamaForCausalLM.from_pretrained`. Am I right?
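For example, I believe the quantized path is roughly equivalent to something like the following (a sketch based on the general `transformers`/`bitsandbytes` integration, not the exact repo code; the local path is a placeholder):

```python
from transformers import LlamaForCausalLM

# 8-bit loading via bitsandbytes: the large linear-layer weights are stored
# as int8 (1 byte each), while some modules (e.g. norms) stay in higher precision.
model = LlamaForCausalLM.from_pretrained(
    "/disk/Llama-2-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)

for name, param in model.named_parameters():
    print(name, param.dtype)  # mostly torch.int8 for the big weight matrices
```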
I kind of have a similar question: if I only want to use the model for inference, should I use 16 bits (bf16) or single-precision float (float32)? I assume that although the initial weights can only be as precise as bf16 or fp16, by using float32, through only forward propagation, the results could be more precise than keeping the parameters in 16 bits?
Can I also assume that during the original training of Llama 2 the model's dtype was bf16, but when dumping the model to disk, for compatibility with older GPUs, the dtype was converted to fp16? Or rather, for what reason was the dtype changed to fp16 when the model was saved or converted to Hugging Face format, if bf16 is the most suitable one for training?
When I use PyTorch to load the original model's `.pth` file into "cpu" or "gpu" memory, the dtypes of the parameters all seem to be bf16. So, when using `transformers/models/llama/convert_llama_weights_to_hf.py`, which converts the parameters' dtype from bf16 to fp16, would there be a sudden precision loss?
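A small sketch of what I mean by precision loss when casting (plain PyTorch, independent of the conversion script):

```python
import torch

# fp16 actually has more mantissa bits than bf16 (10 vs 7), so in-range values
# survive a bf16 -> fp16 cast exactly; the risk is the much smaller fp16
# exponent range: large magnitudes overflow to inf and tiny ones flush to 0.
x_bf16 = torch.tensor([1e-3, 3.14159, 7e4, 1e-40], dtype=torch.bfloat16)
x_fp16 = x_bf16.to(torch.float16)

print(x_bf16)  # bf16 keeps the rough magnitude of every entry
print(x_fp16)  # 7e4 overflows to inf, 1e-40 flushes to 0 in fp16
```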
> So, when using `transformers/models/llama/convert_llama_weights_to_hf.py`, which converts the parameters' dtype from bf16 to fp16, would there be a sudden precision loss?
Well, I updated the transformers code to the latest version, and in the config JSON file I now get the dtype as bfloat16 instead of float16. So I think it IS something to do with the legacy code.
> In case you are running on a single GPU and using quantization, it moves your model automatically to FP16.
I think I get the point of why you said "when doing quantization, it moves your model automatically to FP16", as I got this warning from bitsandbytes:

```
bitsandbytes/autograd/_functions.py:321: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
```
> The core problem is that, using the default setting `LlamaForCausalLM.from_pretrained("./Llama-2-7b-hf", device_map='auto')`, we actually get an fp32 model. But the config file says `"torch_dtype": "float16"`. So which one is correct? What dtype should we expect in the first place?
Hey @ParadoxZW, wondering if you have found a good answer on the first point? I am quite troubled that, when using the default setting `LlamaForCausalLM.from_pretrained("./Llama-2-7b-hf", device_map='auto')`, we actually get an fp32 model. Based on the Hugging Face official post, the dtype should have been cast to fp16... I guess we can always choose to load with bf16, but I'm just wondering if there's a reason why the default loading gives you fp32.
The only thing I could think of is that, by loading the model weights as fp32 during pretraining/SFT, one has the option to perform mixed precision and keep some layers (e.g., batch norm) in fp32. You lose this flexibility when loading as bf16, as it will be pure bf16. However, at the inference stage it likely won't make sense to load into fp32, as the weights were trained in bf16. Wondering if you feel this thought is right? @HamidShojanazeri
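To make the mixed-precision point concrete, here is a rough sketch of the kind of FSDP policy I have in mind (just an illustration of the PyTorch API, not the exact settings used in this repo):

```python
import torch
from torch.distributed.fsdp import MixedPrecision

# Master weights stay in fp32; parameters, gradient reductions, and buffers
# are cast to bf16 for compute and communication. This is the flexibility
# that a pure-bf16 load gives up.
bf16_mixed = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
# Used as: FSDP(model, mixed_precision=bf16_mixed, ...)
```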
Many thanks!
> However, at the inference stage it likely won't make sense to load into fp32, as the weights were trained in bf16. Wondering if you feel this thought is right?
@hanyin88 that's right; for inference we are still using int8, however, you should use bf16 as an alternative.
Hi! It seems that the question in this issue has been solved and there have been no follow-up questions for some time. I am closing this issue now, but feel free to re-open it if more questions come up.