fastertransformer_backend
Memory usage is doubled when loading an fp16 model into bf16
Description
Model: GPT-NeoX
GPU: A100
Tritonserver version: 22.12
Hello, I'm not sure whether this is a FasterTransformer issue or a backend issue, but I'm reporting it here.
As the title says, my model was originally trained in fp16 on Hugging Face, and I converted it to the FasterTransformer weight format.
This is the command I used to convert, and the size of the result folder.
python huggingface_gptneox_convert.py -o {output_dir} -i {hf_model_dir} -infer_gpu_num 1 -model_name neox_model -weight_data_type fp16
$ du -h -d 1
25G ./1-gpu
25G .
As the output shows, the converted FasterTransformer weight folder is 25 GB, and the original Hugging Face model is also 25 GB.
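As a sanity check on what the converter actually wrote to disk: the .bin files in the output directory are raw dumps with no dtype header, so the element size can only be inferred from the file size. A minimal sketch, where the file name and the vocab/hidden sizes are hypothetical placeholders to be replaced with a real weight file from the 1-gpu directory and the values from your model's config.json:

import os

# Hypothetical file name -- substitute a real weight file from the 1-gpu/ output.
weight_path = "1-gpu/model.wte.bin"
# Hypothetical dimensions -- take vocab_size and hidden_size from your config.json.
vocab_size, hidden_size = 50432, 6144
expected_elements = vocab_size * hidden_size

bytes_per_element = os.path.getsize(weight_path) / expected_elements
print(f"{bytes_per_element:.1f} bytes per element")  # ~2.0 -> fp16 on disk, ~4.0 -> fp32

A 25 GB folder for a model of this size is consistent with 2 bytes per weight, i.e. fp16 on disk.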
The problem occurs when I load it with Triton server and fastertransformer_backend. When I load it as fp16, it loads fine.
I0906 06:12:34.269131 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation : free: 78.56 GB, total: 79.15 GB, used: 0.60 GB
I0906 06:12:56.704958 83 libfastertransformer.cc:448] After Loading Weights:
after allocation : free: 54.54 GB, total: 79.15 GB, used: 24.61 GB
But when I load it with bf16, it suddenly takes up twice the memory.
I0906 06:10:11.016121 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation : free: 78.56 GB, total: 79.15 GB, used: 0.60 GB
I0906 06:11:07.674020 83 libfastertransformer.cc:448] After Loading Weights:
after allocation : free: 30.52 GB, total: 79.15 GB, used: 48.63 GB
I guess taking twice the memory means it is loaded as fp32, so does that mean you can't load a model saved as fp16 into bf16, or is it just that the GPT-NeoX model doesn't support the bf16 format?
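The log numbers are consistent with that guess. A quick back-of-the-envelope sketch using the values above (treating the logged GB as GiB, which doesn't affect the ratio):

GIB = 1024 ** 3

fp16_used = 24.61 * GIB   # weight memory after loading with data_type=fp16
bf16_used = 48.63 * GIB   # weight memory after loading with data_type=bf16

approx_params = fp16_used / 2                      # 2 bytes/weight in fp16
print(f"~{approx_params / 1e9:.1f}B parameters")   # ~13.2B
print(f"{bf16_used / approx_params:.2f} bytes per weight in the bf16 run")  # ~3.95

So in the bf16 configuration each weight appears to occupy roughly 4 bytes on the GPU, which matches the suspicion that the fp16-on-disk weights end up held in fp32-sized storage rather than being cast down to bf16.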
Reproduced Steps
In config.pbtxt
For fp16:
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
For bf16:
parameters {
  key: "data_type"
  value: {
    string_value: "bf16"
  }
}
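One thing worth checking alongside config.pbtxt is the dtype the converter recorded on disk versus what is requested at runtime. A small sketch, assuming the converter wrote a config.ini into the 1-gpu directory (the FT GPT converters do this; I'm assuming the GPT-NeoX one behaves the same):

import configparser

cfg = configparser.ConfigParser()
cfg.read("1-gpu/config.ini")   # assumed location under the converter's output directory

# Print whatever weight dtype the converter recorded, regardless of section name.
for section in cfg.sections():
    if "weight_data_type" in cfg[section]:
        print(f"[{section}] weight_data_type = {cfg[section]['weight_data_type']}")

# fp16 here combined with data_type "bf16" in config.pbtxt is exactly the
# mixed on-disk/runtime combination that shows the doubled memory above.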
Did you ever find a fix for this?
@devin12422 No. Since FasterTransformer is deprecated and TensorRT-LLM has succeeded it, I just used tensorrtllm_backend instead and it seemed to work fine.