fastertransformer_backend
Memory usage is doubled when loading an fp16 model into bf16
Description
Model: GPT-NeoX
GPU: A100
Tritonserver version: 22.12
Hello, I'm not sure whether this is a FasterTransformer issue or a backend issue, but I'm reporting it here.
As the title says, my model was originally trained in fp16 on Hugging Face, and I converted it to the FasterTransformer weight format.
This is the command I used to convert, and the size of the result folder.
python huggingface_gptneox_convert.py -o {output_dir} -i {hf_model_dir} -infer_gpu_num 1 -model_name neox_model -weight_data_type fp16
$ du -h -d 1
25G ./1-gpu
25G .
As the output shows, the converted FasterTransformer weight folder is 25 GB, and the original Hugging Face model is also 25 GB.
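As a sanity check on what the converter actually wrote to disk: the .bin files in the output directory are raw dumps with no dtype header, so the element size can only be inferred from the file size. A minimal sketch, where the file name and the vocab/hidden sizes are hypothetical placeholders to be replaced with a real weight file from the 1-gpu directory and the values from your model's config.json:

import os

# Hypothetical file name -- substitute a real weight file from the 1-gpu/ output.
weight_path = "1-gpu/model.wte.bin"
# Hypothetical dimensions -- take vocab_size and hidden_size from your config.json.
vocab_size, hidden_size = 50432, 6144
expected_elements = vocab_size * hidden_size

bytes_per_element = os.path.getsize(weight_path) / expected_elements
print(f"{bytes_per_element:.1f} bytes per element")  # ~2.0 -> fp16 on disk, ~4.0 -> fp32

A 25 GB folder for a model of this size is consistent with 2 bytes per weight, i.e. fp16 on disk.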
The problem occurs when I load it with Triton server and fastertransformer_backend. When I load it as fp16, it loads fine.
I0906 06:12:34.269131 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation : free: 78.56 GB, total: 79.15 GB, used: 0.60 GB
I0906 06:12:56.704958 83 libfastertransformer.cc:448] After Loading Weights:
after allocation : free: 54.54 GB, total: 79.15 GB, used: 24.61 GB
But when I load it with bf16, it suddenly takes up twice the memory.
I0906 06:10:11.016121 83 libfastertransformer.cc:438] Before Loading Weights:
after allocation : free: 78.56 GB, total: 79.15 GB, used: 0.60 GB
I0906 06:11:07.674020 83 libfastertransformer.cc:448] After Loading Weights:
after allocation : free: 30.52 GB, total: 79.15 GB, used: 48.63 GB
I guess taking twice the memory means it is loaded as fp32, so does that mean you can't load a model saved as fp16 into bf16, or is it just that the GPT-NeoX model doesn't support the bf16 format?
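The log numbers are consistent with that guess. A quick back-of-the-envelope sketch using the values above (treating the logged GB as GiB, which doesn't affect the ratio):

GIB = 1024 ** 3

fp16_used = 24.61 * GIB   # weight memory after loading with data_type=fp16
bf16_used = 48.63 * GIB   # weight memory after loading with data_type=bf16

approx_params = fp16_used / 2                      # 2 bytes/weight in fp16
print(f"~{approx_params / 1e9:.1f}B parameters")   # ~13.2B
print(f"{bf16_used / approx_params:.2f} bytes per weight in the bf16 run")  # ~3.95

So in the bf16 configuration each weight appears to occupy roughly 4 bytes on the GPU, which matches the suspicion that the fp16-on-disk weights end up held in fp32-sized storage rather than being cast down to bf16.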
Reproduced Steps
In config.pbtxt
For fp16:
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
For bf16:
parameters {
  key: "data_type"
  value: {
    string_value: "bf16"
  }
}
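One thing worth checking alongside config.pbtxt is the dtype the converter recorded on disk versus what is requested at runtime. A small sketch, assuming the converter wrote a config.ini into the 1-gpu directory (the FT GPT converters do this; I'm assuming the GPT-NeoX one behaves the same):

import configparser

cfg = configparser.ConfigParser()
cfg.read("1-gpu/config.ini")   # assumed location under the converter's output directory

# Print whatever weight dtype the converter recorded, regardless of section name.
for section in cfg.sections():
    if "weight_data_type" in cfg[section]:
        print(f"[{section}] weight_data_type = {cfg[section]['weight_data_type']}")

# fp16 here combined with data_type "bf16" in config.pbtxt is exactly the
# mixed on-disk/runtime combination that shows the doubled memory above.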
Did you ever find a fix for this?
@devin12422 No. Since FasterTransformer is deprecated and TensorRT-LLM has succeeded it, I just used tensorrtllm_backend instead and it seemed to work fine.