bhsueh_NV
Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problems.
1. It is possible. When you convert the model successfully, the model configuration is saved in a config file and the backend reads it, as shown [here](https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/t5/fastertransformer/1/config.ini). 2. Please...
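For reference, a minimal sketch of how you could inspect such a converted config with Python's standard `configparser`; the file path is taken from the link above, and the code makes no assumption about which sections or keys the converter actually writes.

```python
# Minimal sketch: inspect the config.ini written by the weight converter.
# It only lists whatever sections/keys exist; check the file produced by
# your own conversion for the real layout.
import configparser

config = configparser.ConfigParser()
config.read("all_models/t5/fastertransformer/1/config.ini")

for section in config.sections():
    print(f"[{section}]")
    for key, value in config[section].items():
        print(f"  {key} = {value}")
```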
Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problems.
It requires modifying the source code.
If you use the decoder op, the output is the result of the transformer blocks. If you use the decoding op, you need to modify the op and the FT source code.
We don't verify on all possible GPUs, but it should work. If you encounter an error, you are welcome to file a bug here.
Closing this bug because it is inactive. Feel free to re-open this issue if you still have any problems.
Currently, we don't see an obvious accuracy drop when running FP16 on a BF16 model. We also support BF16 on GPT; you can save the model as FP32 and run inference in BF16...
bfloat16 computation is supported in most models in the latest release. Because we don't have a good way to save the bfloat16 weights yet, you still need to store the model...
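As a rough illustration of the suggested workflow (store FP32 weights, compute in bfloat16), here is a hedged PyTorch sketch; the file name `model_fp32.bin` and the cast-at-load step are illustrative assumptions, not the FasterTransformer converter or runtime API.

```python
# Sketch of "store FP32 weights, run inference in bfloat16".
# File names and the casting step are illustrative only.
import torch

# Weights are stored on disk in FP32...
state_dict_fp32 = torch.load("model_fp32.bin", map_location="cpu")

# ...and cast to bfloat16 only when loaded for inference.
state_dict_bf16 = {
    name: tensor.to(torch.bfloat16) if tensor.is_floating_point() else tensor
    for name, tensor in state_dict_fp32.items()
}

# Quick check of the resulting dtypes for a few parameters.
print({name: t.dtype for name, t in list(state_dict_bf16.items())[:3]})
```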