
T5 not performing as expected

Open nrakltx opened this issue 3 years ago • 3 comments

Description

I am trying to optimize T5-small inference with FasterTransformer on a single V100. I followed all the steps in `t5_guide.md` exactly and got a sensible BLEU score. Yet when I measure end-to-end inference performance (including the time the client spends setting up `InputTensor`s, etc.), the speedup is far from the 22x promoted in the related blog post. I was not able to run with `fp16` because the model is not numerically stable in half precision (this has been mentioned multiple times in the `transformers` repo).
Am I missing something? Is there a way to run with `fp16` that I am not aware of?
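For context on the instability mentioned above: T5's intermediate activations are known to exceed the fp16 representable range, which turns them into `inf` and then `NaN` after normalization. This is a minimal, hypothetical illustration of the overflow (not FasterTransformer code), assuming NumPy is available:

```python
import numpy as np

# fp16 can represent values only up to 65504; some T5 activations
# grow past this, which is the root of the reported instability.
FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0

ok = np.float16(60000.0)        # still representable in fp16
overflow = np.float16(70000.0)  # exceeds FP16_MAX, becomes inf

print(FP16_MAX, ok, overflow)
```

This is why runs that cast the whole model to half precision can produce garbage output even when the fp32 BLEU score is sensible.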

Thanks in advance for your reply,

N

Reproduced Steps

Follow the T5 guide/blogpost.

nrakltx avatar Oct 31 '22 11:10 nrakltx

Can you share the scripts you use to run t5-small, and also share the results you see?

byshiue avatar Nov 01 '22 00:11 byshiue

I ran the e2e script in t5_utils and it encoded ~6500 tokens in 25 seconds, which is the same time PyTorch takes.
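As a rough sanity check on the figures above (a back-of-envelope sketch, not a benchmark; the 22x figure is the one claimed in the blog post):

```python
# Throughput implied by the reported run: ~6500 tokens in 25 seconds.
tokens = 6500
seconds = 25.0

throughput = tokens / seconds  # tokens per second
print(f"{throughput:.0f} tokens/s")  # prints "260 tokens/s"

# If the advertised ~22x speedup over PyTorch held here, the same
# workload would be expected to finish in roughly:
expected_seconds = seconds / 22
print(f"~{expected_seconds:.2f} s")  # prints "~1.14 s"
```

The gap between 25 s and the expected ~1 s is what the reporter is flagging as "the same time it takes for PyTorch".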

nrakltx avatar Nov 02 '22 08:11 nrakltx

Can you post your scripts and the results shown in the terminal?

byshiue avatar Nov 02 '22 10:11 byshiue