GPT-2 slow inference
With the GPT-2 345M model, running inference on batches of 10 to 100 documents of roughly 60 tokens each takes ~15 ms per inference on a Tesla T4 GPU. Why? That looks really bad for anyone hoping to hook this up into a realtime pipeline: it works out to ~66 inferences/second, which isn't sufficient for many realtime systems. Am I missing something? Is there an up-to-date benchmark I can compare my numbers against?
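For whoever wants to reproduce numbers like these, here is a minimal timing-harness sketch. It uses a toy feed-forward block as a stand-in for the real GPT-2 weights (which would need a download); the model, sizes, and run counts are my assumptions, but the pattern (warm-up, `torch.no_grad()`, averaging over runs) is the same you'd use with `GPT2LMHeadModel`:

```python
import time
import torch

# Toy stand-in for a GPT-2 layer (hypothetical sizes); the point is the
# timing harness, not the model itself.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)
model.eval()

def time_batch(batch_size, seq_len=60, n_runs=20):
    """Return mean latency (seconds) per forward pass for one batch."""
    x = torch.randn(batch_size, seq_len, 768)
    with torch.no_grad():              # no autograd bookkeeping at inference
        model(x)                       # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        return (time.perf_counter() - start) / n_runs

# Throughput in documents/second = batch_size / latency per batch
lat_small = time_batch(10)
lat_large = time_batch(100)
print(f"batch 10:  {10 / lat_small:.0f} docs/s")
print(f"batch 100: {100 / lat_large:.0f} docs/s")
```

Comparing the two batch sizes shows how much of the per-call overhead gets amortized by batching, which matters when you quote an "inferences/second" number.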
Environment: PyTorch 1.4.0, transformers 2.3.0, CUDA 10.1, apex 0.1
Does apex actually speed up generation? I tested fp16 on fine-tuning (TensorFlow) and it didn't seem to reduce memory usage.
I reduced memory usage with torch.no_grad(), but fp16 definitely improved inference times in PyTorch. What times are you getting in TensorFlow, and for what token lengths?
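To make the two tricks concrete, here is a minimal sketch of both. A single `torch.nn.Linear` stands in for the GPT-2 model (with the real thing you'd load `GPT2LMHeadModel.from_pretrained(...)` instead); the sizes are assumptions for illustration:

```python
import torch

# Stand-in layer; substitute your loaded GPT-2 model here.
model = torch.nn.Linear(768, 768)
model.eval()

x = torch.randn(8, 60, 768)  # batch of 8 "documents", 60 tokens each

# 1) torch.no_grad(): skips storing activations for backward,
#    which is where the memory saving at inference comes from.
with torch.no_grad():
    out_fp32 = model(x)

# 2) fp16: .half() halves weight/activation memory; the speedup is
#    mainly on tensor-core GPUs (e.g. the T4 mentioned above), so
#    only attempt it when CUDA is available.
if torch.cuda.is_available():
    model_fp16 = model.half().cuda()
    with torch.no_grad():
        out_fp16 = model_fp16(x.half().cuda())
    assert out_fp16.dtype == torch.float16
```

Note that apex's `amp` does mixed precision more carefully (keeping some ops in fp32 for stability); a blanket `.half()` like the above is the crudest version of the idea.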