GPT-2 slow inference
With the GPT-2 345M model, running inference on batches of 10 to 100 documents of roughly 60 tokens each takes ~15 ms per inference on a Tesla T4 GPU. Why? That looks really bad for anyone hoping to hook this up into a realtime pipeline: it works out to ~66 inferences/second, which isn't sufficient for many realtime systems. Am I missing something? Is there an up-to-date benchmark I can compare my numbers against?
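For whoever wants to reproduce numbers like these, here is a minimal timing-harness sketch. It uses a toy feed-forward block as a stand-in for the real GPT-2 weights (which would need a download); the model, sizes, and run counts are my assumptions, but the pattern (warm-up, `torch.no_grad()`, averaging over runs) is the same you'd use with `GPT2LMHeadModel`:

```python
import time
import torch

# Toy stand-in for a GPT-2 layer (hypothetical sizes); the point is the
# timing harness, not the model itself.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)
model.eval()

def time_batch(batch_size, seq_len=60, n_runs=20):
    """Return mean latency (seconds) per forward pass for one batch."""
    x = torch.randn(batch_size, seq_len, 768)
    with torch.no_grad():              # no autograd bookkeeping at inference
        model(x)                       # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        return (time.perf_counter() - start) / n_runs

# Throughput in documents/second = batch_size / latency per batch
lat_small = time_batch(10)
lat_large = time_batch(100)
print(f"batch 10:  {10 / lat_small:.0f} docs/s")
print(f"batch 100: {100 / lat_large:.0f} docs/s")
```

Comparing the two batch sizes shows how much of the per-call overhead gets amortized by batching, which matters when you quote an "inferences/second" number.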
Environment: PyTorch 1.4.0, transformers 2.3.0, CUDA 10.1, apex 0.1
Does apex actually speed up generation? I tested fp16 on fine-tuning (TensorFlow) and it didn't seem to reduce memory usage.
I reduced memory usage with torch.no_grad(), but fp16 definitely improved inference times in PyTorch. What times are you getting in TensorFlow, and for what token lengths?
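To make the two tricks concrete, here is a minimal sketch of both. A single `torch.nn.Linear` stands in for the GPT-2 model (with the real thing you'd load `GPT2LMHeadModel.from_pretrained(...)` instead); the sizes are assumptions for illustration:

```python
import torch

# Stand-in layer; substitute your loaded GPT-2 model here.
model = torch.nn.Linear(768, 768)
model.eval()

x = torch.randn(8, 60, 768)  # batch of 8 "documents", 60 tokens each

# 1) torch.no_grad(): skips storing activations for backward,
#    which is where the memory saving at inference comes from.
with torch.no_grad():
    out_fp32 = model(x)

# 2) fp16: .half() halves weight/activation memory; the speedup is
#    mainly on tensor-core GPUs (e.g. the T4 mentioned above), so
#    only attempt it when CUDA is available.
if torch.cuda.is_available():
    model_fp16 = model.half().cuda()
    with torch.no_grad():
        out_fp16 = model_fp16(x.half().cuda())
    assert out_fp16.dtype == torch.float16
```

Note that apex's `amp` does mixed precision more carefully (keeping some ops in fp32 for stability); a blanket `.half()` like the above is the crudest version of the idea.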