gpt-fast
Reducing Application Latency with torch.compile: Initialization and Inference Optimization
I can run the scripts successfully as explained in the repository, e.g., creating a quantized model and then running it with generate.py. However, the issue arises when I try to integrate this into my application. Our main goal is to reduce latency, so we don't want the compilation (torch.compile) to happen on every request.
Thus, I want to keep the compilation step in the initialization stage of the application so that it runs only once. Is that possible? Even when I run the generate function separately after compilation, the first call always takes a long time, and only the later inferences run at a good speed.
For example, what I am trying to do is:
First, I run the main function with compile=True so that the compilation runs:
snippet 1
from pathlib import Path

model_r, encoded_r, callback_r, tokenizer_r, model_size_r, prof_r = main(
    prompt="Hello, my name is",
    interactive=False,
    num_samples=1,
    max_new_tokens=128,
    top_k=200,
    temperature=0.8,
    checkpoint_path=Path("------model_int8.pth"),
    compile=True,
    compile_prefill=False,
    profile=None,
    draft_checkpoint_path=None,
    speculate_k=5,
    device="cuda",
)
Then, I run only the generate function with a new prompt, which comes from the user on each request:
snippet 2
prompt_2 = "Machine Learning is"
encoded_r_2 = encode_tokens(tokenizer_r, prompt_2, bos=True, device='cuda')
prompt_length_r_2 = encoded_r_2.size(0)
prompt_length_r_2
with prof_r:
    y, metrics = generate(
                model_r,
                encoded_r_2,
                512,
                draft_model=None,
                speculate_k=5,
                interactive=False,
                callback=callback_r,
                temperature=0.9,
                top_k=200,
            )
My approach is to call the main function during application initialization so that compilation happens there. Then I call the generate function (snippet 2) inside my inference function, which is invoked on every request from the frontend, as sketched below.
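To make the wiring concrete, here is a minimal sketch of what I have in mind. It assumes a slightly modified generate.py whose main() returns the objects shown in snippet 1 and is importable as a module; the LLMService class, its method names, and the final decode call are just illustrative, and profiling is omitted since profile=None.

from pathlib import Path

from generate import encode_tokens, generate, main  # gpt-fast's generate.py, assumed importable


class LLMService:
    """Compile once at application start-up, then reuse the model for every request."""

    def __init__(self, checkpoint_path: Path):
        # One-time initialization: load the int8 checkpoint and run main() with
        # compile=True so torch.compile is triggered here, not per request.
        (self.model, _encoded, self.callback, self.tokenizer,
         _model_size, self.prof) = main(
            prompt="Hello, my name is",
            interactive=False,
            num_samples=1,
            max_new_tokens=128,
            top_k=200,
            temperature=0.8,
            checkpoint_path=checkpoint_path,
            compile=True,
            compile_prefill=False,
            profile=None,
            draft_checkpoint_path=None,
            speculate_k=5,
            device="cuda",
        )

    def infer(self, prompt: str, max_new_tokens: int = 512) -> str:
        # Per-request path: only tokenize and generate; no compilation is intended here.
        encoded = encode_tokens(self.tokenizer, prompt, bos=True, device="cuda")
        y, _metrics = generate(
            self.model,
            encoded,
            max_new_tokens,
            draft_model=None,
            speculate_k=5,
            interactive=False,
            callback=self.callback,
            temperature=0.9,
            top_k=200,
        )
        return self.tokenizer.decode(y.tolist())

The intention is that __init__ runs once when the application starts (paying the compilation cost there) and infer() is the only code path each request goes through.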
However, the ISSUE is that the first time snippet 2 is called, it is very slow (~7 tokens/sec). Afterwards, it runs at a good speed (~90 tokens/sec). I don't understand why the first call is so slow, given that no compilation should be happening at that point.
And is there any other way to cache the compilation results so that they can be reused seamlessly for inference?
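For example, is something along these lines supposed to help? I am not sure these TorchInductor environment variables are the right mechanism here (that is an assumption on my side); they are just the kind of persistent, on-disk caching I have in mind, and they would have to be set before the first torch.compile call in the process:

import os

# Assumption: point TorchInductor's on-disk cache at a persistent directory so
# compiled artifacts survive process restarts instead of landing in a temp dir.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/path/to/persistent/inductor_cache"

# Assumption: enable the FX graph cache (available in recent PyTorch releases)
# so previously compiled graphs can be reloaded instead of recompiled.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"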
Please help me understand this issue.