llm Better generation stats

I'm currently facing an issue where generation on a GPU sometimes slows down, and it's very hard to determine why. (see https://github.com/rustformers/llm/pull/325)

It would be great if we had an option to get more detailed information from the generation process. Maybe we could divide the per-token times into the following categories:

Forward pass: Raw time spent in the evaluate function of the model
Sampler: Time spent sampling the tokens
Decoding: Time taken by the tokenizer to decode the tokens
Printing: Time spent invoking the callback and printing to the CLI

Jun 25 '23 15:06 LLukas22

It would also be helpful to see the max and min time of each category, alongside the mean.
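
A rough sketch of what tracking min, max, and mean per category could look like. `DurationStats` is a hypothetical helper made up for illustration, not something that exists in llm; one instance per category (forward pass, sampler, decoding, printing) would be enough.

```rust
use std::time::Duration;

/// Running min / max / mean for one timing category.
/// Hypothetical helper for illustration; not part of the llm crate.
#[derive(Debug, Clone, Copy, Default)]
pub struct DurationStats {
    count: u32,
    total: Duration,
    min: Option<Duration>,
    max: Option<Duration>,
}

impl DurationStats {
    /// Adds one per-token measurement to the running statistics.
    pub fn record(&mut self, sample: Duration) {
        self.count += 1;
        self.total += sample;
        self.min = Some(self.min.map_or(sample, |m| m.min(sample)));
        self.max = Some(self.max.map_or(sample, |m| m.max(sample)));
    }

    /// Shortest sample seen so far, or `None` if nothing was recorded.
    pub fn min(&self) -> Option<Duration> {
        self.min
    }

    /// Longest sample seen so far, or `None` if nothing was recorded.
    pub fn max(&self) -> Option<Duration> {
        self.max
    }

    /// Arithmetic mean of all samples, or `None` if nothing was recorded.
    pub fn mean(&self) -> Option<Duration> {
        if self.count == 0 {
            None
        } else {
            Some(self.total / self.count)
        }
    }
}
```

Keeping only a count, a running total, and the two extremes means nothing has to be stored per token, so the overhead stays constant no matter how long the generation runs.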

Jun 25 '23 17:06 jafioti

Sounds good to me, would anyone be interested in doing this?

Jun 25 '23 20:06 philpax

I could give it a try, but I'm still kinda busy with the CUDA/OpenCL stuff and I have no idea how I would implement performance metrics and logging correctly in Rust 😬

Jun 26 '23 11:06 LLukas22

You can probably just use std::time::Instant - it should be precise enough for this application. Just create some Instants at each measurement point, then call .elapsed() on them to find the amount of time that has passed since that instant.
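
Something like the following minimal sketch, which times the four categories suggested earlier in this thread. `TokenTimings`, `timed`, and `time_step` are hypothetical names, and the closures are stand-ins for the real evaluate / sample / decode / callback steps, since the actual llm function signatures are not shown here. Only `Instant::now()` and `.elapsed()` are real standard library calls.

```rust
use std::time::{Duration, Instant};

/// Per-token timings for the four proposed categories.
/// Hypothetical struct for illustration; not part of the llm crate.
#[derive(Debug, Default, Clone, Copy)]
pub struct TokenTimings {
    pub forward_pass: Duration,
    pub sampler: Duration,
    pub decoding: Duration,
    pub printing: Duration,
}

/// Runs `f` and returns its result together with the elapsed wall-clock time.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Times one generation step. The closures are assumptions standing in for
/// the model's evaluate function, the sampler, the tokenizer, and the callback.
pub fn time_step<L, T>(
    evaluate: impl FnOnce() -> L,
    sample: impl FnOnce(L) -> T,
    decode: impl FnOnce(T) -> String,
    print: impl FnOnce(&str),
) -> TokenTimings {
    let (logits, forward_pass) = timed(evaluate);
    let (token, sampler) = timed(|| sample(logits));
    let (text, decoding) = timed(|| decode(token));
    let ((), printing) = timed(|| print(&text));
    TokenTimings {
        forward_pass,
        sampler,
        decoding,
        printing,
    }
}
```

Each `Instant` is created right before the step it measures, so the four durations do not overlap and can be summed to get the total per-token time.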

Jun 26 '23 23:06 philpax