Better generation stats
I'm currently facing an issue where generation on a GPU sometimes slows down, and it's very hard to determine why. (see https://github.com/rustformers/llm/pull/325)
It would be great if we could have an option to get more detailed information from the generation process. Maybe we could divide the per-token times into the following categories:
- Forward pass: Raw time spent in the `evaluate` function of the model
- Sampler: Time spent sampling the tokens
- Decoding: Time taken by the tokenizer to decode the tokens
- Printing: Time spent invoking the callback and printing to the CLI
It would also be helpful to see the max and min time of each category, alongside the mean.
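To make the idea concrete, here is a rough sketch of what the per-category bookkeeping could look like. The names `StageStats` and `GenerationStats` are made up for illustration and are not part of the existing `llm` API; it only assumes one `Duration` sample per token per category.

```rust
use std::time::Duration;

/// Running min / max / mean for one timing category.
/// Hypothetical type, not something that exists in the crate today.
#[derive(Debug, Clone, Copy, Default)]
pub struct StageStats {
    count: u32,
    total: Duration,
    min: Option<Duration>,
    max: Option<Duration>,
}

impl StageStats {
    /// Adds one per-token measurement to the running statistics.
    pub fn record(&mut self, sample: Duration) {
        self.count += 1;
        self.total += sample;
        self.min = Some(self.min.map_or(sample, |m| m.min(sample)));
        self.max = Some(self.max.map_or(sample, |m| m.max(sample)));
    }

    /// Shortest sample seen so far, or `None` if nothing was recorded.
    pub fn min(&self) -> Option<Duration> {
        self.min
    }

    /// Longest sample seen so far, or `None` if nothing was recorded.
    pub fn max(&self) -> Option<Duration> {
        self.max
    }

    /// Arithmetic mean of all samples, or `None` if nothing was recorded.
    pub fn mean(&self) -> Option<Duration> {
        if self.count == 0 {
            None
        } else {
            Some(self.total / self.count)
        }
    }
}

/// One accumulator per category from the list above (hypothetical layout).
#[derive(Debug, Clone, Copy, Default)]
pub struct GenerationStats {
    pub forward_pass: StageStats,
    pub sampler: StageStats,
    pub decoding: StageStats,
    pub printing: StageStats,
}
```

Keeping only a count, a total, and the two extremes means nothing has to be stored per token, so the overhead should stay negligible next to the forward pass.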
Sounds good to me, would anyone be interested in doing this?
I could give it a try, but I'm still kinda busy with the CUDA/OpenCL stuff and I have no idea how I would implement performance metrics and logging correctly in Rust 😬
You can probably just use std::time::Instant - it should be precise enough for this application. Just create some Instants at each measurement point, then call .elapsed() on them to find the amount of time that has passed since that instant.
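Something along these lines, as a minimal sketch: `Instant::now()` and `.elapsed()` are the real standard-library calls, while `timed`, `TokenTimings`, `time_one_token`, and the closures standing in for the model, sampler, and tokenizer are placeholders invented for the example.

```rust
use std::time::{Duration, Instant};

/// Runs `f` and returns its result together with the wall-clock time it took.
/// Hypothetical helper, not an existing function in the crate.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Per-token timings for the four categories, plus the decoded text.
struct TokenTimings {
    text: String,
    forward_pass: Duration,
    sampler: Duration,
    decoding: Duration,
    printing: Duration,
}

/// Times one generation step. Every stage body here is a stand-in;
/// the real code would call the model, sampler, tokenizer, and callback.
fn time_one_token() -> TokenTimings {
    let vocab = ["a", "b", "c"];

    // Stand-in for the forward pass: pretend these are the logits.
    let (logits, forward_pass) = timed(|| vec![0.1_f32, 0.7, 0.2]);

    // Stand-in for the sampler: greedy argmax over the logits.
    let (token_id, sampler) = timed(|| {
        logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i)
            .unwrap_or(0)
    });

    // Stand-in for the tokenizer decoding the sampled token.
    let (text, decoding) = timed(|| vocab[token_id].to_string());

    // Stand-in for the callback that prints to the CLI.
    let ((), printing) = timed(|| println!("{text}"));

    TokenTimings { text, forward_pass, sampler, decoding, printing }
}
```

Each returned `Duration` can then be folded into whatever running min/max/mean you keep per category.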