Max Ren
@tdasika17 just wanted to follow up to make sure we have a path forward for this. Let me know if you're still encountering issues with the high inference times.
@tdasika17 could you share the profiled timings associated with the above table? The previous timings .txt files show a native_call_mm.out entry, however the ops lists shared don't include native_call_mm.out....
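If it's easier than sharing the raw .txt files, the per-operator timings can also be pulled out of an ETDump with the devtools Inspector. A minimal sketch, assuming the ETDump/ETRecord flow; the file paths here are placeholders and the exact API may vary between ExecuTorch versions:

```python
# Sketch: pull per-operator timings out of an ETDump from a profiled run.
# Assumes the ExecuTorch devtools (ETRecord + ETDump) flow; the file paths
# are placeholders and the exact API may differ between versions.
from executorch.devtools import Inspector

inspector = Inspector(
    etdump_path="etdump.etdp",   # ETDump emitted by the instrumented runner (placeholder)
    etrecord="etrecord.bin",     # ETRecord saved at export time (placeholder)
)

# Print a table of profiled events, including individual operator calls
# (e.g. the mm kernels), with their measured latencies.
inspector.print_data_tabular()
```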
Hi @tdasika17, based on your timings it looks like the model inference is only taking 71ms, and model load is around 167ms. I believe this should be significantly faster...
@tdasika17 thanks for the clarification! The ExecuTorch model is expected to be faster than the PyTorch model on CPU, but it might be helpful to share how the PyTorch...
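For reference, here's a rough sketch of how I'd time the eager PyTorch path on CPU so the comparison is apples-to-apples; the model, input shape, thread count, and iteration counts below are stand-ins, so substitute your actual model and inputs:

```python
# Rough sketch: time the eager PyTorch model on CPU for comparison with the
# ExecuTorch numbers. The model and input below are stand-ins for illustration.
import time
import torch

torch.set_num_threads(1)  # match the thread count used for the ExecuTorch run

# Stand-in model; replace with the actual model being exported.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).eval()
example_input = torch.randn(1, 512)  # stand-in input shape

with torch.inference_mode():
    for _ in range(10):  # warm-up so one-time allocations don't skew the timing
        model(example_input)

    iters = 100
    start = time.perf_counter()
    for _ in range(iters):
        model(example_input)
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3

print(f"eager PyTorch CPU latency: {elapsed_ms:.3f} ms/iter")
```

Matching the thread count and input shapes used for the ExecuTorch run matters here, otherwise the two numbers aren't really comparable.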
@tdasika17 yes, this is definitely useful. I think the graph you sent is for a single thread. Would it be possible to share the entire .svg file? This would definitely...
Would it be possible to share the flame graph files? It'll help with inspecting the call stacks. On a cursory look at these, I can't immediately tell what the discrepancy is...
> For Torch model, I just ran the Torch c++ application and captured the graphs for C++ application execution

What do you mean by this? What is the capture flow?...
We see some rather significant speed up on prefill performance for Llama Models:

### Before:
```
I 00:00:05.587790 executorch:stats.h:84] Prompt Tokens: 64 Generated Tokens: 63
I 00:00:05.587793 executorch:stats.h:90] Model Load...
@alankelly @gonnet