The return statement of the step function is time-consuming

Open HysenX-LI opened this issue 1 year ago • 0 comments

I run exo on an M1 MacBook Pro with 16 GB of RAM and a 4.58 TFLOPS GPU. In theory, decoding should be very fast, but in practice it generates only 10 tokens per second. After debugging the code, I found that the return statement of the StatefulShardedModel.step function in exo\inference\mlx\sharded_model.py consumes a lot of time. It is just a single line of code, the return. I wonder what happens in this return and why it takes so much time.
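One possible explanation, offered as an assumption rather than a confirmed diagnosis: MLX evaluates arrays lazily, so the model's forward pass may only build a computation graph, and the real work runs at the first line that needs a concrete value. If the return line is that point, a line-by-line timer would charge the whole forward pass to it. The sketch below illustrates the effect with plain Python only. `LazyValue`, `expensive_forward_pass`, and `step` are hypothetical stand-ins and are not exo or MLX code.

```python
import time


class LazyValue:
    """Hypothetical stand-in for a lazily evaluated array (not an MLX type)."""

    def __init__(self, fn):
        self._fn = fn          # deferred computation, not yet run
        self._value = None
        self._done = False

    def force(self):
        # The deferred work runs here, the first time the value is needed.
        if not self._done:
            self._value = self._fn()
            self._done = True
        return self._value


def expensive_forward_pass():
    # Simulates the real compute with a fixed 50 ms delay.
    time.sleep(0.05)
    return 42


def timed(fn):
    # Returns the result of fn() and the wall-clock seconds it took.
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def step():
    # "Calling the model": only records the work, so it looks almost free.
    out, build_time = timed(lambda: LazyValue(expensive_forward_pass))
    # "The return line": materializing the value pays for the whole pass.
    value, force_time = timed(out.force)
    return value, build_time, force_time


if __name__ == "__main__":
    value, build_time, force_time = step()
    print(f"build: {build_time * 1e3:.3f} ms, force: {force_time * 1e3:.3f} ms")
```

If this is what happens in exo, calling `mx.eval(...)` on the output just before the return line and timing that call separately should move the cost away from the return. That would show the time is the model computation itself and not the return.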

Sep 30 '24 02:09 HysenX-LI