exo
The return of the step function is time-consuming
I am running exo on an M1 MacBook Pro with 16 GB of RAM and a 4.58 TFLOPS GPU. In theory, decoding should be very fast, but in practice it generates only 10 tokens per second. After debugging the code, I found that the return of the StatefulShardedModel.step function in the exo\inference\mlx\sharded_model.py file consumes a lot of time. It is just a single line of code, the return. I would like to understand what happens in this return and why it takes so much time.
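One thing I have not ruled out is that the time is not spent in the return itself but in deferred work that only gets forced around that point, since MLX evaluates arrays lazily. The sketch below shows how the two costs could be timed separately. It is not exo or MLX code: `LazyValue`, `timed`, and this `step` are hypothetical stand-ins I made up for illustration.

```python
import time


class LazyValue:
    """Hypothetical stand-in for a lazily evaluated array.

    The wrapped function runs only when force() is called, and the
    result is cached afterwards.
    """

    def __init__(self, fn):
        self._fn = fn
        self._value = None
        self._done = False

    def force(self):
        if not self._done:
            self._value = self._fn()
            self._done = True
        return self._value


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def step():
    # Hypothetical step: only records the computation, does not run it.
    return LazyValue(lambda: sum(i * i for i in range(200_000)))


# Time building the deferred computation and forcing it separately.
lazy, build_seconds = timed(step)
value, force_seconds = timed(lazy.force)

print(f"build: {build_seconds:.6f}s, force: {force_seconds:.6f}s")
```

With MLX, the forcing step would correspond to calling `mx.eval` on the array returned by `step` before stopping the timer. If most of the time shows up there, the return line is just where the deferred computation becomes visible, not the real cost.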