Awni Hannun


How long is that prompt? Do you mind copying it here in text form so I can try it directly?

Can confirm: it's really slow on the longer prompt.

@jeanromainroy if it's possible, can you try rebooting your machine? That seems to resolve the speed issue on my end. I can generate quite quickly with the prompt you provided.

Ok let me see if I can reproduce the bad state. Just starting and killing the flask server is enough to make it slow down? That's pretty wild.

I ran the server / flask app you posted, then ctrl+c'd it. Then I ran the model normally and it was the same speed (generating reasonably fast, e.g. about 7.5 tps)....

Thanks for the data point. Still looking into a better solution for that.

> With MLX it seems to use all of the GPU, but when it starts generating the output, it drops significantly and seems to rely on CPU. What do you...

Did you try setting the sysctl `sudo sysctl iogpu.disable_wired_collector=1`? That usually helps.
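For reference, a minimal sketch of applying and checking that setting (assuming a recent macOS that exposes the `iogpu.disable_wired_collector` key; it needs admin rights and resets on reboot):

```shell
# Apply the setting (takes effect immediately, does not persist across reboots)
sudo sysctl iogpu.disable_wired_collector=1

# Verify the current value
sysctl iogpu.disable_wired_collector
```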

I think it's only available on 15.0. Did it improve the token generation speed / GPU utilization?

This is cool, and I think it would be nice to support. We might be able to do it with a far smaller diff, however. Something like:

- Have a...