Awni Hannun
Your solutions sound reasonable:
- Force the order for CUDA
- IFDEF on heap for Windows
Awesome progress so far @zcbenz!! I'm wondering what the best way is to get this incorporated into MLX. I can think of a couple of options:
- Once this is...
> This comes with a limitation of maximum ndim in arrays, which PyTorch sets to 25; I'm using 8 for now and it can be easily changed if found not...
> some of them are the slow `Event::is_signaled` calls that need to be improved.

Where are those calls coming from? Are they from [here](https://github.com/ml-explore/mlx/blob/main/mlx/transforms.cpp#L206-L207)? We might be able to reduce the...
Very nice @zcbenz!

> To get rid of this latency, I improved the CUDA backend by saving operands and temporaries of the op until finalize() is called, i.e. when...
Hmm, that's the gradient of the product, which I believe uses cumprod. This is basically a duplicate of https://github.com/ml-explore/mlx/issues/673: scan ops don't currently work on 64-bit types. We...
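Until that's fixed, one possible workaround is to run the scan at 32-bit precision and cast back; a minimal sketch, assuming the intermediate products fit in 32 bits:

```python
import mlx.core as mx

# Hypothetical workaround while scan ops lack 64-bit support: do the
# cumulative product in int32 and cast back. Only valid when every
# intermediate product fits in 32 bits.
x = mx.array([1, 2, 3, 4], dtype=mx.int64)
y = mx.cumprod(x.astype(mx.int32)).astype(mx.int64)
print(y)  # array([1, 2, 6, 24], dtype=int64)
```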
In case it's useful, here is a [reference implementation](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/chat.py#L12-L42) in the Python version of MLX LM.
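For context, a condensed sketch of what that example does (written from memory, so names and arguments may differ slightly from the actual file):

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

# Load a model and build a reusable prompt cache so the conversation
# history isn't re-processed on every turn.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
prompt_cache = make_prompt_cache(model)

while True:
    query = input(">> ")
    messages = [{"role": "user", "content": query}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    generate(model, tokenizer, prompt, prompt_cache=prompt_cache, verbose=True)
```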
The length of the cache (in tokens) plus any generated text should stay below the maximum context size of the model. It's not checked in the Python version, though, as...
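If you want to guard against this yourself, the check is just arithmetic; a sketch, where `cache_tokens` and `max_context` are hypothetical names for values you'd track on your side:

```python
def fits_in_context(cache_tokens: int, max_new_tokens: int, max_context: int) -> bool:
    # The cached history plus the tokens you intend to generate must not
    # exceed the model's context window.
    return cache_tokens + max_new_tokens <= max_context
```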
> I do wish there were an easier way to delegate the parameters we'd want to use with Muon and others with AdamW

Can you say more about that?
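For reference, one way to do this today is to partition the parameter tree by shape and run two optimizers side by side. A rough sketch, assuming the `Optimizer.apply_gradients` API and using a second AdamW as a stand-in for a Muon implementation:

```python
import mlx.optimizers as optim
from mlx.utils import tree_flatten, tree_unflatten

# Stand-ins: swap the first optimizer for a real Muon implementation.
muon = optim.AdamW(learning_rate=0.02)   # placeholder for Muon
adamw = optim.AdamW(learning_rate=3e-4)

def step(model, grads):
    flat_g = tree_flatten(grads)
    flat_p = dict(tree_flatten(model.parameters()))
    # Route 2-D weight matrices to "Muon", everything else to AdamW.
    muon_g = [(k, g) for k, g in flat_g if g.ndim == 2]
    rest_g = [(k, g) for k, g in flat_g if g.ndim != 2]
    updated = []
    for opt, part in ((muon, muon_g), (adamw, rest_g)):
        params = tree_unflatten([(k, flat_p[k]) for k, _ in part])
        new = opt.apply_gradients(tree_unflatten(part), params)
        updated += tree_flatten(new)
    model.update(tree_unflatten(updated))
```

Whether "2-D-ness" is the right routing rule is model-dependent (embeddings are 2-D too), which is part of why a first-class way to express this delegation would be nice.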
Thanks for the detailed explanation, that makes sense!