broken tinygrad responses
I'm getting bad responses from tinygrad (except for the first one). Running on an M2 Mac Mini, I've hardcoded the inference_engine_name value to use tinygrad instead of MLX. I'm seeing the same thing happen with 3B too; I haven't tried other models. Will update if I find more info.
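For reference, this is roughly how I forced the engine selection (illustrative only; the actual variable lives in exo's engine-selection code and the exact location may differ):

```python
# Hypothetical sketch: overriding the auto-detected inference engine.
# On Apple silicon exo would normally pick "mlx"; hardcoding the name
# forces the tinygrad backend instead.
inference_engine_name = "tinygrad"  # was auto-detected as "mlx" on the M2
```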
MLX output:
Tinygrad output:
Why is this a "bad" output?
tinygrad and MLX are using slightly different models. It's one of the magical things about exo: different models are interoperable.
In the tinygrad screenshot it hasn't answered what I asked in the second prompt at all. Try having a conversation with 1B using MLX and then with tinygrad: I'm just getting nonsense and/or irrelevant responses after the first one, as if it's not receiving the prompts correctly.

Didn't realize the models were different. If this is just down to the models being different then it's obviously not an issue, but it does feel like more than a slight difference.
The example you gave has context of the previous part of the conversation. Can you give an example where it doesn't have context of the previous part?
The example you gave seems totally reasonable.
ah this might explain more
MLX:
Tinygrad:
@AlexCheema Yeah, this looks like a context bug to me, and it makes an argument for spending some time reconciling the different caching methods between these implementations and fully utilizing the cache for inference over a chat session, rather than stacking prompts within the API. This would also fix some bugs that happen during long sessions from passing too large a context into the inference model.
This kind of change will dovetail well with some other inference-engine compatibility generalizations, so it seems like a good thing to do.
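To make the "stacking prompts" point concrete, here's a toy sketch (not exo's actual API; the class names and tokenizer are made up) contrasting the two strategies: re-sending the full chat history every turn versus keeping a per-session cache so only the new tokens cross the API boundary each turn.

```python
def tokenize(text):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return text.split()

class StackedPromptSession:
    """Every turn, the whole conversation is re-encoded and passed in,
    so the context grows with each exchange."""
    def __init__(self):
        self.history = []

    def tokens_sent(self, user_msg):
        self.history.append(user_msg)
        return tokenize(" ".join(self.history))

class CachedSession:
    """The engine keeps a KV cache for the session; only the new
    tokens are sent, and the cache supplies the earlier context."""
    def __init__(self):
        self.cached = 0  # tokens already held in the engine's cache

    def tokens_sent(self, user_msg):
        new = tokenize(user_msg)
        self.cached += len(new)
        return new

stacked, cached = StackedPromptSession(), CachedSession()
for turn in ["hello there", "what is exo", "explain the bug"]:
    s = stacked.tokens_sent(turn)
    c = cached.tokens_sent(turn)
# On the third turn the stacked session re-sends all 8 tokens of
# history, while the cached session sends only the 3 new ones.
```

With the stacked approach, any mismatch in how each engine rebuilds state from the re-sent prompt (or any cap on context length) shows up as exactly this kind of second-turn nonsense; a shared per-session cache sidesteps both.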
Agree, that seems like a good strategy to fix this and probably a bunch of other unknown bugs.
I found this doesn't happen with JIT off, so I'm going to try to narrow it down now. I think ~~it's likely~~ it could be a bug in tinygrad; I've found a similar bug in it before. Edit: I couldn't find anything obvious in tinygrad. I ran into issues running old versions of exo, so I haven't been able to do a proper bisect.
Stale for 1.0