broken tinygrad responses
I'm getting bad responses from tinygrad (except for the first one). Running on an M2 Mac Mini, I've hardcoded the inference_engine_name value to use tinygrad instead of MLX. I'm seeing the same thing happen with 3B too; I haven't tried other models. Will update if I find more info.
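For reference, this is roughly how I forced the engine selection (illustrative only; the actual variable lives in exo's engine-selection code and the exact location may differ):

```python
# Hypothetical sketch: overriding the auto-detected inference engine.
# On Apple silicon exo would normally pick "mlx"; hardcoding the name
# forces the tinygrad backend instead.
inference_engine_name = "tinygrad"  # was auto-detected as "mlx" on the M2
```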
MLX output:
Tinygrad output:
Why is this a "bad" output?
tinygrad and MLX are using slightly different models. It's one of the magical things about exo: different models are interoperable.
In the tinygrad screenshot it hasn't answered what I asked in the second prompt at all. Try having a conversation with 1B using MLX and then with tinygrad: I'm just getting nonsense and/or irrelevant responses after the first one, as if it's not receiving the prompts correctly.

Didn't realize the models were different. If this is just down to the models being different then it's obviously not an issue, but it does feel like more than a slight difference.
The example you gave has context of the previous part of the conversation. Can you give an example where it doesn't have context of the previous part?
The example you gave seems totally reasonable.
ah this might explain more
MLX:
Tinygrad:
@AlexCheema Yeah, this looks like a context bug to me, and it makes an argument for spending some time reconciling the different caching methods between these implementations and fully utilizing the cache for inference over a chat session, rather than stacking prompts within the API. This would also fix some bugs that happen during long sessions from passing too large a context into the inference model.
This kind of change will dovetail well with some other inference-engine compatibility generalizations, so it seems like a good thing to do.
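To make the "stacking prompts" point concrete, here's a toy sketch (not exo's actual API; the class names and tokenizer are made up) contrasting the two strategies: re-sending the full chat history every turn versus keeping a per-session cache so only the new tokens cross the API boundary each turn.

```python
def tokenize(text):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return text.split()

class StackedPromptSession:
    """Every turn, the whole conversation is re-encoded and passed in,
    so the context grows with each exchange."""
    def __init__(self):
        self.history = []

    def tokens_sent(self, user_msg):
        self.history.append(user_msg)
        return tokenize(" ".join(self.history))

class CachedSession:
    """The engine keeps a KV cache for the session; only the new
    tokens are sent, and the cache supplies the earlier context."""
    def __init__(self):
        self.cached = 0  # tokens already held in the engine's cache

    def tokens_sent(self, user_msg):
        new = tokenize(user_msg)
        self.cached += len(new)
        return new

stacked, cached = StackedPromptSession(), CachedSession()
for turn in ["hello there", "what is exo", "explain the bug"]:
    s = stacked.tokens_sent(turn)
    c = cached.tokens_sent(turn)
# On the third turn the stacked session re-sends all 8 tokens of
# history, while the cached session sends only the 3 new ones.
```

With the stacked approach, any mismatch in how each engine rebuilds state from the re-sent prompt (or any cap on context length) shows up as exactly this kind of second-turn nonsense; a shared per-session cache sidesteps both.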
Agree, that seems like a good strategy to fix this and probably a bunch of other unknown bugs.
I found this doesn't happen with JIT off, so I'm going to try to narrow it down now. I think ~~it's likely~~ it could be a bug in tinygrad; I've found a similar bug in it before. Edit: I couldn't find anything obvious in tinygrad. I ran into issues running old versions of exo, so I haven't been able to do a proper bisect.
Stale for 1.0