mlx_lm with llama-3.3-70b-instruct works like a base model in some cases.
My prompt looks like this:
Provide a summary as well as a detail analysis of the following:
Then the content to summarize follows.
However, if I run the following,
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit --max-kv-size 30000 --max-tokens 2000 --temp 0.0 --top-p 0.9 --seed 1000 --system 'You are a helpful assistant' --prompt -<./28000.txt
I only get this:
"I hope this information has been helpful. If you have any further questions or need more information, please don't hesitate to ask."
I'm attaching the full prompt below.
Thanks!
That's odd. Does it still fail if you don't specify --max-kv-size?
Is it just for that prompt or do you observe the same for shorter prompts? What about other Llama models or just the 70B?
I discovered this when I created a script to test speed with various prompt lengths.
What's interesting is that when feeding 28k, 30k, or 32k token prompts, it has the same problem: it only generates 27 tokens, always the same phrase. With prompts of 26k tokens or fewer, the problem didn't occur.
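The sweep is essentially something like this (a simplified sketch, not the exact script; each <n>.txt is a pre-generated prompt of roughly n tokens):

import subprocess

# Same invocation as above, repeated across prompt lengths.
for n in [24000, 26000, 28000, 30000, 32000]:
    print(f"--- {n}-token prompt ---")
    with open(f"{n}.txt") as f:
        subprocess.run(
            ["mlx_lm.generate",
             "--model", "mlx-community/Llama-3.3-70B-Instruct-4bit",
             "--max-kv-size", "30000",
             "--max-tokens", "2000",
             "--temp", "0.0",
             "--top-p", "0.9",
             "--seed", "1000",
             "--system", "You are a helpful assistant",
             "--prompt", "-"],
            stdin=f,
        )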
I suspect something might be going on with long context? It's like the opposite of the issue I filed about the looping problem with long context and llama-3.1-8b-instruct-4bit.
I'll test some more with what you suggested, and report back.
As I mentioned in the other thread, there was a bug with these Llama models causing duplicate BOS tokens that is now fixed. I wonder if that was impacting the results you see here?
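If you want to check on your end, look at the first few ids of the encoded prompt; two leading 128000s (<|begin_of_text|>) would mean the BOS got duplicated. A rough sketch with the Hugging Face tokenizer (not exactly what mlx_lm does internally, just to show the mechanism):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlx-community/Llama-3.3-70B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": open("28000.txt").read()},
]

# The chat template already prepends <|begin_of_text|> (id 128000) to the string ...
text = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# ... so encoding that string again with special tokens typically adds a second BOS.
ids = tok.encode(text)
print(ids[:3])  # [128000, 128000, ...] means the BOS is duplicated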