Mateusz Piotrowski
I think the problem you're seeing is caused by the prompt formatting and not by implementation differences. I compared the TL model to the HF model and while there are some small...
Thanks for clarifying! From the issue description, I assumed that the problem is the generated tokens being in Chinese and that behavior is the same for the HF implementation for...
@yeutong the issue is caused by a different attention scale being used (~14.96 vs 16). The HF implementation also disables the attention logits soft capping for inference, but that is less...
@microsoft-github-policy-service agree company="Anthropic"