LongICLBench
Unexpected low performance of mistral-v0.1
Hi!
I tried to run BANKING77 with mistral-7b-v0.1, but I got near-zero performance and have no idea how this could happen. Have you tested mistral-7b-v0.1? If not, have you encountered similar issues?
Thanks!
v0.1 only supports a context length of 8K tokens, which leads to low performance. We use v0.2 because it supports 32K tokens.
The first three subsets of BANKING77 are below 8K tokens, so on those subsets I expected mistral-7b-v0.1 to perform worse than v0.2 but at least comparably to other models. However, I got (nearly) zero performance.
Using the same code and replacing mistral-7b-v0.1 with mistral-7b-v0.2, I get results similar to those reported in the paper.
I'm not sure what I did wrong 🙃
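For anyone debugging the same thing, one quick sanity check is to confirm that the prompts really fit inside the model's context window once room for the generated label is reserved. Below is a minimal sketch of that check. It is not code from the LongICLBench repo: the helper name `find_overflowing_prompts`, the `reserve_for_output` parameter, and the whitespace tokenizer are all hypothetical stand-ins used for illustration.

```python
from typing import Callable, List, Sequence


def find_overflowing_prompts(
    prompts: Sequence[str],
    tokenize: Callable[[str], Sequence],
    context_window: int,
    reserve_for_output: int = 32,
) -> List[int]:
    """Return the indices of prompts that do not fit in the context window.

    `tokenize` is any callable mapping a string to a sequence of tokens.
    `reserve_for_output` is an assumed number of tokens kept free for the
    generated answer; adjust it to the actual generation length.
    """
    budget = context_window - reserve_for_output
    return [i for i, p in enumerate(prompts) if len(tokenize(p)) > budget]


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer for the demo only; a real check should use the
    # tokenizer of the model under test, since token counts differ a lot.
    return text.split()


prompts = ["a b c", "a b c d e f g h"]
too_long = find_overflowing_prompts(
    prompts, whitespace_tokenize, context_window=8, reserve_for_output=2
)
print(too_long)
```

With a Hugging Face tokenizer, the `tokenize` argument could be something like `lambda s: tokenizer(s)["input_ids"]`. If no index comes back for the short subsets, length is probably not the cause and the prompt format is the next thing to look at.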
We observe a similar trend for Gemma, which should support 8K tokens but actually fails on tasks shorter than 8K tokens.
Got it. It seems like a prompting issue. I will try modifying the prompts to see if it makes any difference.
Thank you for your clarification!