exllama
performance & quality drop (3x) when setting top_p = 1.0 vs. 0.99
Generation with top_p = 1.0 is about 3x slower than with top_p at (seemingly?) any other value; to reproduce, compare 0.99 and 1.0.
I've seen this bug with both the exllama GUI and text-gen-webui. A sample of the performance:
top_p=1.0: Output generated in 27.11 seconds (12.10 tokens/s, 328 tokens, context 49, seed 444648765)
top_p=0.99: Output generated in 11.62 seconds (31.83 tokens/s, 370 tokens, context 49, seed 1821729798)
top_p=1.0 also seems to produce drastically poorer results sometimes... I'm not sure what might be happening there, but here is a looney example from a normally very good model (TheBloke_Wizard-Vicuna-30B-Uncensored-GPTQ):
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
USER: A recipe for easy chocolate cake.
ASSISTANT: Ingredients:
- 1 cup all-purpose flour
- 1/2 cup granulated sugar
- 3/4 cup unsweetened cocoa powder
- 1 tsp baking soda
- 1 tsp salt
- 1/2 cup vegetable oil
- 50 grams canola oil
- Roughly chopped dark chocolate (at least 6 oz)
- 2 large eggs
- 1 tsp vanilla extract
- 1 cup water
Instructions:
1. Preheat your oven to 350°F (180°C). Grease a 9x13 inch pan with cooking spray or butter and flour.
2. In a medium bowl, mix together the flour, sugar, cocoa powder, baking soda, and salt until well combined. Set aside.
3. In another bowl, whisk together the oil, vegetable oil, eggs, and vanilla extract until creamy. Add wet ingredients into dry ingredients and stir until it’s just moistened. Gently fold rough chopped target="_blank" data-qa="true">
4. Pour batter in greased pan and smooth out top with a spatula or back of most can dipper fan site memory foam side travel marketing pinboard plus ideas epochtransduction credit peaceful solve ways relationship lose bond light building car west rightoffscreen eXtreme target two links and years pondchildhood free online festival style throwMore nobatsevidenc rebate some use joke birdhit sku
In step 4 it just loses it at the end, but it's bad at the end of step 3 also.
This is a typical output (using the same parameters, but with top_p=0.99):
USER: A recipe for easy chocolate cake.
ASSISTANT: Ingredients:
- 1 cup all-purpose flour
- 3/4 cup granulated sugar
- 2 teaspoons baking powder
- 1/2 teaspoon salt
- 1/2 cup unsweetened cocoa powder
- 1 egg
- 1/2 cup milk
- 1/2 cup vegetable oil
- 1 tablespoon vanilla extract
Instructions:
1. Preheat oven to 350°F (175°C). Grease a 9-inch square pan with cooking spray or butter.
2. In a large bowl, whisk together flour, sugar, baking powder, and salt until well combined. Add in cocoa powder and mix again.
3. In another small bowl, beat the egg then add milk, vegetable oil, and vanilla extract. Mix well.
4. Pour wet ingredients into dry ingredients and stir until just combined. Do not overmix.
5. Pour batter into prepared pan and smooth out top with a spatula.
6. Bake for 35-40 minutes or until a toothpick inserted into center comes out clean. Let cool completely before frosting. Enjoy!
I'm not sure if this is really an issue or not. The performance difference is likely down to the way the sampler is optimized for reasonable values of top-p, under the assumption that only a few tokens make up the vast majority of the cumulative probability. For top-p = 0.9 you'll usually only have to consider 1-10 tokens or thereabouts, so it doesn't make sense to create a mask over the entire 32000-token vocabulary when you can just pick out the 10 or so tokens from the beginning of the sorted set.
The difference in your case would be that it only takes maybe 50 iterations or so for the sum of probabilities to exceed 0.99, while by definition it takes 32000 iterations to reach a sum of 1. It ends up being a rather expensive no-op. Top-p = 0 (to disable it altogether) would do the same thing, only much faster.
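To make the mechanics concrete, here is a rough Python sketch of that kind of top-p loop (my own illustration, not exllama's actual CUDA sampler): with top-p below 1 the walk over the sorted tokens exits after a handful of entries, while with top-p = 1.0 it has to visit essentially the whole vocabulary just to keep everything.

```python
import torch

def top_p_sample(logits: torch.Tensor, top_p: float) -> int:
    """Toy top-p sampler: keep the smallest prefix of the sorted distribution
    whose cumulative probability reaches top_p, then sample from that prefix."""
    probs = torch.softmax(logits.float(), dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)

    cumulative, cutoff = 0.0, 0
    for p in sorted_probs.tolist():      # walk tokens from most to least likely
        cumulative += p
        cutoff += 1
        if cumulative >= top_p:          # top_p = 0.99 typically stops within a few
            break                        # dozen tokens; top_p = 1.0 only stops at
                                         # the very last of the ~32000 entries

    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return int(sorted_ids[torch.multinomial(kept, 1)].item())
```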
As for the difference in the output: with top-p = 0.99, the sampler disregards 1% of the range of the CDF, not the domain. All the stray Korean characters, HTML tags, poop emojis, and so on reside in the long tail of the distribution that gets chopped off at top-p = 0.99 but included at top-p = 1. You're essentially asking the generator to pick a "bad" token 1% of the time. And once that inevitably happens, you're asking it to complete a sequence that contains whatever that bad token was, so the whole process starts to diverge.
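To put rough numbers on that range-vs-domain point, here is a back-of-the-envelope illustration using a made-up Zipf-like distribution over a 32000-token vocabulary (an assumption for the sake of the example, not a real model's output): a few dozen tokens can already cover 99% of the probability mass, while the last 1% of the CDF is spread across tens of thousands of tail tokens.

```python
import torch

vocab_size = 32000
# Toy Zipf-like stand-in for a next-token distribution (assumed, not measured)
probs = 1.0 / torch.arange(1, vocab_size + 1, dtype=torch.float64) ** 2
probs /= probs.sum()

cdf = torch.cumsum(probs, dim=0)
kept = int((cdf < 0.99).sum().item()) + 1   # tokens surviving top-p = 0.99

print(f"tokens covering 99% of the mass: {kept}")
print(f"tail tokens added back by top-p = 1.0: {vocab_size - kept}")
```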
Interesting... The reason I reported the quality drop is that I don't see it with AutoGPTQ (otherwise the same settings) at top_p=1.0 with the same model. I usually test using essentially the 'debug deterministic' settings: temp=1.0, top_p=1.0, top_k=0 or 1, typical_p=1.0.
Different implementations are going to perform differently in extreme cases. You could also turn up the temperature and magnify any differences that way.
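As a toy illustration of the temperature effect (made-up logits, not from any real model): dividing the logits by a higher temperature flattens the softmax, so more probability mass lands in the tail, which is exactly where implementations are most likely to disagree.

```python
import torch

logits = torch.tensor([8.0, 6.5, 5.0, 2.0, 0.0])   # assumed toy next-token logits

for temperature in (0.7, 1.0, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    # Higher temperature flattens the distribution and shifts mass toward the
    # tail, so small implementation differences there show up more often.
    print(temperature, [round(p, 4) for p in probs.tolist()])
```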
But chasing perfectly deterministic behavior with CUDA gets expensive, since you can't use the most efficient parallel algorithms in that case. And language models are designed to deal with some amount of noise anyway. Dropout during training is essentially just a whole lot of strong, artificial noise added to the network so it learns to be resilient and avoid tipping points where small deviations can have large cascading effects.
And whatever small amount of noise makes it through the model therefore shouldn't change the overall shape of the output, though it might make it a little fuzzier. Fuzziness doesn't matter when you're gating the output anyway, either with top-p or top-k or some other sampling filter, because then you're left with only the part of the distribution that has a high signal-to-noise ratio.
The tail end, where the noise can overshadow the signal, is not what you want to rely on in any case. That's where the model basically goes, "well, if you didn't like any of the two hundred tokens I suggested as the most likely continuation, maybe the next one is, oh, I don't know, 'ham'? Does that make sense in a chocolate cake recipe? Apparently you don't care!"
Those settings also look a little strange if you're going for determinism. With a top-k of 1, the model is as deterministic as it can be (given the non-associative floating-point operations mentioned earlier), and none of the other parameters contribute anything at all. With top-k at 0 and the others at 1, none of the filters do anything, and that's arguably the least deterministic form of sampling.
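For what it's worth, here's a minimal sketch (my own, not text-gen-webui's actual sampling code) of why those two presets sit at opposite extremes: top-k = 1 collapses to greedy argmax decoding, while top-k = 0 with top-p and typical-p at 1.0 leaves the full distribution untouched.

```python
import torch

def sample(logits: torch.Tensor, top_k: int) -> int:
    """Sketch of the two 'debug deterministic' presets: top_p and typical_p are
    both 1.0 in either case, which makes them no-ops and leaves top_k in charge."""
    if top_k == 1:
        # Greedy decoding: always the single most probable token; temperature,
        # top_p and typical_p have nothing left to choose between.
        return int(torch.argmax(logits).item())

    # top_k = 0 disables that filter too, so with every filter off we sample
    # from the full softmax distribution -- the least deterministic case.
    probs = torch.softmax(logits.float(), dim=-1)
    return int(torch.multinomial(probs, 1).item())
```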
Thanks for the detailed explanation, it's so much clearer for me now! My headache just went away, lol.