Use use_cache=True in config?
I run dolly-v2-3b on a single 32G V100 GPU and find it runs relatively slowly...
Should take a few seconds. How are you generating? did you see https://github.com/databrickslabs/dolly#generating-on-other-instances for example?
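For reference, the pattern from that section looks roughly like this (a sketch; on a V100 you'd want float16 or 8-bit loading since bfloat16 isn't supported there):

```python
import torch
from transformers import pipeline

# Load the pipeline once up front; bfloat16 isn't available on V100,
# so float16 is the usual choice there.
generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```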
I use model.generate() to generate sentences, just like with other transformer models. I find it takes about 2 seconds for dolly-v2-3b to generate a sentence when the max generated length is set to 200, but pythia-2.7b does the same thing very quickly. I wonder what makes the difference between the two models, because Dolly was built on top of Pythia.
Hm, shouldn't be any real difference there. Are you sure the settings are fairly equivalent and the output length is the same (not just the max)?
Yeah, I'm sure. I use the same code to generate for both...
What do you mean when you say "and output length is the same (not just max)"? Is there any other output length?
outputs = generator.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    min_length=min_length,
    max_length=max_length,
    do_sample=True,
    top_k=None if top_k < 0 else top_k,
    top_p=None if top_p < 0 else top_p,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
This is the code I use to generate.
I just mean, how much output are you getting from each? The run time is proportional to the output size. You can't directly control it, but it affects the comparison. I'm not sure how you load the models either, but they should be the same model, really, just different weights. We suggest you use it like this: https://github.com/databrickslabs/dolly#getting-started-with-response-generation
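One quick sanity check, using the variables from your snippet above: count how many tokens each model actually produced, since generation time scales with that.

```python
# For decoder-only models, generate() returns the prompt plus the new tokens,
# so the amount of actually generated output is the difference in length.
generated_tokens = outputs.shape[1] - input_ids.shape[1]
print(f"generated {generated_tokens} new tokens")
```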
I use the same code, and all the parameters passed to the generate method are kept the same.
I have figured out the issue. The difference is in the config file on Hugging Face: you set use_cache=False for Dolly.
Oh yeah, you don't want to measure the time to download or load the model here. Make sure it's already loaded, then time the generation.
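Something like this isolates the generation time (a rough sketch; it assumes the model and tokenizer are already loaded, and the prompt is just illustrative):

```python
import time

# Assumes `model` and `tokenizer` are already loaded, so loading time is excluded.
inputs = tokenizer("Explain gradient checkpointing.", return_tensors="pt").to(model.device)

# A short warm-up run so one-time CUDA setup doesn't skew the measurement.
model.generate(**inputs, max_new_tokens=8)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
```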
Yeah, I think you should change the Hugging Face config file: https://huggingface.co/databricks/dolly-v2-3b/blob/main/config.json
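In the meantime it can be worked around locally without waiting for the hosted config to change (a sketch, assuming a standard transformers load):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

# Config fields can be overridden via from_pretrained kwargs, so this enables
# the KV cache locally regardless of what the hosted config.json says.
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    use_cache=True,
)

# It can also be passed per call:
# outputs = model.generate(input_ids, max_length=200, use_cache=True)
```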
@matthayes I think this is a good point - the Pythia models have use_cache=True. https://huggingface.co/databricks/dolly-v2-3b/blob/main/config.json#L29 I don't know a lot about this, but it seems like we would want to do the same.
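A quick way to confirm the difference without reading the JSON by hand (a sketch; the Pythia checkpoint name here is just an example, substitute whichever one you're comparing against):

```python
from transformers import AutoConfig

for name in ("databricks/dolly-v2-3b", "EleutherAI/pythia-2.8b"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, "use_cache =", cfg.use_cache)
```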
Matt notes that this was probably set to false because gradient checkpointing requires it to be off during training. But we can just edit the resulting model config for now to set use_cache to true. I did that, good catch! https://huggingface.co/databricks/dolly-v2-12b/blob/main/config.json
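For anyone curious, the usual pattern behind that looks roughly like this (a sketch, not the actual training code; the base model name is just illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")

# Gradient checkpointing recomputes activations during the backward pass and
# doesn't work with the KV cache, so the cache gets disabled for training.
model.gradient_checkpointing_enable()
model.config.use_cache = False

# ... fine-tuning would happen here ...

# Re-enable the cache before saving/publishing so inference doesn't recompute
# the entire past context for every new token.
model.config.use_cache = True
model.save_pretrained("./dolly-finetuned")
```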