
Use use_cache=True config?

Open GregxmHu opened this issue 2 years ago • 12 comments

I run dolly-v2-3b on a single 32GB V100 GPU and find that it generates relatively slowly...

GregxmHu avatar Apr 24 '23 09:04 GregxmHu

Should take a few seconds. How are you generating? Did you see https://github.com/databrickslabs/dolly#generating-on-other-instances for example?

srowen avatar Apr 24 '23 12:04 srowen

I use model.generate() to generate sentences, just like with other transformer models. I find it takes about 2 seconds for dolly-v2-3b to generate a sentence when the max generated length is set to 200, but pythia-2.7b does the same thing very quickly. I wonder what makes the difference between the two models, because dolly was built on top of pythia.

GregxmHu avatar Apr 24 '23 12:04 GregxmHu

Hm, there shouldn't be any real difference there. Are you sure the settings are fairly equivalent and the output length is the same (not just the max)?

srowen avatar Apr 24 '23 12:04 srowen

> Hm, there shouldn't be any real difference there. Are you sure the settings are fairly equivalent and the output length is the same (not just the max)?

Yeah, I'm sure I use the same code to generate...

GregxmHu avatar Apr 24 '23 12:04 GregxmHu

> Hm, there shouldn't be any real difference there. Are you sure the settings are fairly equivalent and the output length is the same (not just the max)?

What do you mean when you say "and output length is the same (not just max)"? Is there any other output length?

GregxmHu avatar Apr 24 '23 12:04 GregxmHu

> Hm, there shouldn't be any real difference there. Are you sure the settings are fairly equivalent and the output length is the same (not just the max)?

outputs = generator.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    min_length=min_length,
    max_length=max_length,
    do_sample=True,
    top_k=None if top_k < 0 else top_k,
    top_p=None if top_p < 0 else top_p,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

This is the code I use to generate.

GregxmHu avatar Apr 24 '23 12:04 GregxmHu
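One way to check the actual output length (rather than just the max) is to count the generated tokens; a minimal sketch, reusing the outputs and input_ids tensors from the snippet above:

# runtime scales with the number of tokens actually generated, not with max_length
new_tokens = outputs.shape[1] - input_ids.shape[1]
print(f"generated {new_tokens} new tokens beyond the prompt")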

I just mean, how much output are you getting from each? The run time is proportional to the output size. You can't directly control it, but it affects the comparison. I'm not sure how you load the models either, but they should be the same model, really, just different weights. We suggest you use it like this: https://github.com/databrickslabs/dolly#getting-started-with-response-generation

srowen avatar Apr 24 '23 12:04 srowen
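For reference, the response-generation path the README suggests is the transformers pipeline; a rough sketch using the 3b checkpoint (the README shows the 12b model, and on a V100 you may need torch.float16 instead of bfloat16):

import torch
from transformers import pipeline

# load once up front, so model download/load time is not counted in generation time
generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res)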

> I just mean, how much output are you getting from each? The run time is proportional to the output size. You can't directly control it, but it affects the comparison. I'm not sure how you load the models either, but they should be the same model, really, just different weights. We suggest you use it like this: https://github.com/databrickslabs/dolly#getting-started-with-response-generation

I use the same code, and all the parameters passed to the generate method are kept the same.

GregxmHu avatar Apr 24 '23 12:04 GregxmHu

I have figured out the issue. The difference is in the config file on the Hugging Face Hub: you set use_cache=False for dolly.

GregxmHu avatar Apr 24 '23 13:04 GregxmHu
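Until the Hub config changes, the setting can also be overridden locally; a minimal sketch (a from_pretrained kwarg that matches a config attribute overrides the value in config.json, and generate accepts use_cache as well):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")
# use_cache=True overrides the "use_cache": false shipped in the Hub config
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-3b", use_cache=True)

# alternatively, override it per call:
# outputs = model.generate(**inputs, use_cache=True)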

Oh yeah, you don't want to measure the time to download or load the model here. Make sure it's already loaded, then time the generation.

srowen avatar Apr 24 '23 13:04 srowen
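In other words, time only the generate call once the model is already on the GPU; a rough sketch reusing the variable names from the earlier snippet (generator, input_ids, attention_mask, tokenizer):

import time
import torch

torch.cuda.synchronize()  # make sure prior GPU work is done before starting the clock
start = time.time()
outputs = generator.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=200,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
torch.cuda.synchronize()
elapsed = time.time() - start
new_tokens = outputs.shape[1] - input_ids.shape[1]
print(f"{elapsed:.2f}s for {new_tokens} new tokens ({new_tokens / elapsed:.1f} tok/s)")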

> Oh yeah, you don't want to measure the time to download or load the model here. Make sure it's already loaded, then time the generation.

Yeah, I think you'd better change the Hugging Face config file: https://huggingface.co/databricks/dolly-v2-3b/blob/main/config.json

GregxmHu avatar Apr 24 '23 13:04 GregxmHu

@matthayes I think this is a good point - the pythia models have use_cache=True: https://huggingface.co/databricks/dolly-v2-3b/blob/main/config.json#L29 I don't know a lot about this, but it seems like we would want to do the same.

srowen avatar Apr 24 '23 14:04 srowen

Matt notes that this was probably set to false because gradient checkpointing requires it to be off during training. But we can just edit the resulting model config for now to set use_cache to true. I did that, good catch! https://huggingface.co/databricks/dolly-v2-12b/blob/main/config.json

srowen avatar Apr 25 '23 16:04 srowen
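A quick way to confirm the updated setting is picked up (a small sketch; AutoConfig fetches only config.json, not the weights):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("databricks/dolly-v2-12b")
print(config.use_cache)  # should now print True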