Use use_cache=True in config?
I run dolly-v2-3b on a single 32G V100 GPU and find it runs relatively slowly...
Should take a few seconds. How are you generating? did you see https://github.com/databrickslabs/dolly#generating-on-other-instances for example?
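For reference, the pattern from that section looks roughly like this (a sketch; on a V100 you'd want float16 or 8-bit loading since bfloat16 isn't supported there):

```python
import torch
from transformers import pipeline

# Load the pipeline once up front; bfloat16 isn't available on V100,
# so float16 is the usual choice there.
generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)

res = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(res[0]["generated_text"])
```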
I use model.generate() to generate sentences, just like with other transformer models. I find it takes about 2 seconds for dolly-v2-3b to generate a sentence when the max generated length is set to 200, but pythia-2.7b does the same thing very quickly. I wonder what makes the difference between the two models, because Dolly was built on top of Pythia.
Hm, shouldn't be any real difference there. Are you sure the settings are fairly equivalent and the output length is the same (not just the max)?
Yeah, I'm sure. I use the same code to generate for both...
What do you mean when you say "and output length is the same (not just max)"? Is there any other output length?
outputs = generator.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    min_length=min_length,
    max_length=max_length,
    do_sample=True,
    top_k=None if top_k < 0 else top_k,
    top_p=None if top_p < 0 else top_p,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
This is the code I use to generate.
I just mean, how much output are you getting from each? The run time is proportional to the output size. You can't directly control it, but it affects the comparison. I'm not sure how you load the models either, but they should be the same model, really, just different weights. We suggest you use it like this: https://github.com/databrickslabs/dolly#getting-started-with-response-generation
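One quick sanity check, using the variables from your snippet above: count how many tokens each model actually produced, since generation time scales with that.

```python
# For decoder-only models, generate() returns the prompt plus the new tokens,
# so the amount of actually generated output is the difference in length.
generated_tokens = outputs.shape[1] - input_ids.shape[1]
print(f"generated {generated_tokens} new tokens")
```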
I use the same code, and all the parameters passed to the generate method are kept the same.
I have figured out the issue. The difference is in the config file on Hugging Face: you set use_cache=False for Dolly.
Oh yeah, you don't want to measure the time to download or load the model here. Make sure it's already loaded, then time the generation.
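Something like this isolates the generation time (a rough sketch; it assumes the model and tokenizer are already loaded, and the prompt is just illustrative):

```python
import time

# Assumes `model` and `tokenizer` are already loaded, so loading time is excluded.
inputs = tokenizer("Explain gradient checkpointing.", return_tensors="pt").to(model.device)

# A short warm-up run so one-time CUDA setup doesn't skew the measurement.
model.generate(**inputs, max_new_tokens=8)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
```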
Yeah, I think you should change the Hugging Face config file: https://huggingface.co/databricks/dolly-v2-3b/blob/main/config.json
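In the meantime it can be worked around locally without waiting for the hosted config to change (a sketch, assuming a standard transformers load):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b")

# Config fields can be overridden via from_pretrained kwargs, so this enables
# the KV cache locally regardless of what the hosted config.json says.
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b",
    torch_dtype=torch.float16,
    use_cache=True,
)

# It can also be passed per call:
# outputs = model.generate(input_ids, max_length=200, use_cache=True)
```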
@matthayes I think this is a good point - the Pythia models have use_cache=True. https://huggingface.co/databricks/dolly-v2-3b/blob/main/config.json#L29 I don't know a lot about this, but it seems like we would want to do the same.
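A quick way to confirm the difference without reading the JSON by hand (a sketch; the Pythia checkpoint name here is just an example, substitute whichever one you're comparing against):

```python
from transformers import AutoConfig

for name in ("databricks/dolly-v2-3b", "EleutherAI/pythia-2.8b"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, "use_cache =", cfg.use_cache)
```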
Matt notes that this was probably set to false because gradient checkpointing requires it to be off during training. But we can just edit the resulting model config for now to set use_cache to true. I did that, good catch! https://huggingface.co/databricks/dolly-v2-12b/blob/main/config.json
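For anyone curious, the usual pattern behind that looks roughly like this (a sketch, not the actual training code; the base model name is just illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")

# Gradient checkpointing recomputes activations during the backward pass and
# doesn't work with the KV cache, so the cache gets disabled for training.
model.gradient_checkpointing_enable()
model.config.use_cache = False

# ... fine-tuning would happen here ...

# Re-enable the cache before saving/publishing so inference doesn't recompute
# the entire past context for every new token.
model.config.use_cache = True
model.save_pretrained("./dolly-finetuned")
```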