docTTTTTquery
Getting the scores of each query
Hello! Thanks for this repo, great job. I'm using the docTTTTTquery model in the way described in the section "Predicting Queries from Passages: T5 Inference with PyTorch" (using the transformers library from Hugging Face). I couldn't help but notice that every time I run the code, different queries are generated. Therefore, I have two questions:
- Is there a way to prevent that from happening, i.e., to make the model deterministic?
- By digging a little into the documentation of model.generate, we should normally be able to get the scores of each query by setting the argument output_scores to True. However, even when I do that, I still don't get the scores. There is also the argument return_dict_in_generate; if I set it to True, I get the following dictionary (stored in output):
As you can see, the inf values make it confusing to get any useful scores out of the tensor. So I'm wondering whether the way I proceeded was wrong, or whether we just can't get the scores in the case of this model?
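For reference, this is roughly how I'm calling generate (a minimal sketch; tokenizer, model, device, and doc_text are set up as in the README, and the sampling arguments mirror the README example):

# tokenizer, model, device, and doc_text as in the README example.
input_ids = tokenizer.encode(doc_text, return_tensors='pt').to(device)
output = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_k=10,
    num_return_sequences=3,
    output_scores=True,
    return_dict_in_generate=True)
# output.sequences holds the generated queries; output.scores is the tuple
# of per-step scores where the inf values show up.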
Edit: I've also noticed that using bad_words_ids in model.generate to avoid generating queries containing some undesired words is not working, even though it should be possible according to the Hugging Face documentation.
Thanks again for this repo.
Is there a way to prevent that from happening, i.e., to make the model deterministic?
Yes. Either use greedy decoding:
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=False)  # do_sample=False disables sampling, so decoding is deterministic
Or fix the PyTorch seed before generating the queries:
# Seed both the CPU and CUDA RNGs so sampling is reproducible.
torch.manual_seed(123)
torch.cuda.manual_seed(123)
By digging a little into the documentation of model.generate, we should normally be able to get the scores of each query by setting the argument output_scores to True. However, even when I do that, I still don't get the scores.
I've tried "output_scores=True" and also got infs. Sorry, I'm not sure how it works either.
The .loss from the forward pass could be a proxy for the query score: a lower loss means a higher chance of generating the query. The only disadvantage is that you run the model twice (once to predict the queries, and once more to compute the loss):
doc_text = 'The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.'

input_ids = tokenizer.encode(doc_text, return_tensors='pt').to(device)
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_k=10,
    num_return_sequences=3)

for i in range(3):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    # Second forward pass with the generated query as labels;
    # the resulting loss serves as the query score (lower is better).
    loss = model(input_ids=input_ids, labels=outputs[i][None, :]).loss
    print(f'sample {i + 1}: query="{query}" loss={loss}')
The output is:
sample 1: query="the manhattan project the impact of technology" loss=9.280818939208984
sample 2: query="why was the manhattan project important" loss=9.904526710510254
sample 3: query="why was the manhattan project so successful in terms of research and development" loss=2.462723731994629
Thank you @rodrigonogueira4 for answering. I'll be trying all of your suggestions. Do you have any suggestions regarding the last problem I mentioned?
- I've also noticed that using bad_words_ids in model.generate to avoid generating queries containing some undesired words is not working, even though it should be possible according to the Hugging Face documentation.
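For concreteness, this is roughly what I tried (a sketch; the example words are just placeholders, and input_ids is built as before):

# Placeholder words standing in for the undesired terms.
bad_words = ['obliterated', 'innocent']
# bad_words_ids expects a list of token-id lists, one per banned word or phrase.
bad_words_ids = [tokenizer.encode(w, add_special_tokens=False) for w in bad_words]
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_k=10,
    num_return_sequences=3,
    bad_words_ids=bad_words_ids)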
Thanks again for your work.