PPLM
Where are the samples used for the automated evaluation?
Thanks for your reply. I have written a program to calculate perplexity using the Hugging Face Transformers interface, but I am not sure which samples are used for the perplexity calculation.
Are those samples in the [ human_anotation/pplm_labled_csvs ] directory?
Hi, @Guaguago Can you share how you calculate PPL?
@ehsan-soe hi
import math
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

def score(sent):
    # Perplexity of a sentence under GPT: exp of the average per-token NLL.
    indexed_tokens = tokenizer.encode(sent)
    tokens_tensor = torch.tensor([indexed_tokens])
    with torch.no_grad():
        outputs = model(tokens_tensor, labels=tokens_tensor)
    loss = outputs[0]
    return math.exp(loss.item())

sents = ['there is a book on the desk',
         'there is a plane on the desk',
         'there is a book in the desk']
print([score(s) for s in sents])
@ehsan-soe Do you know how I can use this code to get the perplexity scores reported in the paper?
@dathath Sorry, I can't find any generated samples in this repository. Can you give me the specific location, or some instructions on how I can use this code to reproduce the PPLM perplexity scores?
@Guaguago Thanks. Perplexity is usually calculated on the test set. However, the authors may have computed perplexity on the generated text, since there is no ground-truth target here. In that case, you would either need to generate samples yourself or the authors would need to provide the generated samples.
@ehsan-soe You can compute the perplexity of the generated text under another language model (GPT), which is what we do here.
@Guaguago human_annotation/pplm_labeled_csvs has the generated samples. You can read the CSVs into Python and then process the samples with GPT to compute perplexity.
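For instance, something along these lines (a rough sketch, not the exact evaluation code; the column indices below are an assumption, and the parse notebooks handle the real layout and the per-method split):

import glob
import pandas as pd

# Pool generated passages from the labeled CSVs and score them with the
# score() function posted above.
texts = []
for path in glob.glob('human_annotation/pplm_labeled_csvs/*.csv'):
    df = pd.read_csv(path)
    # Assumed layout: the first two columns of each row hold the two
    # generated passages shown to annotators.
    texts.extend(df.iloc[:, 0].astype(str).tolist())
    texts.extend(df.iloc[:, 1].astype(str).tolist())

ppls = [score(t) for t in texts]
print('mean PPL over %d samples: %.2f' % (len(ppls), sum(ppls) / len(ppls)))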
@dathath Thanks. Can you correct me if I am wrong? Perplexity is usually computed on the test set, i.e., using the NLL of the trained model on the target text, right? However, since here you don't update the weights of GPT-2 and you don't have ground-truth text, it doesn't make sense to follow the more conventional approach?
@ehsan-soe I see! Thank you!
@dathath Thank you! I also found that each item in the CSV files has two generated samples.
It seems that one sample is from PPLM and the other is from a baseline, and their order appears to be random according to the paper. So how do I separate the samples generated by the different models?
You can use the 'parse_*.ipynb' notebooks to process the CSVs. That should give you samples from different models separately.
@dathath Thank you very much, I will try it! Following your earlier suggestions, I have written two programs to compute PPL and Dist-n respectively, but my scores don't match the paper's, so I'd like to double-check three things:
- The samples I use to calculate PPL and Dist-n are extracted from the CSV files under the human_anotation/pplm_labled_csvs directory: for each CSV file, I concatenate the first two columns to get 360 + 360 = 720 samples per topic. Is this the same as the paper, i.e., did the paper also use these samples to calculate PPL and Dist-n?
- For Dist-n: which tokenizer does the paper use to calculate Dist-n? Is there any extra processing, such as removing stop words or punctuation?
- Does the paper report sentence-level or corpus-level Dist-n?
Are your scores in the same range as the paper?
- The 360 samples are for pairwise A/B testing from the ablation study -- it consists of 6 pairs, so if you take the 720 sequences you mention, you'll have four copies of each sample and you'll be combining different modes of generation. You can use the parse script to separate out the 60 samples per topic (for each type of generation) and then measure perplexity. As it is, you're computing the average perplexity over all generation methods, but it should be of the same order.
- The GPT-2 tokenizer from Hugging Face.
- It's at the corpus level for a given topic across all prefixes. We want a measure of the diversity of the sentences generated across different prefixes and for different samples given a specific attribute. E.g., the model can't just satisfy the attribute by generating " is very good" for every prefix, or when you sample repeatedly from a prefix.
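In code, the corpus-level computation amounts to roughly the following (a minimal sketch assuming the GPT-2 tokenizer mentioned above, not the exact evaluation script):

from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def corpus_dist_n(samples, n):
    # Distinct n-grams divided by total n-grams, pooled over all generations
    # for a single topic (across prefixes and across repeated samples).
    total, distinct = 0, set()
    for text in samples:
        tokens = gpt2_tokenizer.encode(text)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

# Dist-1/2/3 for one topic's pooled generations, e.g.:
# print([corpus_dist_n(topic_samples, n) for n in (1, 2, 3)])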
@dathath Thank you so much! That was a really helpful clue; with it I got exactly the same Dist-1/2/3 scores as the paper. But for PPL, most of my scores are slightly higher than the paper's (by about 0-1.5), so I have some questions:
- Are the samples used to calculate the Dist scores and PPL the same?
- Which tokenizer does the paper use for PPL?
- Is anything missing from my PPL code below?
import math
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()

def score(sent):
    indexed_tokens = tokenizer.encode(sent)
    tokens_tensor = torch.tensor([indexed_tokens])
    with torch.no_grad():
        outputs = model(tokens_tensor, labels=tokens_tensor)
    loss = outputs[0]
    return math.exp(loss.item())

# Second variant: the tokenizer and model are passed in explicitly and the text
# is tokenized with tokenize() + convert_tokens_to_ids(); this definition
# overrides the one above.
def score(sentence, tokenizer, model):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    with torch.no_grad():
        loss = model(tensor_input, labels=tensor_input)[0]
    return math.exp(loss.item())

tokenizer_LM = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model_LM = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model_LM.eval()
Yes, the samples are the same. This is what we do, so your numbers should ideally match -- you could try matching the perplexities by topic/sentiment (see the appendix). The layer-norm layers used in the Hugging Face implementation of transformers seem to have changed a little bit between versions. I suspect this might be one possible cause of the discrepancy if you're using a recent version of "pytorch-transformers".
@dathath Is there any special processing of the "<|endoftext|>" token and the '\n' characters within a sentence when computing PPL? Should I drop them before calculating PPL?
@dathath After fixing some bugs and warnings, I found that my PPL results are now much lower than the paper's, while the Dist scores match almost exactly. I have tried different versions of transformers, but the results are unchanged. Could you correct me if there are any wrong steps in my process for separating out the samples for each model, as follows?
- Use the a_cat, b_cat = decode(order)[1:] clue in the parse script to assign each sample to the method that generated it.
- Construct a set() for each method to remove duplicates; this gives 60 samples for each of B/BR/BC/BCR (but for "religion" I found fewer than 60).
- For each sample, use sample.replace('<|endoftext|>', '') to drop the prefix, and use GPT to assign a score.
- Compute the mean of the 60 PPL scores as the final PPL for each of B/BR/BC/BCR for a given topic (a simplified sketch of these steps is below).
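A simplified sketch of steps 2-4 (score() is the GPT scoring function from my earlier comment; samples_by_method is a hypothetical dict mapping each of B/BR/BC/BCR to the samples parsed for one topic):

def mean_ppl(samples):
    # Remove duplicates, strip the '<|endoftext|>' prefix, score each sample
    # with GPT, and average the per-sample perplexities.
    cleaned = [s.replace('<|endoftext|>', '').strip() for s in set(samples)]
    scores = [score(s) for s in cleaned if s]
    return sum(scores) / len(scores)

# for method, samples in samples_by_method.items():
#     print(method, mean_ppl(samples))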
I need some help; I'd appreciate your reply.
Hi, Sorry for the late response, it's been hectic. Overall what you are doing seems reasonable to me.
- We don't actually drop '<|endoftext|>'.
Can you drop me (and Andrea) an email? The issue is a little hard to follow here. I will try to respond by the weekend.
@dathath @Andrea Hi, thank you, you are so nice! This is my email: [email protected] I need your help, please!
@dathath Hi