lm-scorer
Add BERTLMScore
Hi, I propose to add a BERT-based language modeling score.
I implemented the idea described in Effective Sentence Scoring Method Using BERT for Speech Recognition. In this paper, the authors report that they obtained better results using a bidirectional LM (compared to a unidirectional LM such as GPT-2).
The idea is to successively mask each token and retrieve the probability that BERT assigns to that token in the output.
For now, I pass all the masked sentences to BERT in one single batch, so it may cause memory problems for very long sentences.
Also, I did not manage to run the different tests with poetry as you asked. I am not familiar with automatic testing and poetry, so I will have to look into it.
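The masking idea described above can be sketched as follows. This is a minimal illustration of the masking step only, not the PR's code; `mask_variants` is a hypothetical helper name, and `103` is used as BERT's `[MASK]` token id in the toy example:

```python
def mask_variants(token_ids, mask_id):
    """Yield one copy of the sentence per token, with that token
    replaced by the mask id, plus the position that was masked."""
    for pos in range(len(token_ids)):
        masked = list(token_ids)
        masked[pos] = mask_id
        yield pos, masked

# Toy ids; each variant masks exactly one position.
variants = list(mask_variants([7, 8, 9], mask_id=103))
# variants == [(0, [103, 8, 9]), (1, [7, 103, 9]), (2, [7, 8, 103])]
```

Each variant is then fed to BERT, and the probability at the masked position gives the score for that token.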
> For now, I pass all the masked sentences to BERT in one single batch, so it may cause memory problems for very long sentences.

Can you please add a flag to allow setting up a batch size?
> Also, I did not manage to run the different tests with poetry as you asked.

There is a bug in the branch you are using (fixed here https://github.com/simonepri/lm-scorer/commit/df67f700a9c2706ed6e2babe67ee46ede81b31de) that won't allow you to run poetry. Please run the following on your branch to fetch the fix:

```shell
git fetch upstream
git rebase upstream/master
```
> I am not familiar with automatic testing and poetry, so I have to look into it.
All the tests run automatically every time you submit changes to this PR.
![](https://user-images.githubusercontent.com/3505087/80493863-dbc05400-895d-11ea-8d9a-f9b37acc8f9e.png)
If you click on Details and expand the line marked with a red cross, you will see exactly what the problem is.
OK, I succeeded in getting black and poetry to work. I also copied the test file you made for GPT-2, and it is interesting to see that it does not pass two tests:
I need to finish this project, but I do not have enough time. I need to finish this project, and I do not have enough time.
and
I can walk there. I can ambulate there.
The second one is more surprising.
I will add the flag option.
I added the possibility to feed the masked sentences to BERT in batches. I have also tried to add new test functions (to check that this does not change the score results); however, they are not taken into account when I run `poetry run task test`. (By the way, there is a small typo in the README: you ask to run `poetry run task testS`.) I will look into it tomorrow.
I would like to get #4 merged before reviewing this. Since we are changing the API, we will have to update this PR later anyway.
Things todo here before we can merge this:
- [ ] Refactor the code to match the new API
- [ ] Update the tests to match the ones implemented for GPT2
Hi Simone, I agree and will work on that next week.
Sure, no hurry! The comment was meant as a reminder for the future.
Hi @simonepri,
I have begun working on adapting BERTScore to the new API, and I have a question about the batching part. In BERTScore, you cannot directly use BatchedLMScorer because you don't batch by sentence but by masked sentence. For instance, suppose the user sets a batch size of 4 and inputs 2 sentences:
I like tennis
I like to eat potatoes
I will in fact feed 8 masked sentences to BERT:
MASK like tennis
I MASK tennis
I like MASK
MASK like to eat potatoes
I MASK to eat potatoes
I like MASK eat potatoes
I like to MASK potatoes
I like to eat MASK
And so my first batch will consist of:
MASK like tennis
I MASK tennis
I like MASK
MASK like to eat potatoes
and the second of:
I MASK to eat potatoes
I like MASK eat potatoes
I like to MASK potatoes
I like to eat MASK
So what I have done for now is the following:
- Encode every sentence provided by the user in one shot using BatchEncoding
- Expand the input ids to replace each sentence by its masked sentences
- Feed them to BERT in batches of batch_size
- Re-split the results by sentence.
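The flatten-then-resplit bookkeeping in the steps above can be sketched like this. The names are illustrative, not the actual PR code; the key point is tracking how many masked variants each sentence produced so the flat stream of results can be split back:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive chunks of at most batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

def resplit(flat_results, counts):
    """Split a flat result list back into per-sentence groups,
    where counts[i] is the number of masked variants of sentence i."""
    groups, start = [], 0
    for n in counts:
        groups.append(flat_results[start:start + n])
        start += n
    return groups
```

For the two example sentences above (3 and 5 tokens), `counts` would be `[3, 5]`: the 8 masked sentences are chunked into batches of 4 by `batched`, and the 8 resulting scores are regrouped per sentence by `resplit`.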
The code is not finished yet (not tested, not cleaned, ...), but just to give you an idea: https://gist.github.com/dldk-gael/2025f0584f781716697463c62a9779ee
My question is: in your opinion, what is the best way to handle this batching specificity? Should I simply override `_tokens_log_prob` from BatchedLMScorer?
What I would do is the following:
First, you put your code inside a function called `_tokens_log_prob_for_sentence`.
Then, inside `_tokens_log_prob_for_batch`, you simply do a for loop that calls `_tokens_log_prob_for_sentence` for every sentence in the batch.
In `_tokens_log_prob_for_sentence` you then generate the masked sentences and batch them according to the `batch_size` field (which is what you are already doing, I guess).
How does it sound?
PS: You don't want to tokenize all the sentences at once if you then process them independently. Doing so only introduces unnecessary padding. Just process each sentence one by one using `encode_plus` instead.
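A minimal skeleton of that structure might look as follows. Only the two method names come from the discussion; the class name and the model call are stand-ins (the stub returns a dummy log-probability per masked sentence), so this shows the control flow, not a working scorer:

```python
class BertScorerSketch:
    """Illustrative skeleton: per-sentence loop at the batch level,
    per-sentence masking and chunking below it."""

    def __init__(self, batch_size, mask_id=103):
        self.batch_size = batch_size
        self.mask_id = mask_id

    def _tokens_log_prob_for_batch(self, sentences):
        # One independent call per sentence, as suggested above.
        return [self._tokens_log_prob_for_sentence(s) for s in sentences]

    def _tokens_log_prob_for_sentence(self, token_ids):
        # Build the masked variants for this sentence only ...
        variants = []
        for pos in range(len(token_ids)):
            masked = list(token_ids)
            masked[pos] = self.mask_id
            variants.append(masked)
        # ... and feed them to the model batch_size at a time.
        log_probs = []
        for start in range(0, len(variants), self.batch_size):
            batch = variants[start:start + self.batch_size]
            log_probs.extend(self._model_stub(batch))
        return log_probs

    def _model_stub(self, batch):
        # Placeholder for the BERT forward pass: one dummy
        # log-probability per masked sentence in the batch.
        return [0.0 for _ in batch]
```

With this shape, each sentence yields exactly one log-probability per token, regardless of how the variants were chunked internally.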
Notes on the code:
Here https://gist.github.com/dldk-gael/2025f0584f781716697463c62a9779ee#file-bert_score-py-L12-L23 you don't need this. Just add `return_special_tokens_mask=True` when you call the `encode_plus` method and then do the following:

```python
tokens_mask = encoding["special_tokens_mask"] == 0
ids = encoding["input_ids"][tokens_mask]
```
By doing so, when the user inputs a lot of small sentences but uses a large batch size (for instance because they have a lot of GPU memory), we won't fill the batch for each input. And sometimes the problem will be the reverse. That's why I think one batch should not necessarily contain only masked sentences coming from one single sentence.
Yes, that's true, but it also makes the logic much easier to maintain. I would rather implement a simple version first, test it, and then, if it's worth it, improve it.
Anyway, if you find a clean way to batch multiple sentences, then sure, go ahead and do it.
If you do so, just be careful not to build all the masked sequences at once anyway.
Say the user inputs 64 sentences, each 64 tokens long: you don't want to store 64^2 sequences of tokens in memory just to process up to `batch_size` sentences at a time.
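One way to satisfy that constraint is to generate the masked sequences lazily, so that only one model batch ever lives in memory. A sketch under that assumption (the function names are hypothetical, not lm-scorer API):

```python
from itertools import islice

def masked_variants(sentences, mask_id):
    """Lazily yield (sentence_index, masked_token_ids) for every
    masked variant of every sentence, one variant at a time."""
    for sent_idx, token_ids in enumerate(sentences):
        for pos in range(len(token_ids)):
            masked = list(token_ids)
            masked[pos] = mask_id
            yield sent_idx, masked

def lazy_batches(sentences, mask_id, batch_size):
    """Chunk the lazy stream into model-sized batches, so at most
    batch_size masked sequences are materialized at once."""
    stream = masked_variants(sentences, mask_id)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch
```

In the 64-sentences-of-64-tokens scenario, this keeps `batch_size` sequences in memory instead of 64^2; the carried `sentence_index` lets the consumer regroup the scores per sentence afterwards.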
Hi @simonepri, I gave a try to the second option because BERT's scoring method requires many more operations than GPT-2's, and so I want to use the full batch capacity in all cases. It is working, even if there are still some tests to update (and some minor changes to make, such as handling the empty-string case correctly), but because I am not respecting the BatchedLMScorer structure here, I want your opinion on this draft before going further.
Hi @simonepri, did you have time to take a look? For your information, I next plan to compare the different scorers using the BLiMP dataset. I don't know if you have used this dataset before (https://arxiv.org/pdf/1912.00582.pdf), but it is really related to what we are trying to achieve here! (And it is available on huggingface/nlp! :) )
Quick update: I ran some tests on the BLiMP dataset. With the GPT-2 (base) scorer I get 0.9 accuracy vs 0.85 with the BERT scorer. So I agree we can go back to a more easy-to-maintain version of the batch handler within BERT. (By the way, I observe a huge drop when using the hmean reduction strategy, but I did not understand the idea behind that one.)
@dldk-gael Sorry, I don't have time at the moment to review the code. I should be able to come back to you in about a week. Ping me in case I forget.
> but i did not understand the idea behind this one

What do you mean? It's just the harmonic mean of the individual token probabilities. I added it just for completeness (i.e. to have all three Pythagorean means). It wasn't meant to be particularly useful.
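For reference, the three Pythagorean means mentioned can be sketched as below. This is an illustration of the idea, not the lm-scorer implementation, and the strategy names are only loosely based on the one (`hmean`) mentioned in the thread:

```python
import math

def reduce_scores(probs, strategy):
    """Combine per-token probabilities into one sentence score
    using one of the three Pythagorean means."""
    n = len(probs)
    if strategy == "mean":   # arithmetic mean
        return sum(probs) / n
    if strategy == "gmean":  # geometric mean
        return math.exp(sum(math.log(p) for p in probs) / n)
    if strategy == "hmean":  # harmonic mean
        return n / sum(1 / p for p in probs)
    raise ValueError(strategy)
```

Note that the harmonic mean is pulled strongly toward the smallest probability in the sentence, which could explain the accuracy drop observed with hmean on BLiMP.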
Hi @simonepri, are you still planning to work on this one?