lm-scorer
Add BERTLMScore
Hi, I propose to add a BERT-based language modeling score.
I implemented the idea described in Effective Sentence Scoring Method Using BERT for Speech Recognition. In this paper, the authors report that they obtained better results using a bidirectional LM (compared to a unidirectional LM such as GPT-2).
The idea is to successively mask each token and retrieve the probability that BERT assigns to that token in the output.
For now, I pass all the masked sentences to BERT in one single batch, so it may cause memory problems for very long sentences.
Also, I did not manage to run the different tests with poetry as you asked. I am not familiar with automatic testing and poetry, so I will have to look into it.
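The masking idea described above can be sketched as follows. This is a minimal illustration of the masking step only, not the PR's code; `mask_variants` is a hypothetical helper name, and `103` is used as BERT's `[MASK]` token id in the toy example:

```python
def mask_variants(token_ids, mask_id):
    """Yield one copy of the sentence per token, with that token
    replaced by the mask id, plus the position that was masked."""
    for pos in range(len(token_ids)):
        masked = list(token_ids)
        masked[pos] = mask_id
        yield pos, masked

# Toy ids; each variant masks exactly one position.
variants = list(mask_variants([7, 8, 9], mask_id=103))
# variants == [(0, [103, 8, 9]), (1, [7, 103, 9]), (2, [7, 8, 103])]
```

Each variant is then fed to BERT, and the probability at the masked position gives the score for that token.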
> For now, I pass all the masked sentences to BERT in one single batch, so it may cause memory problems for very long sentences.

Can you please add a flag to allow setting up a batch size?
> Also, I did not manage to run the different tests with poetry as you asked.

There is a bug in the branch you are using (fixed here https://github.com/simonepri/lm-scorer/commit/df67f700a9c2706ed6e2babe67ee46ede81b31de) that won't allow you to run poetry. Please run the following on your branch to fetch the fix:

```shell
git fetch upstream
git rebase upstream/master
```
> I am not familiar with automatic testing and poetry, so I have to look into it.
All the tests run automatically every time you submit changes to this PR.
![](https://user-images.githubusercontent.com/3505087/80493863-dbc05400-895d-11ea-8d9a-f9b37acc8f9e.png)
If you click on Details and expand the line marked with a red cross, you will see exactly what the problem is.
OK, I succeeded in getting black and poetry to work. I also copied the test file you made for GPT-2, and it is interesting to see that it does not pass two tests:
I need to finish this project, but I do not have enough time. I need to finish this project, and I do not have enough time.
and
I can walk there. I can ambulate there.
The second one is more surprising.
I will add the flag option.
I added the possibility to feed the masked sentences to BERT in batches. I have also tried to add new test functions (to check that this does not change the score results); however, they are not taken into account when I run `poetry run task test`. (By the way, there is a small typo in the README: you ask to run `poetry run task testS`.) I will look into it tomorrow.
I would like to get #4 merged before reviewing this. Since we are changing the API, we will have to update this PR later anyway.
Things todo here before we can merge this:
- [ ] Refactor the code to match the new API
- [ ] Update the tests to match the ones implemented for GPT2
Hi Simone, I agree and will work on that next week.
Sure, no hurry! The comment was meant as a reminder for the future.
Hi @simonepri,
I have begun working on adapting BERTScore to the new API, and I have a question about the batching part. In BERTScore, you cannot directly use BatchedLMScorer because you don't batch by sentence but by masked sentence. For instance, suppose the user sets a batch size of 4 and inputs 2 sentences:
I like tennis
I like to eat potatoes
I will in fact feed 8 masked sentences to BERT:
MASK like tennis
I MASK tennis
I like MASK
MASK like to eat potatoes
I MASK to eat potatoes
I like MASK eat potatoes
I like to MASK potatoes
I like to eat MASK
And so my first batch will consist of:
MASK like tennis
I MASK tennis
I like MASK
MASK like to eat potatoes
and the second of:
I MASK to eat potatoes
I like MASK eat potatoes
I like to MASK potatoes
I like to eat MASK
So what I have done for now is the following:
- Encode every sentence provided by the user in one shot using BatchEncoding
- Expand the input ids to replace each sentence by its masked sentences
- Feed them to BERT in batches of batch_size
- Re-split the results by sentence.
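The flatten-then-resplit bookkeeping in the steps above can be sketched like this. The names are illustrative, not the actual PR code; the key point is tracking how many masked variants each sentence produced so the flat stream of results can be split back:

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive chunks of at most batch_size items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk

def resplit(flat_results, counts):
    """Split a flat result list back into per-sentence groups,
    where counts[i] is the number of masked variants of sentence i."""
    groups, start = [], 0
    for n in counts:
        groups.append(flat_results[start:start + n])
        start += n
    return groups
```

For the two example sentences above (3 and 5 tokens), `counts` would be `[3, 5]`: the 8 masked sentences are chunked into batches of 4 by `batched`, and the 8 resulting scores are regrouped per sentence by `resplit`.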
The code is not finished yet (not tested, not cleaned, ...), but just to give you an idea: https://gist.github.com/dldk-gael/2025f0584f781716697463c62a9779ee
My question is: in your opinion, what is the best way to handle this batching specificity? Should I simply override `_tokens_log_prob` from BatchedLMScorer?
What I would do is the following:
First, you put your code inside a function called `_tokens_log_prob_for_sentence`.
Then, inside `_tokens_log_prob_for_batch`, you simply do a for loop that calls `_tokens_log_prob_for_sentence` for every sentence in the batch.
In `_tokens_log_prob_for_sentence` you then generate the masked sentences and batch them according to the `batch_size` field (which is what you are already doing, I guess).
How does it sound?
PS: You don't want to tokenize all the sentences at once if you then process them independently. Doing so only introduces unnecessary padding. Just process each sentence one by one using `encode_plus` instead.
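A minimal skeleton of that structure might look as follows. Only the two method names come from the discussion; the class name and the model call are stand-ins (the stub returns a dummy log-probability per masked sentence), so this shows the control flow, not a working scorer:

```python
class BertScorerSketch:
    """Illustrative skeleton: per-sentence loop at the batch level,
    per-sentence masking and chunking below it."""

    def __init__(self, batch_size, mask_id=103):
        self.batch_size = batch_size
        self.mask_id = mask_id

    def _tokens_log_prob_for_batch(self, sentences):
        # One independent call per sentence, as suggested above.
        return [self._tokens_log_prob_for_sentence(s) for s in sentences]

    def _tokens_log_prob_for_sentence(self, token_ids):
        # Build the masked variants for this sentence only ...
        variants = []
        for pos in range(len(token_ids)):
            masked = list(token_ids)
            masked[pos] = self.mask_id
            variants.append(masked)
        # ... and feed them to the model batch_size at a time.
        log_probs = []
        for start in range(0, len(variants), self.batch_size):
            batch = variants[start:start + self.batch_size]
            log_probs.extend(self._model_stub(batch))
        return log_probs

    def _model_stub(self, batch):
        # Placeholder for the BERT forward pass: one dummy
        # log-probability per masked sentence in the batch.
        return [0.0 for _ in batch]
```

With this shape, each sentence yields exactly one log-probability per token, regardless of how the variants were chunked internally.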
Notes on the code:
Here https://gist.github.com/dldk-gael/2025f0584f781716697463c62a9779ee#file-bert_score-py-L12-L23 you don't need this. Just add `return_special_tokens_mask=True` when you call the `encode_plus` method and then do the following:

```python
tokens_mask = encoding["special_tokens_mask"] == 0
ids = encoding["input_ids"][tokens_mask]
```
By doing so, when the user inputs a lot of small sentences but uses a large batch size (for instance because they have a lot of GPU memory), we won't fill the batch for each input. And sometimes the problem will be the reverse. That's why I think one batch should not necessarily contain only masked sentences coming from one single sentence.
Yes, that's true, but it also makes the logic much easier to maintain. I would rather implement a simple version first, test it, and then, if it's worth it, improve it.
Anyway, if you find a clean way to batch multiple sentences, then sure, go ahead and do it.
If you do so, just be careful not to build all the masked sequences at once anyway.
Say the user inputs 64 sentences, each 64 tokens long: you don't want to store 64^2 sequences of tokens in memory just to process up to `batch_size` sentences at a time.
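One way to satisfy that constraint is to generate the masked sequences lazily, so that only one model batch ever lives in memory. A sketch under that assumption (the function names are hypothetical, not lm-scorer API):

```python
from itertools import islice

def masked_variants(sentences, mask_id):
    """Lazily yield (sentence_index, masked_token_ids) for every
    masked variant of every sentence, one variant at a time."""
    for sent_idx, token_ids in enumerate(sentences):
        for pos in range(len(token_ids)):
            masked = list(token_ids)
            masked[pos] = mask_id
            yield sent_idx, masked

def lazy_batches(sentences, mask_id, batch_size):
    """Chunk the lazy stream into model-sized batches, so at most
    batch_size masked sequences are materialized at once."""
    stream = masked_variants(sentences, mask_id)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch
```

In the 64-sentences-of-64-tokens scenario, this keeps `batch_size` sequences in memory instead of 64^2; the carried `sentence_index` lets the consumer regroup the scores per sentence afterwards.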
Hi @simonepri, I gave a try to the second option because BERT's scoring method requires many more operations than GPT-2's, and so I want to use the full batch capacity in all cases. It is working, even if there are still some tests to update (and some minor changes to make, such as handling the empty-string case correctly), but because I am not respecting the BatchedLMScorer structure here, I want your opinion on this draft before going further.
Hi @simonepri, did you have time to take a look? For your information, I next plan to compare the different scorers using the BLiMP dataset. I don't know if you have used this dataset before (https://arxiv.org/pdf/1912.00582.pdf), but it is really related to what we are trying to achieve here! (And it is available on huggingface/nlp! :) )
Quick update: I ran some tests on the BLiMP dataset. With the GPT-2 (base) scorer I get 0.9 accuracy vs 0.85 with the BERT scorer. So I agree we can go back to a more easy-to-maintain version of the batch handler within BERT. (By the way, I observe a huge drop when using the hmean reduction strategy, but I did not understand the idea behind that one.)
@dldk-gael Sorry, I don't have time at the moment to review the code. I should be able to come back to you in about a week. Ping me in case I forget.
> but i did not understand the idea behind this one

What do you mean? It's just the harmonic mean of the individual token probabilities. I added it just for completeness (i.e. to have all three Pythagorean means). It wasn't meant to be particularly useful.
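For reference, the three Pythagorean means mentioned can be sketched as below. This is an illustration of the idea, not the lm-scorer implementation, and the strategy names are only loosely based on the one (`hmean`) mentioned in the thread:

```python
import math

def reduce_scores(probs, strategy):
    """Combine per-token probabilities into one sentence score
    using one of the three Pythagorean means."""
    n = len(probs)
    if strategy == "mean":   # arithmetic mean
        return sum(probs) / n
    if strategy == "gmean":  # geometric mean
        return math.exp(sum(math.log(p) for p in probs) / n)
    if strategy == "hmean":  # harmonic mean
        return n / sum(1 / p for p in probs)
    raise ValueError(strategy)
```

Note that the harmonic mean is pulled strongly toward the smallest probability in the sentence, which could explain the accuracy drop observed with hmean on BLiMP.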
Hi @simonepri, are you still planning to work on this one?