
pred_loss decrease fast while avg_acc stay at 50%

Open jiqiujia opened this issue 7 years ago • 53 comments

I tried to run the code on a small dataset and found that pred_loss decreases fast while avg_acc stays at 50%. This is strange to me, since a decrease in pred_loss should indicate an increase in accuracy. [screenshot]

jiqiujia avatar Oct 23 '18 09:10 jiqiujia

I also hit the same problem on a small dataset.

wenhaozheng-nju avatar Oct 23 '18 10:10 wenhaozheng-nju

me too

NiHaoUCAS avatar Oct 23 '18 12:10 NiHaoUCAS

Hmm, interesting... Is this the result of the 0.0.1a4 version? And how did you guys print out that result?

codertimo avatar Oct 23 '18 13:10 codertimo

Hmm, interesting... Is this the result of the 0.0.1a4 version? And how did you guys print out that result?

Version 0.0.1a3. The result is printed by the bert command line, without any modification.

NiHaoUCAS avatar Oct 23 '18 13:10 NiHaoUCAS

Hmm, interesting... Is this the result of the 0.0.1a4 version? And how did you guys print out that result?

I tried 0.0.1a4 and the result is the same.

jiqiujia avatar Oct 23 '18 16:10 jiqiujia

Hmmm... anyone have any clues?

codertimo avatar Oct 24 '18 01:10 codertimo

I tried different data: continuous sentence pairs from the same document, continuous sentences concatenated into longer sentences, and query–document pairs; the result is the same. I also found that there is a big gap between next_loss and mask_loss although they use the same loss function. [screenshot]

yangze01 avatar Oct 24 '18 03:10 yangze01

Probably the criterion loss function is the problem.

import torch
import torch.nn as nn

# shape [10, 2]: log-probabilities that strongly favour class 1 (not very accurate output)
out = torch.tensor([[ -8.4014,  -0.0002],
                    [-10.3151,  -0.0000],
                    [ -8.8440,  -0.0001],
                    [ -7.5148,  -0.0005],
                    [-11.0145,  -0.0000],
                    [-10.9770,  -0.0000],
                    [-13.3770,  -0.0000],
                    [ -9.5733,  -0.0001],
                    [ -9.5957,  -0.0001],
                    [ -9.0712,  -0.0001]])
# shape [10]: next-sentence labels
label = torch.tensor([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])
original_criterion = nn.NLLLoss(ignore_index=0)  # what pretrain.py currently uses
criterion = nn.NLLLoss()                         # without ignoring label 0
original_loss = original_criterion(out, label)
loss = criterion(out, label)

With the above code snippet, original_loss is 0.0002 while loss is 5.0005. With ignore_index=0, every example whose next-sentence label is 0 is dropped from the loss, so the model is only penalized on the label-1 examples and can drive pred_loss down without ever learning to separate the two classes.

I changed the following code in trainer/pretrain.py:

self.criterion = nn.NLLLoss(ignore_index=0)

to:

self.criterion = nn.NLLLoss()

And since the magnitude of next_loss is smaller than mask_loss, I also up-weighted next_loss, and got 58% next-sentence accuracy after training on my corpus for one epoch.

cairoHy avatar Oct 24 '18 12:10 cairoHy

Probably the criterion loss function is the problem. [...]

That's right, I just figured it out as well. Also note that for the masked LM we still need ignore_index=0, since we only want to predict the masked words.
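
To spell it out, the setup we're converging on looks roughly like this (a sketch with illustrative names, not the exact pretrain.py code): one criterion without ignore_index for the next-sentence head, one with ignore_index=0 for the masked LM head, and the two losses summed.

import torch.nn as nn

# Next-sentence prediction: labels 0 and 1 are both real classes, so nothing may be ignored.
next_criterion = nn.NLLLoss()

# Masked LM: label 0 marks unmasked/padding positions, so it must be ignored.
mask_criterion = nn.NLLLoss(ignore_index=0)

def pretrain_loss(next_output, is_next, mask_output, mask_labels):
    # next_output: [batch, 2] log-probs, is_next: [batch]
    next_loss = next_criterion(next_output, is_next)
    # mask_output: [batch, seq_len, vocab] log-probs, mask_labels: [batch, seq_len]
    mask_loss = mask_criterion(mask_output.transpose(1, 2), mask_labels)
    return next_loss + mask_loss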

jiqiujia avatar Oct 24 '18 12:10 jiqiujia

@cairoHy Wow thank you for your smart analysis.

I just fixed this issue on the 0.0.1a5 version branch, and the changes are here.

https://github.com/codertimo/BERT-pytorch/blob/2a0b28218f4fde216cbb7750eb584c2ada0d487b/bert_pytorch/trainer/pretrain.py#L61-L62

https://github.com/codertimo/BERT-pytorch/blob/2a0b28218f4fde216cbb7750eb584c2ada0d487b/bert_pytorch/trainer/pretrain.py#L98-L102

codertimo avatar Oct 25 '18 01:10 codertimo

Thanks to everyone who joined this investigation :) It was totally my fault, and sorry for the inconvenience during the bug fixing.

Additionally, is there anyone who can test the new code on their own corpus? Any feedback would be welcome, and you can reinstall the new version using the commands below.

git clone https://github.com/codertimo/BERT-pytorch.git
cd BERT-pytorch
git checkout 0.0.1a5
pip install -U .

Special thanks to @jiqiujia @cairoHy @NiHaoUCAS @wenhaozheng-nju

codertimo avatar Oct 25 '18 01:10 codertimo

@cairoHy after the modification, the model can't converge. Any suggestions?

jiqiujia avatar Oct 25 '18 01:10 jiqiujia

@jiqiujia Can you tell me about the details? like figure or logs

codertimo avatar Oct 25 '18 01:10 codertimo

@codertimo The loss just doesn't converge. [screenshot]

jiqiujia avatar Oct 25 '18 02:10 jiqiujia

bert-small-25-logs.txt This is the result on my 1M-line corpus after 1 epoch; anyone is welcome to review it.

codertimo avatar Oct 26 '18 01:10 codertimo

@codertimo Could you please show your parameter settings?

yangze01 avatar Oct 26 '18 01:10 yangze01

@yangze01 just default params with batch size 128

codertimo avatar Oct 26 '18 01:10 codertimo

@codertimo I think this code has an error: if len(t1) is longer than seq_len, bert_input will only contain t1, and segment_label will likewise only cover t1. [screenshot]
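
One way to avoid that (a sketch of the pair-truncation step used by the original BERT preprocessing, with hypothetical names, not this repo's code) is to trim the longer of the two sentences until the pair fits in seq_len:

import random

def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens):
    """Trim the longer of the two token lists until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        trunc = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Drop from the front or the back at random so the same side
        # is not always discarded.
        if random.random() < 0.5:
            trunc.pop(0)
        else:
            trunc.pop()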

yangze01 avatar Oct 26 '18 01:10 yangze01

I know, but the lines in my corpus are usually shorter than 10 tokens per sentence, and seq_len should be set appropriately by the user. I don't think it's a bug, and it doesn't belong in this thread.

codertimo avatar Oct 26 '18 02:10 codertimo

@codertimo I think the next-sentence sampling has a serious bug. Suppose 'B' is the next sentence of 'A'; you may never sample a negative instance for 'A'.

wenhaozheng-nju avatar Oct 26 '18 03:10 wenhaozheng-nju

@wenhaozheng-nju I did negative sampling

https://github.com/codertimo/BERT-pytorch/blob/0d076e09fd5aef1601654fa0abfc2c7f0d57e5d9/bert_pytorch/dataset/dataset.py#L92-L99

https://github.com/codertimo/BERT-pytorch/blob/0d076e09fd5aef1601654fa0abfc2c7f0d57e5d9/bert_pytorch/dataset/dataset.py#L114-L125

codertimo avatar Oct 26 '18 03:10 codertimo

@codertimo Suppose the dataset is: A \t B; B \t C; C \t D; D \t E. After your preprocessing it becomes: A \t B; B \t Random; C \t D; D \t Random. The negative instance "A \t Random" may never be sampled.
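
A sketch of the scheme being proposed (hypothetical helper, not the repo's dataset code): emit both a positive and a negative pair for every sentence, so every 'A' also appears with a random negative. In practice one would re-sample whenever the random sentence happens to be the true next sentence.

import random

def make_nsp_examples(pairs):
    # pairs: list of (sentence, true_next_sentence) tuples, e.g. [(A, B), (B, C), ...]
    examples = []
    for t1, t2 in pairs:
        examples.append((t1, t2, 1))          # positive: the real next sentence
        t_rand = random.choice(pairs)[1]      # negative: a random sentence from the corpus
        examples.append((t1, t_rand, 0))
    return examples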

wenhaozheng-nju avatar Oct 26 '18 03:10 wenhaozheng-nju

@wenhaozheng-nju hmmm but do you think it's the main problem of this issue? I guess it's a model problem.

codertimo avatar Oct 26 '18 03:10 codertimo

@codertimo Yes, the model should sample both a positive and a negative instance for each sentence in the sentence-pair classification problem. I think the two tasks are the same in this respect.

wenhaozheng-nju avatar Oct 26 '18 03:10 wenhaozheng-nju

@wenhaozheng-nju Then do you think that if I change the negative sampling code as you suggest, this issue could be resolved?

codertimo avatar Oct 26 '18 03:10 codertimo

@codertimo I think everyone here wants to solve the problem; calm down, let's focus on the issue. @wenhaozheng-nju If you think that's the problem, you can try modifying the code and running it. (But I don't think it's the main problem; random negative sampling is a commonly used strategy.)

yangze01 avatar Oct 26 '18 03:10 yangze01

I removed dropout in all layers and now my model converges. Maybe dropout in every layer is too strong a regularization for small datasets? Or there is something wrong with the dropout in this model implementation. After 900 epochs, my training set reaches an accuracy of 81%. [screenshot]

@wenhaozheng-nju If you have any other problems, please open another issue.

jiqiujia avatar Oct 26 '18 04:10 jiqiujia

@jiqiujia Wow, that's cool. How long are the sentences in your corpus?

yangze01 avatar Oct 26 '18 04:10 yangze01

I set the --seq_len parameter to 32.

jiqiujia avatar Oct 26 '18 05:10 jiqiujia

@jiqiujia Looks pretty awesome!! Can you share the full training log as a file? And how big is your corpus? I would like to know the details. Thank you for your effort, it's really helpful to us.

codertimo avatar Oct 26 '18 05:10 codertimo

@jiqiujia I trained on my dataset for 10 hours last night, with dropout rate 0.0 (which is the same as no dropout) and dropout rate 0.1. Unfortunately, neither test loss converged. [screenshot]

codertimo avatar Oct 27 '18 02:10 codertimo

@jiqiujia Could you share more details? I trained with 1000000 samples, seq_len=64, vocab_size=100000, dropout=0, but the result is the same as before.

yangze01 avatar Oct 27 '18 02:10 yangze01

My parameter settings are as follows, and I set the next_sentence loss weight to 5 (it should probably be annealed, or just set to 1, I think); see the sketch below. I only have about 10000 sentence pairs and the vocab size is about 4000. [screenshot] By the way, I also tried a test based on OpenNMT-py's Transformer implementation, but it failed to converge. I noticed some implementation differences; Transformers seem to be tricky.
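
Concretely, the weighting amounts to something like this in the training step (a sketch; mask_loss and next_loss are the two losses discussed above, and 5.0 is the weight I mentioned):

# Up-weight the next-sentence loss so it is not drowned out by the mask loss.
loss = mask_loss + 5.0 * next_loss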

jiqiujia avatar Oct 27 '18 03:10 jiqiujia

I've tried varying the parameters, and it seems that on my dataset they don't have much impact; only dropout is critical. But my dataset is rather small. I chose a small dataset just to debug, and I will try some larger datasets. Hope this helps. You're welcome to share your experiments.

jiqiujia avatar Oct 27 '18 03:10 jiqiujia

And this is roughly the whole training log. The accuracy seems to get stuck at 81% in the end. Uploading _gaiastack_log_stdout (3).log…

jiqiujia avatar Oct 27 '18 03:10 jiqiujia

It works well in my code. Accuracy got over 90%.

The code base is version 0.0.1a3. I've changed 3 parts of this version of the code.

First, turn dropout off in every layer: dropout = 0.0

Second, fix the NLLLoss setting: change self.criterion = nn.NLLLoss(ignore_index=0) to self.criterion = nn.NLLLoss()

Third, fix the handling of the prob variable:

prob = random.random()
if prob < 0.15:
    prob /= 0.15

    # 80% randomly change token to mask token
    if prob < 0.8:
        tokens[i] = self.vocab.mask_index

    # 10% randomly change token to random token
    elif prob < 0.9:
        tokens[i] = random.randrange(len(self.vocab))
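    # remaining 10%: the token is left unchanged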

After 999 epochs, the result is as below. [screenshot]

Parameter settings:

hidden=256
layers=8
attn_heads=8
seq_len=32
batch_size=256
epochs=1000
num_workers=5
with_cuda=True
log_freq=50
corpus_lines=None
lr=1e-4
adam_weight_decay=0.01
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0

The dataset is like this:

Language : Japanese
Vocab size : 4670
Number of sentences : 1000

Of course, the changes I described above have already been made in the latest version. But if you haven't changed some parts of the code, it may not work well, so please check.

Kosuke-Szk avatar Oct 27 '18 08:10 Kosuke-Szk

@Kosuke-Szk Thank you for sharing your result with us. After I saw @Kosuke-Szk's result, I thought, "Isn't our model pretty small to train?" As you know, we reduced the model size to make it trainable on our GPUs, and the training result was bad. However, similar code (almost the same as 0.0.1a4) works with a smaller vocab size and dataset. So... if we make our model bigger, will it work? I think it's a kind of underfitting, not just a problem with the model. Does anyone have ideas about this issue?

codertimo avatar Oct 29 '18 05:10 codertimo

Hi there, I trained the model on a big dataset (wiki 2500M + BooksCorpus 800M words, same as the BERT paper) for 200000 steps and achieved an accuracy of 91%. [screenshot] I set weight decay = 0; I think using one of (dropout, weight decay) is enough.

wangwei7175878 avatar Oct 30 '18 03:10 wangwei7175878

@wangwei7175878 WOW, this is brilliant; this is a really huge step for us. Thank you for your effort and computation resources. Is there any result that used the default weight_decay? And can you share the full log as a file?

Original corpus

How did you get the original corpus? I tried very hard to get it, but I failed... I even emailed the authors to get the original corpus, but failed. If possible, could you share it so that I can test the real performance?

codertimo avatar Oct 30 '18 04:10 codertimo

Hi there, I trained the model on a big dataset (wiki 2500M + BooksCorpus 800M words, same as the BERT paper) for 200000 steps and achieved an accuracy of 91%.

@wangwei7175878 Can you share your pre-trained model? I'm really looking forward to trying this out, but I don't have that kind of processing power.

Thank you for your efforts.

briandw avatar Oct 30 '18 04:10 briandw

@codertimo The model can't converge with weight_decay = 0.01. My dataset is not exactly the original corpus, but I think it's almost the same. The wiki data can easily be downloaded from https://dumps.wikimedia.org/enwiki/ and you need a web spider to get BooksCorpus from https://www.smashwords.com/

wangwei7175878 avatar Oct 30 '18 05:10 wangwei7175878

@briandw My pre-trained model failed on downstream tasks (the fine-tuned model can't converge). I will share the pre-trained model once it works.

wangwei7175878 avatar Oct 30 '18 05:10 wangwei7175878

@codertimo Here is the whole log. It took me almost one week to train about 250000 steps. The accuracy seems to be stuck at 91%, whereas the original paper reports 98%. log_run2_hhh_all_data_next_weight_1_no_decay.txt

wangwei7175878 avatar Oct 30 '18 05:10 wangwei7175878

@wangwei7175878 Can you share your code for crawling and preprocessing the above corpora? Or, if possible, can you share the full corpus via a shared drive (Dropbox, Google Drive, etc.)? This would be really helpful to us.

codertimo avatar Oct 30 '18 05:10 codertimo

@wangwei7175878 Very interesting; the authors said 0.01 weight decay is the default parameter they used. What are your parameter settings? Are they the same as our code's defaults except for weight_decay?

codertimo avatar Oct 30 '18 05:10 codertimo

Hi there, I believe I found out why the model can't converge with weight_decay = 0.01. Following OpenAI's code here: I think BERT used AdamW instead of Adam. After rewriting the Adam code in PyTorch, my model now converges with the default settings.
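
For reference, a minimal sketch of decoupled weight decay using torch.optim.AdamW from current PyTorch releases (the hyperparameters below are just the repo defaults quoted earlier in this thread, not necessarily what was used here):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the BERT language model

# AdamW applies weight decay directly to the weights instead of folding it
# into the gradient as plain L2 regularization.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)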

wangwei7175878 avatar Oct 31 '18 05:10 wangwei7175878

@wangwei7175878 Sounds great! Can you make a pull request with your AdamW implementation? I'll test it on my corpus too 👍

codertimo avatar Oct 31 '18 06:10 codertimo

I used my corpus; after three epochs, the accuracy is 73.54%. I set weight_decay = 0. The other parameters are the defaults. Training continues.

waynedane avatar Nov 02 '18 01:11 waynedane

Just for your reference, I also confirmed the accuracy increase following @Kosuke-Szk's suggestion. [loss and accuracy plots]

Though the model was shrunk to a really small one due to the memory limitation (< 12 GB), it still worked. Hyperparameters were:

hidden=240 #768
layers=3 #12
attn_heads=3 #12
seq_len=30 # 60
batch_size=8 #32
epochs=10
num_workers=4#5
with_cuda=True
log_freq=20
corpus_lines=None
lr=1e-3
adam_weight_decay=0.00
adam_beta1=0.9
adam_beta2=0.999
dropout=0.0
min_freq=20 #7

I used 13 GB of English Wikipedia corpus with a vocabulary size of 775k. But I stopped the job at just 2% of the first epoch because it said it would take thousands of hours.

shionhonda avatar Jan 11 '19 05:01 shionhonda

Hi there, I trained the model on a big dataset (wiki 2500M + BooksCorpus 800M words, same as the BERT paper) for 200000 steps and achieved an accuracy of 91%. [...]

I need your machine, system, and GPU configuration, thanks.

I've also built the wiki + BooksCorpus dataset, and will publish docs to help with reconstruction.

zheolong avatar Jan 16 '19 09:01 zheolong

@shionhonda How do you print the accuracy every few global steps and then create that curve?

zheolong avatar Jan 25 '19 02:01 zheolong

@zheolong The loss and accuracy are exactly what data_iter prints to the console in pretrain.py. Insert the following code there and plot the resulting file.

# FILENAME is any CSV path of your choice; i, avg_loss, total_correct and
# total_element are the variables already available in the training loop.
with open(FILENAME, 'a') as f:
    f.write('%d,%f,%f\n' % (i, avg_loss / (i + 1), total_correct / total_element * 100))

shionhonda avatar Jan 25 '19 02:01 shionhonda

Oh my god! I have no idea about this. I still get the same result with avg_acc = 50, even after trying the methods in this issue.

scuhz avatar Jul 23 '20 13:07 scuhz