BERT-NER
Question: loss on padded parts of the sequence
Hi! Thank you so much for releasing BERT-NER codes.
I have a question: the sentences are padded to max_seq_length and the labels in the padded part are set to 0. But when computing the total loss in your code, I found that both the real input and the padded parts are included:
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_sum(per_example_loss)
I am wondering whether this is reasonable. Should we mask the padded part?
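For concreteness, here is a minimal sketch of what masking the padded positions could look like. It is not the repo's implementation; the toy shapes and the names logits, labels, and input_mask are assumptions (with 1 marking real tokens and 0 marking padding):
import tensorflow as tf
# Toy shapes for illustration (assumed): batch=2, seq_len=4, num_labels=5.
logits = tf.random.normal([2, 4, 5])
labels = tf.constant([[1, 2, 0, 0], [3, 1, 4, 0]])      # 0 used as the padding label
input_mask = tf.constant([[1, 1, 0, 0], [1, 1, 1, 0]])  # 1 = real token, 0 = padding
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=5, dtype=tf.float32)
per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)  # [batch, seq_len]
# Zero out the loss at padded positions and normalize by the number of real tokens.
mask = tf.cast(input_mask, tf.float32)
loss = tf.reduce_sum(per_token_loss * mask) / (tf.reduce_sum(mask) + 1e-12)
This way padded positions contribute nothing to the gradient, and the loss scale no longer depends on how much padding each batch happens to contain.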
@hyli666 Thank you, you are right: it is necessary to do masking. Otherwise the total loss will be affected and the calculation speed will drop. But I think this influence is limited and will not push the results in a worse direction; besides, I filter out the padded part when I evaluate the F-score. I will change the code when I am free. Thank you so much!
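As an illustration of that evaluation-time filtering, a minimal sketch; the names predictions, label_ids, and input_mask are placeholders, not necessarily the ones used in the repo:
# Keep only positions where input_mask == 1 before computing the F-score.
true_labels, pred_labels = [], []
for labels_row, preds_row, mask_row in zip(label_ids, predictions, input_mask):
    for label, pred, m in zip(labels_row, preds_row, mask_row):
        if m == 1:
            true_labels.append(label)
            pred_labels.append(pred)
# true_labels / pred_labels can then be passed to a conlleval- or seqeval-style scorer.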
@kyzhouhzau Hello! Thanks for providing the code to use BERT for the NER task. It really helps me a lot! I have some similar questions about both the loss calculation and the size of the labels mentioned in issue #4. Could you please explain the masking method in more detail? I have no idea how to calculate a loss that eliminates the influence of the padded part (though I did read the commented-out part of your code about the masked loss calculation). Thank you very much!