BERT-NER
Question: loss on padded parts of the sequence
Hi! Thank you so much for releasing BERT-NER codes.
I have a question: the sentences are padded to max_seq_length and the labels in the padded part are set to 0. But when computing the total loss in your code, I found that both the real input and the padded parts are included:
one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
loss = tf.reduce_sum(per_example_loss)
I am wondering whether this is reasonable. Should we mask the padded part?
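For concreteness, here is a minimal sketch of what masking the padded positions could look like. It is not the repo's implementation; the toy shapes and the names logits, labels, and input_mask are assumptions (with 1 marking real tokens and 0 marking padding):
import tensorflow as tf
# Toy shapes for illustration (assumed): batch=2, seq_len=4, num_labels=5.
logits = tf.random.normal([2, 4, 5])
labels = tf.constant([[1, 2, 0, 0], [3, 1, 4, 0]])      # 0 used as the padding label
input_mask = tf.constant([[1, 1, 0, 0], [1, 1, 1, 0]])  # 1 = real token, 0 = padding
log_probs = tf.nn.log_softmax(logits, axis=-1)
one_hot_labels = tf.one_hot(labels, depth=5, dtype=tf.float32)
per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)  # [batch, seq_len]
# Zero out the loss at padded positions and normalize by the number of real tokens.
mask = tf.cast(input_mask, tf.float32)
loss = tf.reduce_sum(per_token_loss * mask) / (tf.reduce_sum(mask) + 1e-12)
This way padded positions contribute nothing to the gradient, and the loss scale no longer depends on how much padding each batch happens to contain.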
@hyli666 Thank you, you are right: it is necessary to do masking. Otherwise the total loss will be affected and the calculation speed will drop. But I think this influence is limited and will not push the results in a worse direction; besides, I filter out the padded part when I evaluate the F-score. I will change the code when I am free. Thank you so much!
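As an illustration of that evaluation-time filtering, a minimal sketch; the names predictions, label_ids, and input_mask are placeholders, not necessarily the ones used in the repo:
# Keep only positions where input_mask == 1 before computing the F-score.
true_labels, pred_labels = [], []
for labels_row, preds_row, mask_row in zip(label_ids, predictions, input_mask):
    for label, pred, m in zip(labels_row, preds_row, mask_row):
        if m == 1:
            true_labels.append(label)
            pred_labels.append(pred)
# true_labels / pred_labels can then be passed to a conlleval- or seqeval-style scorer.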
@kyzhouhzau Hello! Thanks for providing the code to use BERT for the NER task. It really helps me a lot! I have some similar questions about both the loss calculation and the size of the labels mentioned in issue #4. Could you please explain the masking method in more detail? I have no idea how to calculate a loss that eliminates the influence of the padded part (though I did read the commented-out part of your code about the masked loss calculation). Thank you very much!