DeVLBert
issue
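Each sentence keeps 2 or 4 confounder words. Does every sentence always get exactly 2 or 4 words, and which line of code implements this? Also, in while len(tokens) > 34: tokens.pop(), why are 34 words kept, and what does this line mean?
Is the number of confounder words controlled by limit in get_id2class.py?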
Is the number of confounder words controlled by limit in get_id2class.py?
Yes.
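Each sentence keeps 2 or 4 confounder words. Does every sentence always get exactly 2 or 4 words, and which line of code implements this? Also, in while len(tokens) > 34: tokens.pop(), why are 34 words kept, and what does this line mean?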
The number of confounder words is up to you. Please refer to https://github.com/shengyuzhang/DeVLBert/blob/master/dic/get_id2class.py#L24.
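Because the number of words in a sentence varies, we truncate each sentence so that the text input is at most 36 tokens during both training and inference. Two of those positions are reserved for [CLS] and [SEP], so we keep at most 34 words per sentence. That is what while len(tokens) > 34: tokens.pop() does: it drops words from the end of the sentence until no more than 34 remain.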
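For illustration, the truncation described above can be sketched as follows. This is a minimal standalone example, not the repository's actual code: the function name truncate_and_wrap and the constant MAX_SEQ_LENGTH are hypothetical, and only the numbers 36 and 34 and the while/pop loop come from this thread.

```python
# Hypothetical sketch of the truncation step; names are illustrative only.
MAX_SEQ_LENGTH = 36  # total text length, including [CLS] and [SEP]


def truncate_and_wrap(tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Keep at most max_seq_length - 2 words, then add the special tokens."""
    tokens = list(tokens)            # copy so the caller's list is not modified
    max_words = max_seq_length - 2   # reserve two slots for [CLS] and [SEP]
    while len(tokens) > max_words:
        tokens.pop()                 # drop words from the end of the sentence
    return ["[CLS]"] + tokens + ["[SEP]"]


if __name__ == "__main__":
    long_sentence = ["word"] * 50
    print(len(truncate_and_wrap(long_sentence)))
```

A sentence shorter than 34 words is left unchanged apart from the two special tokens, so only long captions lose their trailing words.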