DeVLBert

issue

Open 184446223 opened this issue 2 years ago • 3 comments

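Each sentence keeps 2 or 4 confounder words. Does every sentence always get exactly 2 or 4 of them, and which line of code implements this? Also, in `while len(tokens) > 34: tokens.pop()`, why are 34 words kept, and what does this line do?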

184446223 · Oct 27 '22 11:10

Is the number of confounder words controlled by `limit` in get_id2class.py?

184446223 · Oct 27 '22 11:10

> Is the number of confounder words controlled by `limit` in get_id2class.py?

Yes.

jiangtann · Oct 29 '22 12:10

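> Each sentence keeps 2 or 4 confounder words. Does every sentence always get exactly 2 or 4 of them, and which line of code implements this? Also, in `while len(tokens) > 34: tokens.pop()`, why are 34 words kept, and what does this line do?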

The number of confounder words is up to you. Please refer to https://github.com/shengyuzhang/DeVLBert/blob/master/dic/get_id2class.py#L24.

Because the number of words in a sentence varies, we truncate each sentence and keep only the first 36 tokens during training and inference. Two of those 36 positions are taken by [CLS] and [SEP], so we keep only 34 words per sentence.

jiangtann · Oct 29 '22 12:10
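
For readers following along, the truncation described in the reply above can be sketched roughly as follows. This is only an illustration of the 36 = 34 + 2 arithmetic, not the repository's actual code: the function name `truncate_and_wrap` and the constant `MAX_SEQ_LENGTH` are hypothetical, and only the `while len(tokens) > 34: tokens.pop()` loop comes from the question.

```python
# Hypothetical sketch; names below are illustrative, not taken from DeVLBert.
MAX_SEQ_LENGTH = 36   # total token budget mentioned in the reply
NUM_SPECIAL_TOKENS = 2  # [CLS] and [SEP]


def truncate_and_wrap(tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Drop trailing tokens until the sentence fits, then add [CLS]/[SEP]."""
    tokens = list(tokens)  # copy so the caller's list is not modified
    # With max_seq_length = 36 this is the loop quoted in the question:
    # while len(tokens) > 34: tokens.pop()
    while len(tokens) > max_seq_length - NUM_SPECIAL_TOKENS:
        tokens.pop()  # removes the last token
    return ["[CLS]"] + tokens + ["[SEP]"]


long_sentence = ["w%d" % i for i in range(50)]
short_sentence = ["a", "dog", "runs"]
wrapped_long = truncate_and_wrap(long_sentence)
wrapped_short = truncate_and_wrap(short_sentence)
```

A sentence longer than 34 words is cut from the end, so the wrapped sequence never exceeds 36 tokens; shorter sentences pass through unchanged apart from the two special tokens.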