It looks like tnews.py doesn't use R-Drop; only the sentiment script uses R-Drop, right? Just want to confirm!
Hi @bojone, what does `unlabeled_data = [(t, 0) for t, l in train_data[num_labeled:]]` do? Could you give an example? I can't inspect the data myself. Also, the table note says the labeled training data is only 0.01 of the total — isn't that very little training data? Why is it set up that way? I don't quite understand; I'd appreciate an explanation.
```python
# Simulate labeled and unlabeled data
num_labeled = int(len(train_data) * train_frac)
unlabeled_data = [(t, 0) for t, l in train_data[num_labeled:]]
train_data = train_data[:num_labeled]
```
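For illustration, a toy version of that split might look like this (the numbers and dummy data below are made up, not from the repo):

```python
# Toy illustration with made-up data: 10,000 (text, label) pairs and
# train_frac = 0.01 leave 100 labeled examples; the remaining 9,900
# become "unlabeled" (their label slot is overwritten with a placeholder 0).
train_data = [("text %d" % i, i % 2) for i in range(10000)]
train_frac = 0.01

num_labeled = int(len(train_data) * train_frac)                 # 100
unlabeled_data = [(t, 0) for t, l in train_data[num_labeled:]]  # 9,900 pairs
train_data = train_data[:num_labeled]                           # first 100 keep labels
```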
Also, how should the unlabeled data be constructed for a three-class task? This example is binary. Do I still set all the labels to 0 in the same way?
1. Every script uses R-Drop. Surely I wouldn't carelessly publish a few files just to fool everyone?
2. The line `unlabeled_data = [(t, 0) for t, l in train_data[num_labeled:]]` is purely a psychological comfort: it sets every unlabeled example's label to 0. You could change it to `unlabeled_data = train_data[num_labeled:]` and it would be completely equivalent, because the label is in fact never used.
@bojone Thanks for the reply. I probably only saw `unlabeled_data = [(t, 0) for t, l in train_data[num_labeled:]]` in the sentiment script and missed the rest, hence the misunderstanding. Just a suggestion: if possible, it would be nice if the README described how R-Drop is applied in each Python file. One more question: why does the unlabeled data take such a large proportion? Couldn't it be smaller, say 30%? Please enlighten me!
1. Thanks for the suggestion, but I think that for anyone familiar with R-Drop and Keras itself, reading the reference code I provide is straightforward;
2. Semi-supervised learning is by definition the "small amount of labeled data + large amount of unlabeled data" setting. If you had 30% labeled data, you probably wouldn't need semi-supervised learning at all.
@bojone Thanks for the reply. Could you explain what the code below does? Why is there a `for i in range(2)`?

```python
for i in range(2):
    batch_token_ids.append(token_ids)
    batch_segment_ids.append(segment_ids)
    batch_labels.append(label)
if len(batch_token_ids) == self.batch_size * 2 or is_end:
    batch_token_ids = sequence_padding(batch_token_ids)
    batch_segment_ids = sequence_padding(batch_segment_ids)
    batch_labels = to_categorical(batch_labels, num_classes)
    yield [batch_token_ids, batch_segment_ids], batch_labels
    batch_token_ids, batch_segment_ids, batch_labels = [], [], []
```
Hello, I used your sentiment R-Drop code for a three-class similarity prediction task, and I found it performs quite a bit worse with R-Drop: accuracy and F1 on the validation set are poor (69%) compared with not using R-Drop (80%). I'd like to know what might be the cause; please help. The code is as follows:
```python
class data_generator(DataGenerator):
    """Data generator"""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text1, text2, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(
                text1, text2, maxlen=maxlen
            )
            batch_token_ids.append(token_ids)
            batch_segment_ids.append(segment_ids)
            batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []


class data_generator_rdrop(DataGenerator):
    """Data generator for R-Drop: each sample is appended twice"""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, (text1, text2, label) in self.sample(random):
            token_ids, segment_ids = tokenizer.encode(text1, text2, maxlen=maxlen)
            for i in range(2):
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_labels.append([label])
            if len(batch_token_ids) == self.batch_size * 2 or is_end:
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []


def kld_rdrop(y_true, y_pred):
    """The unsupervised part only needs to train the KL-divergence term"""
    loss = kld(y_pred[::2], y_pred[1::2]) + kld(y_pred[1::2], y_pred[::2])
    return K.mean(loss)
```
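For comparison, supervised R-Drop setups usually do not train the KL term in isolation; they add the symmetric KL term on top of the ordinary cross-entropy loss. Below is a minimal sketch of such a combined loss; the function name `crossentropy_with_rdrop` and the weight `alpha` are illustrative, not from this repo:

```python
from keras import backend as K
from keras.losses import kullback_leibler_divergence as kld

# Sketch of a combined supervised R-Drop loss (illustrative, not from the repo):
# cross-entropy on every (duplicated) sample plus a weighted symmetric KL term.
def crossentropy_with_rdrop(y_true, y_pred, alpha=4):
    ce = K.mean(K.sparse_categorical_crossentropy(y_true, y_pred))
    kl = kld(y_pred[::2], y_pred[1::2]) + kld(y_pred[1::2], y_pred[::2])
    return ce + K.mean(kl) * alpha / 4  # alpha is an assumed hyperparameter
```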
```python
bert = build_transformer_model(
    config_path=config_path,
    checkpoint_path=checkpoint_path,
    with_pool=True,
    model="bert",
    return_keras_model=False,
)

output = Dropout(rate=0.1)(bert.model.output)
output = Dense(
    units=len(labels), activation='softmax', kernel_initializer=bert.initializer
)(output)

model = keras.models.Model(bert.model.input, output)
model.summary()
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=Adam(2e-5),
    # optimizer=PiecewiseLinearLearningRate(Adam(5e-5), {10000: 1, 30000: 0.1}),
    metrics=['accuracy'],
)

# Model used for R-Drop training
model_rdrop = keras.models.Model(bert.model.input, output)
model_rdrop.compile(
    loss=kld_rdrop,
    optimizer=Adam(1e-5),
)
```
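For completeness, one plausible way to use the two compiled models together is to alternate a supervised step on normal batches with a KL consistency step on duplicated batches. This is only a rough sketch under assumed names (`batch_size`, `total_steps`, and the generator variables are hypothetical); it relies on bert4keras's `DataGenerator.forfit()`:

```python
# Hypothetical training loop: alternate supervised and R-Drop steps.
labeled_gen = data_generator(train_data, batch_size).forfit()          # normal batches
rdrop_gen = data_generator_rdrop(unlabeled_data, batch_size).forfit()  # duplicated batches

total_steps = 10000  # illustrative
for step in range(total_steps):
    x, y = next(labeled_gen)
    model.train_on_batch(x, y)        # cross-entropy on labeled data
    x, y = next(rdrop_gen)
    model_rdrop.train_on_batch(x, y)  # symmetric KL on duplicated samples
```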