ERNIE-Pytorch
Are there plans to add ernie-3.0?
Got it, will add it as soon as possible.
https://github.com/PaddlePaddle/PaddleNLP/issues/2048 It looks like there is still a problem with the vocabulary at the moment.
@nghuyong Thanks!
How is the development progress going at the moment?
@zhu1090093659 There is still a problem with the vocabulary at the moment.
Keep it up, already starred!
The official explanation is that this is a bug: after deduplicating the vocabulary, the $ at position 18005 gets used. See the Ernie 3.0 vocabulary issue #2487.
Got it, will add it soon @Davion1999
Thanks!
Please add ernie-3.0 as soon as you have time, thank you!
@heyblackC Contributions are welcome, feel free to submit a PR.
Is ernie-3.0 ready to use yet? Many thanks!
Not yet. ernie-3.0 has a task type embedding, which makes a direct conversion problematic. Contributions are welcome.
I tried converting ernie-3.0 to a torch version and skipped the task type embedding, because in the ErnieModel source code use_task_id defaults to False, so in the usual case ignoring the task type embedding should not matter much.
# torch
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('./ernie-3.0-base-zh-torch')
model = BertModel.from_pretrained('./ernie-3.0-base-zh-torch')
input_ids = torch.tensor([tokenizer.encode(text="你好", add_special_tokens=True)])
with torch.no_grad():
    pooled_output = model(input_ids)[1]
print(pooled_output.numpy())

# paddle
import paddle
import paddlenlp
from paddlenlp.transformers import ErnieModel

tokenizer = paddlenlp.transformers.AutoTokenizer.from_pretrained("ernie-3.0-base-zh")
model = paddlenlp.transformers.ErnieModel.from_pretrained("ernie-3.0-base-zh", use_task_id=False)
inputs = tokenizer("你好")
inputs = {k: paddle.to_tensor([v]) for (k, v) in inputs.items()}
print(inputs)
with paddle.no_grad():
    sequence_output, pooled_output = model(**inputs)
print(pooled_output.numpy())
After the conversion I compared the outputs for precision and found they still differ quite a lot. Output of the torch version:
[[ 9.85730708e-01 -7.40298808e-01 3.95261258e-01 -7.59342790e-01
8.96910310e-01 8.82966697e-01 -6.58721209e-01 -4.71505731e-01
-9.71126974e-01 -9.74366426e-01 -1.87828429e-02 4.24025029e-01
Output of the paddle version:
[[ 0.9762421 -0.86734784 0.68027914 -0.12715188 0.9476316 0.98650205
-0.94165057 0.569099 -0.8519937 -0.59184104 -0.900924 0.46361628
-0.95791364 -0.59297466 -0.90437275 0.5445499 -0.06301237 -0.34662145
How should I go about debugging a problem like this?
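One way to start narrowing it down is to compare the raw word-embedding weight matrices of the two checkpoints before looking at the forward pass, to rule out a weight-conversion error. A rough sketch (the attribute paths below are assumptions about the two libraries, not something verified in this thread):

import numpy as np
import torch
import paddlenlp
from transformers import BertModel

torch_model = BertModel.from_pretrained('./ernie-3.0-base-zh-torch')
paddle_model = paddlenlp.transformers.ErnieModel.from_pretrained('ernie-3.0-base-zh')

torch_emb = torch_model.embeddings.word_embeddings.weight.detach().numpy()
paddle_emb = paddle_model.embeddings.word_embeddings.weight.numpy()

# if the conversion copied the weights correctly, this should be (almost) zero
print(np.abs(torch_emb - paddle_emb).max())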
Debugging findings: torch word_embedding before LN and Dropout
embeddings  # before LN and Dropout
tensor([[[ 0.0205, -0.4873, 0.0164, ..., -0.0179, -0.0450, 2.8588],
[ 0.1141, -0.2253, -0.0258, ..., -0.0127, 0.1034, -0.0674],
[ 0.1464, -0.2443, -0.0372, ..., 0.0336, 0.0760, -0.1020],
[ 0.0180, -0.7325, -0.0793, ..., -0.0361, 0.0336, -0.0536]]])
# after LN and Dropout
embedding_output
tensor([[[-0.0514, -0.0139, 0.0620, ..., 0.0154, -0.2670, 13.2202],
[ 1.1023, 0.1074, -0.1738, ..., -0.0147, 0.7159, -0.5744],
[ 1.5808, 0.0497, -0.2245, ..., 0.4178, 0.5620, -0.8469],
[-0.0249, -1.0518, -0.4896, ..., -0.1953, 0.1429, -0.4343]]])
paddle word_embedding before LN and Dropout
embeddings  # before LN and Dropout
Tensor(shape=[1, 4, 768], dtype=float32, place=CPUPlace, stop_gradient=True,
[[[ 0.02049049, -0.48732814, 0.01639321, ..., -0.01790927,
-0.04501489, 2.85880589],
[ 0.11406462, -0.22530144, -0.02577418, ..., -0.01266452,
0.10339119, -0.06743017],
[ 0.14644144, -0.24432668, -0.03722043, ..., 0.03356759,
0.07597601, -0.10198678],
[ 0.01801493, -0.73252362, -0.07932932, ..., -0.03605647,
0.03359447, -0.05360591]]])
# after LN and Dropout, without the task type embedding
embedding_output
Tensor(shape=[1, 4, 768], dtype=float32, place=CPUPlace, stop_gradient=True,
[[[-0.05701001, -0.01576437, 0.06891097, ..., 0.01712356,
-0.29673731, 14.69413280],
[ 1.22636163, 0.11864338, -0.19329374, ..., -0.01649613,
0.79640150, -0.63884997],
[ 0. , 0.05437447, -0.24971111, ..., 0.46465424,
0.62523144, -0.94208169],
[-0.02746268, -1.17038941, -0.54445779, ..., 0. ,
0.15899718, -0.48295474]]])
The task type embedding is not added, the LN and Dropout parameters are identical, and the underlying API implementations should be aligned, yet a difference still shows up. What could be the cause?
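(For reference, intermediate tensors like the ones above can be captured without editing the library source, e.g. with a forward hook on the embeddings module. A minimal sketch on the torch side, with the module path assumed as above:)

captured = {}

def save_output(module, inputs, output):
    # store the output of the whole embeddings block (word + position + token type, then LN/Dropout)
    captured['embedding_output'] = output

hook = model.embeddings.register_forward_hook(save_output)
with torch.no_grad():
    model(input_ids)
hook.remove()
print(captured['embedding_output'])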
paddle
embeddings_  # after LN, without the task type embedding
Tensor(shape=[1, 4, 768], dtype=float32, place=CPUPlace, stop_gradient=True,
[[[-0.05130900, -0.01418793, 0.06201987, ..., 0.01541121,
-0.26706359, 13.22471905],
[ 1.10372543, 0.10677904, -0.17396435, ..., -0.01484652,
0.71676135, -0.57496494],
[ 1.58279312, 0.04893702, -0.22474000, ..., 0.41818881,
0.56270826, -0.84787351],
[-0.02471641, -1.05335045, -0.49001199, ..., -0.19554991,
0.14309746, -0.43465924]]])
embedding_output  # after LN and Dropout, without the task type embedding
Tensor(shape=[1, 4, 768], dtype=float32, place=CPUPlace, stop_gradient=True,
[[[-0.05701001, -0.01576437, 0.06891097, ..., 0.01712356,
-0.29673731, 14.69413280],
[ 1.22636163, 0.11864338, -0.19329374, ..., -0.01649613,
0.79640150, -0.63884997],
[ 0. , 0.05437447, -0.24971111, ..., 0.46465424,
0.62523144, -0.94208169],
[-0.02746268, -1.17038941, -0.54445779, ..., 0. ,
0.15899718, -0.48295474]]])
It turns out paddle's dropout modified the values. Isn't dropout supposed to just randomly zero elements? It shouldn't change the remaining values, should it?
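That part is actually expected behaviour for dropout in training mode: standard (inverted) dropout zeroes some elements and rescales the survivors by 1/(1-p), so the remaining values do change. A tiny illustration in paddle (a sketch, not from the thread):

import paddle

drop = paddle.nn.Dropout(p=0.1)
x = paddle.ones([1, 8])

drop.train()
print(drop(x))  # some elements zeroed, the rest scaled up by 1/(1-0.1)
drop.eval()
print(drop(x))  # identical to x: dropout is a no-op in eval mode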
I made a mistake: dropout should not randomly drop anything during evaluation. The earlier code had an issue and needs model.eval() added to put the model in evaluation mode. With that, the difference in the results becomes very small. torch pooled output:
[[ 9.85730708e-01 -7.40298808e-01 3.95261258e-01 -7.59342790e-01
8.96910310e-01 8.82966697e-01 -6.58721209e-01 -4.71505731e-01
-9.71126974e-01 -9.74366426e-01 -1.87828429e-02 4.24025029e-01
-5.76551020e-01 -7.90736675e-01 -9.77571666e-01 8.17567468e-01
6.43071532e-01 -4.70006205e-02 3.44053745e-01 8.76602650e-01
-3.87427926e-01 1.63349375e-01 -5.80719292e-01 -5.60073316e-01
paddle pooled output (without the task type embedding):
[[ 9.85704362e-01 -7.38807738e-01 3.96960974e-01 -7.60077178e-01
8.96803379e-01 8.83422434e-01 -6.55403256e-01 -4.74763840e-01
-9.71261859e-01 -9.74563837e-01 -1.28041422e-02 4.26309526e-01
-5.72801352e-01 -7.90877461e-01 -9.77650106e-01 8.18540633e-01
6.44670546e-01 -5.06419577e-02 3.47291678e-01 8.75799954e-01
-3.90443534e-01 1.62234485e-01 -5.83831012e-01 -5.61231434e-01
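For reference, the fix amounts to switching both models to evaluation mode before the forward pass (a sketch reusing input_ids and inputs from the comparison code above):

# torch side
model = BertModel.from_pretrained('./ernie-3.0-base-zh-torch')
model.eval()  # disables dropout so the output is deterministic
with torch.no_grad():
    pooled_output = model(input_ids)[1]

# paddle side
model = paddlenlp.transformers.ErnieModel.from_pretrained("ernie-3.0-base-zh", use_task_id=False)
model.eval()  # same: disable dropout for evaluation
with paddle.no_grad():
    sequence_output, pooled_output = model(**inputs)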
@yysirs 👍👍👍 Feel free to clean this up and submit an MR, thanks!
I also added the task type embedding. torch pooled output (+ task type embedding):
[[ 0.988166 -0.8854939 0.25455064 -0.58845514 0.93798053 0.8004532
-0.89700645 0.09135557 -0.9623787 -0.9367434 -0.5948328 0.21737
-0.85140526 -0.8041696 -0.97480065 0.68086064 0.2209907 0.26748967
-0.1568218 0.95831865 -0.04741843 0.5229524 -0.23096086 -0.39319956
paddle pooled output (+ task type embedding):
[[ 0.9881534 -0.8849289 0.25386345 -0.5901727 0.93801785 0.80024576
-0.8964503 0.08781987 -0.9625627 -0.9371933 -0.59111476 0.21921237
-0.85061604 -0.80437934 -0.974873 0.68235 0.22446258 0.2674618
-0.15446864 0.95807385 -0.05083905 0.5220058 -0.23506579 -0.3951534
Sure 😁
Great work, kudos!!
Could you also help check https://github.com/nghuyong/ERNIE-Pytorch/issues/50, which is also a task type embedding problem.
With the task type embedding added, the difference is at the second decimal place; I'm not sure how much impact this has on the final results.
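To put a number on that (a quick sketch; pooled_torch and pooled_paddle stand for the two arrays printed above):

import numpy as np

# maximum and mean absolute difference between the two pooled outputs
diff = np.abs(pooled_torch - pooled_paddle)
print(diff.max(), diff.mean())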
merged!
@yysirs May I ask how you added the task type embedding on the pytorch side? I can't reproduce your results with your conversion script.
What I submitted is the version without the task type embedding. To add the task type embedding, BertModel needs code changes; I can submit a PR for that later as well 😊
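A rough sketch of what such a change could look like (only an illustration of the idea, not the actual PR; the class name, the task_type_vocab_size value, and the loading of the task-type weight from the paddle checkpoint are assumptions):

import torch
import torch.nn as nn
from transformers import BertModel

class ErnieWithTaskType(nn.Module):
    """Wrap the converted BertModel and add a task type embedding, mirroring paddlenlp's ErnieModel."""

    def __init__(self, model_path, task_type_vocab_size=3, hidden_size=768):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_path)
        # the weight of this embedding would have to be copied over from the paddle checkpoint
        self.task_type_embeddings = nn.Embedding(task_type_vocab_size, hidden_size)

    def forward(self, input_ids, task_type_ids=None):
        if task_type_ids is None:
            # default to task id 0 when no task type is given
            task_type_ids = torch.zeros_like(input_ids)
        # add the task type embedding to the word embeddings before they enter BertEmbeddings,
        # where position/token-type embeddings, LayerNorm and dropout are applied on the sum
        inputs_embeds = self.bert.embeddings.word_embeddings(input_ids)
        inputs_embeds = inputs_embeds + self.task_type_embeddings(task_type_ids)
        return self.bert(inputs_embeds=inputs_embeds)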
Thanks!! I just refactored the code.
OK, I just submitted a PR, please take a look 😁
Yeah, I think the best approach would be to open an MR directly with huggingface so that it supports the task type parameter.
Sure, I think we could submit ERNIEModel itself, since Transformers doesn't support ERNIEModel yet 😂😂😂😂😂