In attention.py under modules, line 53 computes the attention weights shown in the picture below. The mask fill value should be -1e7, right?
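For reference, a minimal sketch of the masking step being asked about (the function name and exact constant are illustrative, not taken from the repo's attention.py): masked positions are filled with a large negative number so their softmax weight becomes effectively zero, and -1e7, -1e9, or -10000 all serve that purpose in fp32.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask, fill_value=-1e7):
    """Scaled dot-product attention with additive masking.

    mask: bool tensor, True where a position may be attended to.
    fill_value: large negative constant so masked logits vanish after softmax.
    """
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    scores = scores.masked_fill(~mask, fill_value)   # masked positions -> ~0 weight
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```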
Is there a strict procedure for back-translation? I used fairseq's en-de transformer pretrained model to generate back-translation data for training UDA, but I can't get good results...
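For context, a minimal round-trip-translation sketch in the spirit of UDA's augmentation, assuming the fairseq WMT19 torch.hub checkpoints; the model names follow fairseq's hub examples, while the sampling settings are illustrative rather than a confirmed recipe.

```python
import torch

# Back-translation via English -> German -> English round trips.
# Requires sacremoses and fastBPE for the hub tokenizer/BPE.
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
                       checkpoint_file='model1.pt',
                       tokenizer='moses', bpe='fastbpe').eval()
de2en = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.de-en',
                       checkpoint_file='model1.pt',
                       tokenizer='moses', bpe='fastbpe').eval()

def back_translate(sentences, temperature=0.9):
    # Sample instead of beam-searching in both directions so the
    # resulting paraphrases stay diverse (temperature is a tunable guess).
    de = en2de.sample(sentences, sampling=True, temperature=temperature)
    return de2en.sample(de, sampling=True, temperature=temperature)

print(back_translate(["The quick brown fox jumps over the lazy dog."]))
```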
## 1. Model structure
According to the paper, the keyword-attention layer and the regular transformer layer are each attached after the 11 regular transformer layers, but the source code (lines 212 and 226 of modeling.py) doesn't seem to do this; it looks more like a two-tower structure in which only the embedding layer is shared?
## 2. kw_mask attention
When this mask is generated, the three rows for cls and sep, unless handled specially, end up entirely filled with -10000 before entering softmax, so when these three rows go through softmax...
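On the second point, a quick numerical check of what softmax does to a score row that has been entirely filled with -10000 (the values below are illustrative, not taken from modeling.py): a constant row softmaxes to a uniform distribution, so those cls/sep rows would attend equally to every position rather than producing zeros or NaNs.

```python
import torch
import torch.nn.functional as F

seq_len = 5
row = torch.full((seq_len,), -10000.0)   # a fully masked score row
print(F.softmax(row, dim=-1))            # uniform: tensor([0.2, 0.2, 0.2, 0.2, 0.2])
```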
Why is it said that only ds_zero currently runs world_size streams on world_size GPUs, when accelerate and ds-inference should be doing the same, since they also...
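For context, a minimal sketch of what "world_size streams on world_size GPUs" looks like in the data-parallel sense: every rank holds a usable copy of the model and generates over its own shard of the inputs, in contrast to ds-inference with kernel injection, where the model is sharded and all ranks cooperate on the same request. The model name and prompts are placeholders, and the script assumes a torchrun/deepspeed-style launch that sets LOCAL_RANK and WORLD_SIZE.

```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(rank)

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2").half().to(rank).eval()

prompts = [f"prompt {i}" for i in range(16)]
my_prompts = prompts[rank::world_size]                   # each rank: its own stream

inputs = tok(my_prompts, return_tensors="pt", padding=True).to(rank)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(rank, tok.batch_decode(out, skip_special_tokens=True))
```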
With 800k+ samples, one epoch hasn't even finished yet; the results at step 600 were actually quite good, but the repetition problem gets worse and worse as training goes on, even though both the training and validation loss keep decreasing.
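Not part of the report above, but a common first check for this repetition symptom is whether it persists once decoding-time repetition controls are applied. A minimal sketch with Hugging Face `generate` (model name, prompt, and parameter values are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Once upon a time", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    repetition_penalty=1.2,      # penalize already-generated tokens
    no_repeat_ngram_size=3,      # block verbatim 3-gram repeats
)
print(tok.decode(out[0], skip_special_tokens=True))
```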
I have re-downloaded the diff weights several times, and the sum obtained after merging is always `all sum : 49715.3515625`.
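For reference, a sketch of how such an "all sum" check over a merged checkpoint is typically computed; the checkpoint path and the exact summing convention used by the repo's own check script are assumptions here, not quoted from it.

```python
import torch

# Hypothetical path to the merged (base + diff) checkpoint.
state = torch.load("merged_model/pytorch_model.bin", map_location="cpu")

total = 0.0
for name, tensor in state.items():
    if torch.is_tensor(tensor) and tensor.is_floating_point():
        total += tensor.float().sum().item()   # accumulate in fp32

print(f"all sum : {total}")
```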