Dai Zhuyun (戴竹韵)
: ) On Tue, May 21, 2019 at 2:58 AM Alishiba Dsouza wrote: > I guess it is clear now. I will try and do it and get back to...
Yes, that line can be deleted! I only added it while debugging, to make sure the tensor's shape was correct.
Hello! The steps are: tokenize, then assign each word an ID (the assignment can be random). An input sample is: the query's word IDs \t the relevant document's word IDs \t the irrelevant document's word IDs. For example: Query: Baidu; Relevant Doc: baidu.com; Irrelevant Doc: yahoo.com. Word-to-ID map: Baidu 1, Yahoo 2, Com 3. Then my sample is 1 \t 1,3...
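To make this concrete, here is a minimal sketch of the preprocessing (the helper names and file handling are my assumptions, not the project's actual script) that tokenizes, assigns random IDs, and emits one tab-separated training line per (query, relevant doc, irrelevant doc) triple:

```python
# Sketch of the training-sample format described above.
# build_vocab / to_ids are illustrative names, not from the repo.
import random

def build_vocab(texts):
    """Assign each distinct (lowercased) word a random integer ID."""
    words = {w for t in texts for w in t.lower().split()}
    ids = list(range(1, len(words) + 1))
    random.shuffle(ids)  # IDs can be assigned randomly
    return dict(zip(sorted(words), ids))

def to_ids(text, vocab):
    return ",".join(str(vocab[w]) for w in text.lower().split())

query, rel_doc, irr_doc = "Baidu", "baidu com", "yahoo com"
vocab = build_vocab([query, rel_doc, irr_doc])

# One sample per line: query IDs \t relevant-doc IDs \t irrelevant-doc IDs
sample = "\t".join([to_ids(query, vocab), to_ids(rel_doc, vocab), to_ids(irr_doc, vocab)])
print(sample)  # e.g. "1\t1,3\t2,3" under the example mapping Baidu=1, Yahoo=2, Com=3
```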
You're welcome! Questions and discussion are always welcome! On Fri, Apr 19, 2019 at 10:11 PM Chandler-Bing wrote: > Thank you! (Really, thank you so much!)
Negative samples can be sub-sampled during training; you don't need to use every document. On Mon, Apr 22, 2019 at 8:51 AM Chandler-Bing wrote: > Hi, a small question about the project. Suppose I use the 20ng dataset: 20 classes, 500 documents per class, under 40 MB of raw data, with 5 seed words per class and about 300 distinct words per document. Then the training set would be 20 * 500 * (19*500) triples: 20 seed-word classes, each with 500 positive documents, each paired with 19*500 negative documents. Multiplying in the 300 word IDs per document, a 40 MB corpus blows up into a training set of 10+ GB. Isn't that too much redundant information? It feels like meaningless expansion... (PS: is my processing correct? Thanks!)
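As a concrete illustration of the sub-sampling idea, here is a minimal sketch (the function name, `n_neg`, and the data layout are my assumptions, not project code) that draws a small fixed number of negatives per positive document instead of pairing it with every out-of-class document:

```python
# Sketch: sub-sample negatives per (class, positive doc) pair instead of
# enumerating all 19*500 out-of-class documents. Names are illustrative.
import random

def make_training_triples(docs_by_class, n_neg=5, seed=0):
    """docs_by_class: dict mapping class label -> list of documents."""
    rng = random.Random(seed)
    triples = []
    for cls, pos_docs in docs_by_class.items():
        # All documents outside this class are candidate negatives.
        neg_pool = [d for c, ds in docs_by_class.items() if c != cls for d in ds]
        for pos in pos_docs:
            for neg in rng.sample(neg_pool, min(n_neg, len(neg_pool))):
                triples.append((cls, pos, neg))
    return triples
```

With n_neg=5, the 20ng setting above produces 20 * 500 * 5 = 50,000 triples instead of 20 * 500 * 9,500, so the written training set stays a small multiple of the raw corpus size.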
Sorry, currently we do not have any sample datasets. But I am happy to answer any questions on preparing your own datasets.
Hi Craig, "What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?" -> Sorry for the confusion: they each contain half of the MS MARCO passage collection. "Providing...
Thanks for providing the numbers! I have updated the data folder with the test_results.tsv.gz files, and also uploaded the bert_term_sample_to_json.py output for MS MARCO at weighted_documents/.
Sorry about the confusing field names! "query" is not used, and "term_recall" is sufficient. I put the document text into "title" due to a legacy issue from experiments with other datasets.
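To illustrate, a single training instance might then look like the sketch below (the values are invented; the point is that the document text lives in "title" and the labels in "term_recall", while "query" stays empty):

```python
# Sketch of one training instance in the format described above.
# Field values are invented for illustration, not taken from the data.
import json

instance = {
    "query": "",                                     # not used by the trainer
    "term_recall": {"neural": 0.6, "ranking": 0.4},  # term -> target weight
    "title": "neural ranking models score documents against queries",
}
print(json.dumps(instance))
```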
This is super interesting! I tried using [1, 10, 100, 1000], and found that 100 in general worked the best for DeepCT. When using small values (e.g., 1 and 10),...
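For readers wondering how the scaling factor enters, here is a minimal sketch of turning predicted term weights into integer term frequencies for indexing (the helper name and the rounding choice are my assumptions, not the repo's exact code):

```python
# Sketch: convert predicted DeepCT term weights into integer term frequencies
# using a scaling factor. scale=100 is the value reported to work best above.
def weights_to_tf(term_weights, scale=100):
    """Map each term's predicted weight in [0, 1] to an integer tf."""
    return {t: max(0, int(round(scale * w))) for t, w in term_weights.items()}

print(weights_to_tf({"baidu": 0.93, "search": 0.31, "the": 0.01}))
# -> {'baidu': 93, 'search': 31, 'the': 1}
```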