Dai Zhuyun (戴竹韵)
: ) On Tue, May 21, 2019 at 2:58 AM Alishiba Dsouza wrote: > I guess it is clear now. I will try and do it and get back to...
Yes, that line can be deleted! I only added it while debugging, to make sure the tensor's shape was correct.
Hello! The steps are: tokenize, then assign each word an ID (the assignment can be random). An input sample is: the query's word IDs \t the relevant document's word IDs \t the irrelevant document's word IDs. For example: Query: Baidu; Relevant Doc: baidu.com; Irrelevant Doc: yahoo.com. Word-to-ID map: Baidu 1, Yahoo 2, Com 3. Then my sample is 1 \t 1,3...
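To make this concrete, here is a minimal sketch of the preprocessing (the helper names and file handling are my assumptions, not the project's actual script) that tokenizes, assigns random IDs, and emits one tab-separated training line per (query, relevant doc, irrelevant doc) triple:

```python
# Sketch of the training-sample format described above.
# build_vocab / to_ids are illustrative names, not from the repo.
import random

def build_vocab(texts):
    """Assign each distinct (lowercased) word a random integer ID."""
    words = {w for t in texts for w in t.lower().split()}
    ids = list(range(1, len(words) + 1))
    random.shuffle(ids)  # IDs can be assigned randomly
    return dict(zip(sorted(words), ids))

def to_ids(text, vocab):
    return ",".join(str(vocab[w]) for w in text.lower().split())

query, rel_doc, irr_doc = "Baidu", "baidu com", "yahoo com"
vocab = build_vocab([query, rel_doc, irr_doc])

# One sample per line: query IDs \t relevant-doc IDs \t irrelevant-doc IDs
sample = "\t".join([to_ids(query, vocab), to_ids(rel_doc, vocab), to_ids(irr_doc, vocab)])
print(sample)  # e.g. "1\t1,3\t2,3" under the example mapping Baidu=1, Yahoo=2, Com=3
```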
You're welcome! Questions and discussion are always welcome! On Fri, Apr 19, 2019 at 10:11 PM Chandler-Bing wrote: > Thank you! (Really, thank you so much!)
Negative samples can be sub-sampled during training; you don't need to use every document. On Mon, Apr 22, 2019 at 8:51 AM Chandler-Bing wrote: > Hi, a small question about the project. Suppose I use the 20ng dataset: 20 classes, 500 documents per class, under 40 MB of raw data, with 5 seed words per class and about 300 distinct words per document. Then the training set would be 20 * 500 * (19*500) triples: 20 seed-word classes, each with 500 positive documents, each paired with 19*500 negative documents. Multiplying in the 300 word IDs per document, a 40 MB corpus blows up into a training set of 10+ GB. Isn't that too much redundant information? It feels like meaningless expansion... (PS: is my processing correct? Thanks!)
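As a concrete illustration of the sub-sampling idea, here is a minimal sketch (the function name, `n_neg`, and the data layout are my assumptions, not project code) that draws a small fixed number of negatives per positive document instead of pairing it with every out-of-class document:

```python
# Sketch: sub-sample negatives per (class, positive doc) pair instead of
# enumerating all 19*500 out-of-class documents. Names are illustrative.
import random

def make_training_triples(docs_by_class, n_neg=5, seed=0):
    """docs_by_class: dict mapping class label -> list of documents."""
    rng = random.Random(seed)
    triples = []
    for cls, pos_docs in docs_by_class.items():
        # All documents outside this class are candidate negatives.
        neg_pool = [d for c, ds in docs_by_class.items() if c != cls for d in ds]
        for pos in pos_docs:
            for neg in rng.sample(neg_pool, min(n_neg, len(neg_pool))):
                triples.append((cls, pos, neg))
    return triples
```

With n_neg=5, the 20ng setting above produces 20 * 500 * 5 = 50,000 triples instead of 20 * 500 * 9,500, so the written training set stays a small multiple of the raw corpus size.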
Sorry, currently we do not have any sample datasets. But I am happy to answer any questions on preparing your own datasets.
Hi Craig, "What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?" -> Sorry for the confusion: they each contain half of the MS MARCO passage collection. "Providing...
Thanks for providing the numbers! I have updated the data folder with the test_results.tsv.gz files, and also uploaded the bert_term_sample_to_json.py output for MS MARCO at weighted_documents/.
Sorry about the confusing field names! "query" is not used, and "term_recall" is sufficient. I put the document text into "title" due to a legacy issue from experiments with other datasets.
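To illustrate, a single training instance might then look like the sketch below (the values are invented; the point is that the document text lives in "title" and the labels in "term_recall", while "query" stays empty):

```python
# Sketch of one training instance in the format described above.
# Field values are invented for illustration, not taken from the data.
import json

instance = {
    "query": "",                                     # not used by the trainer
    "term_recall": {"neural": 0.6, "ranking": 0.4},  # term -> target weight
    "title": "neural ranking models score documents against queries",
}
print(json.dumps(instance))
```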
This is super interesting! I tried using [1, 10, 100, 1000], and found that 100 in general worked the best for DeepCT. When using small values (e.g., 1 and 10),...
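For readers wondering how the scaling factor enters, here is a minimal sketch of turning predicted term weights into integer term frequencies for indexing (the helper name and the rounding choice are my assumptions, not the repo's exact code):

```python
# Sketch: convert predicted DeepCT term weights into integer term frequencies
# using a scaling factor. scale=100 is the value reported to work best above.
def weights_to_tf(term_weights, scale=100):
    """Map each term's predicted weight in [0, 1] to an integer tf."""
    return {t: max(0, int(round(scale * w))) for t, w in term_weights.items()}

print(weights_to_tf({"baidu": 0.93, "search": 0.31, "the": 0.01}))
# -> {'baidu': 93, 'search': 31, 'the': 1}
```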