baidu_ultr_dataset

An unbiased learning-to-rank dataset from Baidu

Results 9 baidu_ultr_dataset issues

Hi, I tried to download your dataset, but none of the links work for me. They download only empty zip archives, and Google Drive returns a 404.

Hi, I see that we only use the query, title, and abstract to train the model. But there are many other features in the dataset, including Continuous Value and Discrete Number. How...

Thank you very much for making this dataset public. I have a quick question about the test dataset. I understand that for each query you take the top 30 documents from...

Could you add a LICENSE file? That way everyone would know what they can and cannot do with the dataset.

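While exploring the data, I found that the training set contains two document feature columns with the same name, ' Displayed Count'. The column descriptions on the dataset's [web page](https://searchscience.baidu.com/dataset_ultr.html) also list 'Displayed Count' twice. However, when I checked the actual values, the two columns are not entirely identical. What causes this, and which column should be treated as authoritative? Below are the results for part-00001.gz: ![image](https://user-images.githubusercontent.com/29994223/208321929-35beb3d5-dd3a-4445-bbe8-6ce60cb66abe.png)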

Hi, it seems that the subsequent reformulated queries are not separated from one another; only the tokens are separated. For example, here is the first line from `part-00000`: `b'10000014169022957140\t4241\x015865\x013472\x0112631\x012962\x018468\x0116789\t4241\x015865\x013472\x0112631\x019066\x0112307\x017966\x016488\x016145\x012689\x014019\x0118161\x0121376\x014241\x015865\x013472\x0112631\x012962\x018468\x0116789\x0121376\x0115038\x0110191\x011251\x016488\x019066\x0112307\x017966\x016488\n'` The query id is...
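For readers trying to follow the sample line above, here is a minimal sketch of how such a line could be split. It assumes only what the sample itself shows: fields separated by tabs and tokens within a field separated by the `\x01` byte. The function name `parse_line` is hypothetical, the sample is a shortened illustration in the same shape, and the meaning of each field is not confirmed here.

```python
def parse_line(raw: bytes) -> list[list[str]]:
    # Decode the raw bytes and drop the trailing newline.
    text = raw.decode("utf-8").rstrip("\n")
    # Assumption: tabs separate fields, the 0x01 byte separates tokens in a field.
    return [field.split("\x01") for field in text.split("\t")]


# Shortened illustrative sample, not a full line from the dataset.
sample = b"10000014169022957140\t4241\x015865\x013472\t4241\x015865\x019066\n"
fields = parse_line(sample)

print(fields[0])  # first field, a single token
print(fields[1])  # tokens of the second field
print(fields[2])  # tokens of the third field
```

Note that nothing in this split recovers boundaries between reformulated queries inside the third field, which is the problem the issue describes.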

There is demo code for downloading the dataset: https://github.com/ChuXiaokai/baidu_ultr_dataset/blob/main/download_train_data.py ``` import os for i in range(10): name = '0' * (5 - len(str(i))) + str(i) cmd = "wget " + "...
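The padding expression in that snippet, `'0' * (5 - len(str(i))) + str(i)`, produces five-digit zero-padded indices. As a small sketch, the same strings can be built with `str.zfill`; the helper name `padded_indices` is hypothetical, and the download command itself is left out because it is truncated in the excerpt.

```python
def padded_indices(count: int, width: int = 5) -> list[str]:
    # Zero-pad each index to a fixed width, e.g. 3 -> "00003".
    return [str(i).zfill(width) for i in range(count)]


names = padded_indices(10)

# The manual padding used in the quoted snippet, for comparison.
manual = ["0" * (5 - len(str(i))) + str(i) for i in range(10)]

print(names[0], names[-1])
print(names == manual)
```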

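Hello organizers~ I would like to ask whether there are any restrictions on the methods used in the fine-tuning stage of task 2. My understanding is that, without such restrictions, task 1 and task 2 complement each other: if you get a good score on the task 1 debiasing leaderboard, then with the pretrained model unchanged you can also get a good score on the task 2 leaderboard 😂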