baidu_ultr_dataset issues

Download links not working.

1

Hi, I tried to download your dataset, but none of the links work for me. They download only empty zip archives, and Google Drive returns 404.

Kesanov

How to use other features?

3

Hi, I see that we only use the query, title, and abstract to train the model. But there are a lot of other features in the dataset, including Continuous Value and Discrete Number. How...

we1559

Document Order in Test Dataset

4

Thank you very much for making this dataset public. I have a quick question about the test dataset. I understand that for each query you take the top 30 documents from...

citizenkeynes

Would you add a LICENSE file?

1

Could you add a LICENSE file? That way everyone would know what they can and cannot do with the dataset.

wangshusen

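While exploring the data, I found that the training set contains two document feature columns with the same name, 'Displayed Count'. The column descriptions on the dataset [web page](https://searchscience.baidu.com/dataset_ultr.html) also list 'Displayed Count' twice. However, when I checked the actual values, the two columns turned out not to be fully identical. What causes this, and which column should be treated as authoritative? Below are the results from part-00001.gz: ![image](https://user-images.githubusercontent.com/29994223/208321929-35beb3d5-dd3a-4445-bbe8-6ce60cb66abe.png)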

gluver

Question about the Reformulated Queries

3

Hi, it seems that the subsequent reformulated queries are not separated from one another; only the tokens are separated. For example, here is the first line from `part-00000`: `b'10000014169022957140\t4241\x015865\x013472\x0112631\x012962\x018468\x0116789\t4241\x015865\x013472\x0112631\x019066\x0112307\x017966\x016488\x016145\x012689\x014019\x0118161\x0121376\x014241\x015865\x013472\x0112631\x012962\x018468\x0116789\x0121376\x0115038\x0110191\x011251\x016488\x019066\x0112307\x017966\x016488\n'` The query id is...

tesixiao
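
To make the layout of the quoted row easier to inspect, here is a minimal Python sketch that splits it on the two delimiters visible in the sample: `\t` between fields and `\x01` between tokens. The function name `parse_query_row` is hypothetical, and reading the first field as the query id and the later fields as token-id lists is an assumption based only on this one sample row, not on the dataset documentation.

```python
# First line of part-00000, copied from the issue excerpt above.
RAW = (
    b'10000014169022957140\t4241\x015865\x013472\x0112631\x012962\x018468\x0116789'
    b'\t4241\x015865\x013472\x0112631\x019066\x0112307\x017966\x016488\x016145'
    b'\x012689\x014019\x0118161\x0121376\x014241\x015865\x013472\x0112631\x012962'
    b'\x018468\x0116789\x0121376\x0115038\x0110191\x011251\x016488\x019066\x0112307'
    b'\x017966\x016488\n'
)


def parse_query_row(line: bytes):
    """Split one row into (first_field, list_of_token_id_lists).

    Assumption: fields are tab-separated and tokens inside a field are
    separated by the byte 0x01, as in the sample row.
    """
    fields = line.rstrip(b"\n").split(b"\t")
    first_field = fields[0].decode("ascii")
    token_fields = [[int(tok) for tok in f.split(b"\x01")] for f in fields[1:]]
    return first_field, token_fields


query_id, token_fields = parse_query_row(RAW)
print(query_id)
for tokens in token_fields:
    print(len(tokens), tokens)
```

Splitting this way shows only two token fields after the first field, which matches the observation in the issue that the individual reformulated queries are not delimited at the field level.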

How to quickly download the large training dataset?

There is a demo script for downloading the dataset: https://github.com/ChuXiaokai/baidu_ultr_dataset/blob/main/download_train_data.py ``` import os for i in range(10): name = '0' * (5 - len(str(i))) + str(i) cmd = "wget " + "...

zoulixin93
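
One possible way to speed up a loop like the one in the demo script is to run several `wget` processes concurrently. The Python sketch below is an illustration only, not the repository's code: `BASE_URL`, the `part-XXXXX.gz` file pattern, and the helper names `part_name`, `build_command`, and `download_all` are hypothetical, because the real URL is truncated in the excerpt. It assumes `wget` is installed; `-c` is the standard `wget` flag for resuming partial downloads.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder: the real download URL is elided in the excerpt.
BASE_URL = "https://example.invalid/train/"


def part_name(i: int) -> str:
    """Zero-pad an index to five digits, like '0' * (5 - len(str(i))) + str(i)."""
    return f"{i:05d}"


def build_command(i: int, base_url: str = BASE_URL) -> list:
    """Build a wget command for one part; the file pattern is an assumption."""
    return ["wget", "-c", f"{base_url}part-{part_name(i)}.gz"]


def download_all(indices, max_workers: int = 4, run=subprocess.run):
    """Run one command per index on a thread pool and return the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda i: run(build_command(i), check=False), indices))


# Example call (not executed here, since BASE_URL is a placeholder):
# results = download_all(range(10), max_workers=4)
```

Threads suffice here because each worker only waits on an external `wget` process. A modest `max_workers` is usually the safer choice, since the server may throttle many parallel connections.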