RecBole [🐛BUG] IndexError: Can not load the data without registration !

Hello! Thank you for such a great tool!

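Describe the bug

While reproducing RecBole-PJF, I got the error: IndexError: Can not load the data without registration !

Debug analysis: debugging shows that, inside the build function, only dataset[0] is non-empty; dataset[1] and dataset[2] are both empty. My initial analysis is that the dataset split went wrong: train contains data, while valid and test contain none.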

The datasets come from datasets = super(PJFDataset, self).build(), where

class PJFDataset(Dataset): """:class:PJFDataset is inherited from :class:recbole.data.dataset.Dataset

so the function being called is the one in recbole.data.dataset.Dataset. I looked at the corresponding RecBole function and found no error.

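Expected behavior

My questions: in which function of which file does the RecBole framework actually split the dataset? Why would train contain data while valid and test are empty? And how can I get [RecBole-PJF] to run?

To reproduce. Steps to reproduce this bug:

overall.yaml is shown in the screenshot below.
zhilian.yaml is shown in the screenshot below: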

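Environment (please complete the following information):

OS: [Windows]
RecBole version: [RecBole-PJF]
Python version: [3.11]
PyTorch version: [2.1.0+cu121]
cudatoolkit version: [CUDA Version: 12.2]
dataset: [zhilian]
(I first ran prepare_zhilian.py to generate the atomic files.)

I look forward to your reply. Thank you for your time.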

Nov 19 '23 15:11 crystalwang2020

@crystalwang2020 Hello! Dataset splitting can be configured through the eval_args parameter; please refer to our documentation for details.

Nov 20 '23 02:11 zhengbw0324

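Thank you very much for the quick reply. Sorry, the earlier overall.yaml screenshot was incomplete; the full one is shown below. Following your reference link, I see that the dataset split is set with eval_args: split: {'RS':[0.8,0.1,0.1]}. Yet the actual split leaves valid and test empty. What could be causing this? Is there anything else that affects dataset splitting?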
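One way the split described above could leave valid and test empty, offered as a hypothesis that the thread does not confirm: if the ratios are applied separately to each user's interactions (grouping by user is an assumption here) and every user has only one interaction, each group's 10% shares round down to zero. The sketch below is a simplified stand-in, not RecBole's actual code; split_group_by_ratio and split_dataset are hypothetical helpers, and the real implementation may handle remainders differently.

```python
from typing import List, Sequence


def split_group_by_ratio(indices: Sequence[int], ratios: Sequence[float]) -> List[List[int]]:
    """Split one group's row indices by ratio (simplified, hypothetical helper)."""
    n = len(indices)
    # Truncate each share to an integer; the first split absorbs the remainder.
    counts = [int(r * n) for r in ratios]
    counts[0] = n - sum(counts[1:])
    parts, start = [], 0
    for count in counts:
        parts.append(list(indices[start:start + count]))
        start += count
    return parts


def split_dataset(groups: Sequence[Sequence[int]], ratios: Sequence[float]) -> List[List[int]]:
    """Apply the ratio split to every group and merge the pieces per split."""
    splits: List[List[int]] = [[] for _ in ratios]
    for group in groups:
        for bucket, part in zip(splits, split_group_by_ratio(group, ratios)):
            bucket.extend(part)
    return splits


ratios = [0.8, 0.1, 0.1]

# 35 users with exactly one interaction each: [[0], [1], ..., [34]].
one_per_user = [[i] for i in range(35)]
train, valid, test = split_dataset(one_per_user, ratios)
print(len(train), len(valid), len(test))  # 35 0 0

# 3 users with ten interactions each: every split receives some rows.
ten_per_user = [list(range(u * 10, u * 10 + 10)) for u in range(3)]
train2, valid2, test2 = split_dataset(ten_per_user, ratios)
print(len(train2), len(valid2), len(test2))  # 24 3 3
```

With 35 single-interaction users everything lands in train, while ten interactions per user give non-empty valid and test sets under the same ratios.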

I added the eval_neg_sample_args parameters below in response to an error message, modeling them on train_neg_sample_args. Is there documentation I can consult for these settings?

eval_neg_sample_args:
  distribution: uniform
  sample_num: 1
  alpha: 1.0
  dynamic: False

Nov 20 '23 02:11 crystalwang2020

Could you post the command you ran? Is the dataset zhilian? With exactly the configuration you posted, everything runs normally on my side.

Nov 27 '23 09:11 flust

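The commands I ran are:
python run_recbole_pjf.py -d zhilian -m PJFNN
python run_recbole_pjf.py -d zhilian -m DPGNN

Question: is the eval_neg_sample_args configuration in overall.yaml wrong? Why is it that, regardless of the model (DPGNN, PJFNN, BERT, BPR, etc.), the results always come out as recall=1, precision@5=0.2, precision@10=0.1?

See the screenshot below:

Thank you very much for your reply!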

Dec 04 '23 13:12 crystalwang2020

Were you able to figure out your data registration error? I have the same.

Dec 06 '23 03:12 anushka48

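Did you manage to resolve the dataset problem on your side? I still cannot reproduce it. If you did, could you share the fix?
With your yaml settings I still cannot reproduce the problem; my results remain normal. For reference, the key parts of my environment are: OS: [Mac], recbole 1.1.1, Python 3.7.16, torch 1.13.1, torch-geometric 2.3.1.
As for recall=1 and precision=0.2: judging from the symptoms, the test set may be too easy, or the model may be overfitting, so the correct item always lands in the top 5. In that case recall@5 is 1 (every positive is ranked in the top 5) and precision is 0.2 (one positive among the top 5) or 0.1 (one positive among the top 10). If that is the cause, you could try switching the evaluation to full ranking: set mode under eval_args to uni20. P.S. With the default settings on CPU, python run_recbole_pjf.py -d zhilian -m BPR runs for roughly 30 epochs on my side, and recall@5 is around 0.3.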
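The arithmetic behind those numbers can be checked with a small worked example. This is a minimal sketch using the standard definitions of recall@k and precision@k, assuming each user has exactly one positive item; the function names and the job_* identifiers are made up for illustration and are not RecBole's metric API.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions that hold a relevant item."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


# A ranking of ten candidates in which the single positive sits at position 3.
ranked = [f"job_{i}" for i in range(10)]
relevant = {"job_2"}

print(recall_at_k(ranked, relevant, 5))      # 1.0
print(precision_at_k(ranked, relevant, 5))   # 0.2
print(precision_at_k(ranked, relevant, 10))  # 0.1
```

Whenever the only positive is inside the top 5, these three values are fixed regardless of the model, which matches the pattern reported above.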

Dec 17 '23 09:12 flust

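Issue 1: Debugging showed that grouped_inter_feat_index in the original code is a list of lists, e.g. [[0],[1],[2],[3],...,[34]], so the dataset could not be split by ID. We therefore changed the nested list into a flat list, i.e. [0,1,2,3,4,...34]. The modified code is shown in the screenshot below, with the file location in the bottom-left corner of the image. The changes are on lines 1659-1669, code by ojw 231121.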

Issues 2 + 3:

Following your configuration, I set up a fresh WSL environment, changed mode under eval_args to uni20, and attached the complete overall.yaml as the last screenshot in this comment. The terminal command was: (py37) drwang@Hao:/mnt/c/model/RecBole-PJF-main$ python run_recbole_pjf.py -m BPR -d zhilian

The result was: test result: (OrderedDict([('recall@5', 0.0), ('precision@5', 0.0), ('ndcg@5', 0.0), ('mrr@5', 0.0)]), OrderedDict([('recall@5', 0.0), ('precision@5', 0.0), ('ndcg@5', 0.0), ('mrr@5', 0.0)])), as shown in the screenshot below:

If I change overall.yaml as follows: topk: [10] valid_metric: recall@10

and run the same terminal command as above: python run_recbole_pjf.py -m BPR -d zhilian

the result is: test result: (OrderedDict([('recall@10', 0.3333), ('precision@10', 0.0333), ('ndcg@10', 0.1003), ('mrr@10', 0.037)]), OrderedDict([('recall@10', 0.0), ('precision@10', 0.0), ('ndcg@10', 0.0), ('mrr@10', 0.0)])), as shown in the screenshot below:

The complete overall.yaml is shown in the screenshot below:

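Questions:

Could you post the versions of all your packages and your complete overall.yaml? I will match my configuration against them.
Why do all the metrics come out as 0?

Thank you very much for your reply. I look forward to hearing from you.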

Dec 19 '23 13:12 crystalwang2020

@anushka48
We identified through debugging that the 'grouped_inter_feat_index' in the original code is in the form of a list nested within another list, such as [[0], [1], [2], [3], ..., [34]]. Consequently, it was not possible to split the dataset based on ID. Therefore, we modified the nested list to a single list format, i.e., [0, 1, 2, 3, 4, ..., 34].

The modified code is shown in the attached image, with the specific file location indicated in the bottom left corner. The changes were made in lines 1659-1669, code by ojw 231121.
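For context on the nested shape described above, the following minimal sketch groups row indices by user ID. group_row_indices is a hypothetical helper, not the RecBole function; it only illustrates that each inner list holds the rows of one user, so a dataset with one interaction per user naturally looks like [[0], [1], [2], ...], and flattening discards the per-user boundaries.

```python
from collections import defaultdict


def group_row_indices(user_ids):
    """Group row positions by user ID, keeping first-seen user order."""
    groups = defaultdict(list)
    for row, user_id in enumerate(user_ids):
        groups[user_id].append(row)
    return list(groups.values())


# One interaction per user: every inner list has a single row index.
single = group_row_indices(["u0", "u1", "u2"])
print(single)  # [[0], [1], [2]]

# Several interactions per user: inner lists mark which rows belong together.
multi = group_row_indices(["u0", "u1", "u0", "u1", "u0"])
print(multi)  # [[0, 2, 4], [1, 3]]

# Flattening keeps the row indices but loses the information about
# where one user's rows end and the next user's rows begin.
flat = [row for group in multi for row in group]
print(flat)  # [0, 2, 4, 1, 3]
```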

Dec 19 '23 13:12 crystalwang2020

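I checked grouped_inter_feat_index on my side, and it is indeed a list of lists; that is expected. The abnormal results on your side are most likely caused by the incorrect change you made there.
overall.yaml and zhilian.yaml:
Run log:
I suggest checking the data again. The data directory after processing: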
Full environment: absl-py==1.4.0 anyio==3.7.1 appnope==0.1.3 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 astor==0.8.1 attrs==23.1.0 backcall==0.2.0 beautifulsoup4==4.12.2 bleach==6.0.0 blobfile==2.0.2 cachetools==5.3.2 cffi==1.15.1 charset-normalizer==3.3.0 colorama==0.4.4 colorlog==4.7.2 cycler==0.11.0 debugpy==1.6.7 decorator==5.1.1 defusedxml==0.7.1 entrypoints==0.4 exceptiongroup==1.1.2 fastjsonschema==2.17.1 filelock==3.12.2 fire==0.5.0 fonttools==4.38.0 fsspec==2023.1.0 ftfy==6.1.1 gast==0.2.2 google-auth==2.23.4 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 grpcio==1.56.0 h5py==3.8.0 huggingface-hub==0.16.4 humanize==4.6.0 idna==3.4 imageio==2.31.2 importlib-metadata==6.7.0 importlib-resources==5.12.0 ipykernel==6.16.2 ipython==7.34.0 ipython-genutils==0.2.0 ipywidgets==8.0.7 jedi==0.18.2 jieba==0.42.1 Jinja2==3.1.2 joblib==1.3.2 jsonschema==4.17.3 jupyter==1.0.0 jupyter-console==6.6.3 jupyter-server==1.24.0 jupyter_client==7.4.9 jupyter_core==4.12.0 jupyterlab-pygments==0.2.2 jupyterlab-widgets==3.0.8 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.2 kiwisolver==1.4.5 lxml==4.9.3 Markdown==3.4.3 MarkupSafe==2.1.3 matplotlib==3.5.3 matplotlib-inline==0.1.6 mistune==3.0.1 nbclassic==1.0.0 nbclient==0.7.4 nbconvert==7.6.0 nbformat==5.8.0 nest-asyncio==1.5.6 networkx==2.6.3 notebook==6.5.4 notebook_shim==0.2.3 numpy==1.21.6 oauthlib==3.2.2 opt-einsum==3.3.0 packaging==23.1 pandas==1.3.5 pandocfilters==1.5.0 parso==0.8.3 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.5.0 pkgutil_resolve_name==1.3.10 plotly==5.18.0 prometheus-client==0.17.1 prompt-toolkit==3.0.39 protobuf==3.19.0 psutil==5.9.5 ptyprocess==0.7.0 pyasn1==0.5.1 pyasn1-modules==0.3.0 pycparser==2.21 pycryptodomex==3.19.0 Pygments==2.15.1 pyparsing==3.1.1 pyrsistent==0.19.3 python-dateutil==2.8.2 pytz==2023.3.post1 PyWavelets==1.3.0 PyYAML==6.0.1 pyzmq==25.1.0 qtconsole==5.4.3 QtPy==2.3.1 recbole==1.1.1 regex==2023.10.3 requests==2.31.0 requests-oauthlib==1.3.1 rsa==4.9 safetensors==0.4.0 scikit-image==0.19.3 scikit-learn==1.0.2 scipy==1.7.3 Send2Trash==1.8.2 six==1.16.0 sniffio==1.3.0 soupsieve==2.4.1 tabulate==0.9.0 tenacity==8.2.3 tensorboard==2.11.2 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow-estimator==1.15.1 termcolor==2.3.0 terminado==0.17.1 thop==0.1.1.post2209072238 threadpoolctl==3.1.0 tifffile==2021.11.2 tinycss2==1.2.1 tokenizers==0.13.3 torch==1.13.1 torch-geometric==2.3.1 torchvision==0.14.1 tornado==6.2 tqdm==4.66.1 traitlets==5.9.0 transformers==4.30.2 typing_extensions==4.7.1 urllib3==2.0.6 wcwidth==0.2.6 webencodings==0.5.1 websocket-client==1.6.1 Werkzeug==2.2.3 widgetsnbextension==4.0.8 wrapt==1.15.0 zipp==3.15.0

Dec 20 '23 09:12 flust

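@flust Hello, and thank you for your reply. Reply 1:

If I do not change the nested list, it still raises IndexError: Can not load the data without registration !

After changing the nested list, with model=BPR it runs but every metric is 0; with model=BPJFNN, recall@5 is around 0.3.

Terminal command: python run_recbole_pjf.py -m BPR -d zhilian

Terminal command: python run_recbole_pjf.py -m BPJFNN -d zhilian

Reply 2: I checked overall.yaml and zhilian.yaml against yours; they are identical.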

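Reply 3: Comparing the other parameters, the following do differ:

Average actions of users: 1.0
The number of items: 23
Average actions of items: 1.5909090909090908
The number of inters: 35

BPR( (user_embedding): Embedding(4501, 128) (item_embedding): Embedding(23, 128) (loss): BPRLoss() )

My initial judgment is that the data differ. My dataset has 23 items, whereas yours has 19379. However, the zhilian.item file does contain 19379 lines, so I suspect something went wrong during data processing.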

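Reply 4: The data format is consistent with yours.

Reply 5: Thank you very much for providing all the package versions. I have reconfigured my environment to match them, and I really appreciate your generous and kind help.

Question: if flattening the nested list is wrong and causes all evaluation metrics to be 0, while leaving it unchanged raises the IndexError, how should this IndexError be handled?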

Dec 20 '23 14:12 crystalwang2020

@flust

The full IndexError output. The terminal command was: python run_recbole_pjf.py -m BPR -d zhilian

Dec 20 '23 14:12 crystalwang2020

Please check your data; the item count is definitely wrong.
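One way to run such a check is to compare the distinct item IDs that appear in the interaction file with those listed in the item file. The sketch below assumes tab-separated atomic files whose header row uses name:type column labels; the column name item_id:token and the tiny demo files are placeholders, not the actual zhilian field names.

```python
import csv
import os
import tempfile


def distinct_values(path, column):
    """Return the set of distinct values in one column of a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[column] for row in reader}


# Tiny stand-in files so the sketch runs on its own; replace with real paths.
workdir = tempfile.mkdtemp()
inter_path = os.path.join(workdir, "demo.inter")
item_path = os.path.join(workdir, "demo.item")

with open(inter_path, "w", encoding="utf-8") as handle:
    handle.write("user_id:token\titem_id:token\n")
    handle.write("u1\tj1\nu2\tj1\nu3\tj2\n")

with open(item_path, "w", encoding="utf-8") as handle:
    handle.write("item_id:token\n")
    handle.write("j1\nj2\nj3\nj4\n")

items_in_inter = distinct_values(inter_path, "item_id:token")
items_in_item = distinct_values(item_path, "item_id:token")
never_interacted = items_in_item - items_in_inter

# A large gap between the two counts suggests the interaction file
# was truncated or filtered during preprocessing.
print(len(items_in_inter), len(items_in_item), sorted(never_interacted))
```

A mismatch like 23 versus 19379 would show up immediately in the first two numbers.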

Dec 20 '23 15:12 flust

I built a RecBole-compatible dataset from my own data. It has only 14 items, and I also hit raise IndexError("Can not load the data without registration !"). How can this be resolved?

Mar 06 '24 02:03 ysf-gd

I double-checked and re-downloaded the dataset, but the error persists, and I have not found a solution yet.

Mar 06 '24 02:03 crystalwang2020

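I resolved the "Can not load the data without registration !" problem. What I mainly did: 1. Duplicated the positive samples (jd-candidate pairs), because my positive data is limited. 2. Added negative samples, mainly to increase the number of jds, since all my samples together involved only 14 jds. Adding negative samples raised the jd count, and that resolved the problem.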
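The commenter did not share code, so the following is only a rough sketch of what those two steps might look like. The augment helper, its parameters, the (candidate, jd, label) row layout, and the sample IDs are all hypothetical choices made for illustration.

```python
import random


def augment(positives, all_jds, repeat=2, negatives_per_candidate=1, seed=0):
    """Repeat positive pairs and add sampled negative pairs (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1: duplicate each positive (candidate, jd) pair with label 1.
    rows = [(cand, jd, 1) for cand, jd in positives for _ in range(repeat)]
    seen = set(positives)
    # Step 2: for each candidate, sample jds they did not match, with label 0.
    for cand in sorted({c for c, _ in positives}):
        pool = [jd for jd in all_jds if (cand, jd) not in seen]
        chosen = rng.sample(pool, min(negatives_per_candidate, len(pool)))
        rows.extend((cand, jd, 0) for jd in chosen)
    return rows


positives = [("c1", "jd1"), ("c2", "jd2")]
all_jds = ["jd1", "jd2", "jd3", "jd4"]
rows = augment(positives, all_jds)

print(len(rows))                        # 6
print(sum(label for _, _, label in rows))  # 4
```

Whether duplicated positives are appropriate depends on the evaluation setup, since repeated rows can end up in both the training and test splits.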

Mar 06 '24 07:03 ysf-gd
