RecBole [🐛BUG] Batch evaluation is not applied

Describe the bug: While training a model I found that the effective evaluation batch size is actually 1, so it does not depend on the eval_batch_size that I set.

To Reproduce: python3 run_recbole.py --model=BPR --dataset=MY_CUSTOM_DATA --config_files=config_for_test_general_BPR.yaml

What I received

...

Training Hyper Parameters:
epochs = 100
train_batch_size = 16384
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 10
stopping_step = 3
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

...

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'TO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = NDCG@10
valid_metric_bigger = True
eval_batch_size = 16384
metric_decimal_place = 4

...

The number of users: 95882
Average actions of users: 36.13249757511916
The number of items: 50149
Average actions of items: 69.08391162160007
The number of inters: 3464420
...

Train     0: 100%|████████████████████████████████████████████████| 171/171 [00:15<00:00, 11.13it/s]
 ....
Evaluate   : 100%|███████████████████████████████████████████| 95881/95881 [04:19<00:00, 369.70it/s]

We see that the 95881 evaluation iterations essentially equal 'The number of users' (95882), i.e. one user per batch. During training, however, the batch size is clearly large (16384), giving only 171 iterations.

Desktop:

OS: Linux
RecBole Version 1.1.1
Python Version 3.10.6
PyTorch Version 2.0.1

Sep 25 '23 15:09 SergeyPetrakov

I see in some answers to other issues (for example https://github.com/RUCAIBox/RecBole/issues/1866) that you recommend increasing eval_batch_size significantly (by a factor of about 1000). I tried this and received:

train_batch_size = 16384
eval_batch_size = 163840000
The number of users: 173932
Average actions of users: 39.204006186361255
The number of items: 75004
Average actions of items: 90.91359012306174
The number of inters: 6818792

and with these parameters:

Train     0: 100%|████████████████████████████████████████████████| 337/337 [01:15<00:00, 4.47it/s]
Evaluate   : 100%|██████████████████████████████████████████████████| 80/80 [01:01<00:00, 1.31it/s]

I observe that the evaluation stage now has only 80 batches. Could you please clarify why? I do not understand where this number (80) comes from.

Sep 26 '23 09:09 SergeyPetrakov

@SergeyPetrakov Hello! The real eval_batch_size, counted in users, is eval_batch_size (the value you set) // item_number, because each test case needs to be scored against all items, so each user accounts for item_number rows. Therefore, when you set eval_batch_size = 163840000, the real eval_batch_size is 163840000 // 75004 = 2184 users, and the number of evaluation batches is 173932 / 2184 rounded up, which gives 80.
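The arithmetic above can be reproduced with a short Python sketch. The function `full_sort_batches` and its arguments are hypothetical names, not RecBole's API. It assumes only what is described in this thread: users per batch is the configured eval_batch_size floor-divided by the item count, never less than 1, and the progress bar counts batches of users.

```python
import math


def full_sort_batches(eval_batch_size: int, item_num: int, user_num: int):
    """Return (users_per_batch, number_of_batches) for full-ranking evaluation.

    Hypothetical helper, not part of RecBole: it only mirrors the rule
    described in this thread.
    """
    # Each user is scored against every item, so one user costs item_num rows.
    users_per_batch = max(eval_batch_size // item_num, 1)
    # The last batch may be only partially filled, hence the ceiling:
    # 173932 // 2184 is 79, and the 80th batch holds the remaining users.
    n_batches = math.ceil(user_num / users_per_batch)
    return users_per_batch, n_batches


# Second run in this thread: 163840000 // 75004 = 2184 users per batch.
print(full_sort_batches(163840000, 75004, 173932))  # (2184, 80)
# First run: 16384 // 50149 = 0, clamped to 1 user per batch.
print(full_sort_batches(16384, 50149, 95881))  # (1, 95881)
```

Under these assumptions both progress bars reported in this issue (95881 and 80 iterations) are accounted for.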

Sep 26 '23 13:09 TayTroye

Great! Thank you for the answer!

Sep 27 '23 10:09 SergeyPetrakov

@TayTroye If the real eval_batch_size = eval_batch_size (the value you set) // item_number comes out as 0 (e.g. 128 // 1000), what is the real eval_batch_size in that case?

Thanks!

Oct 09 '23 08:10 KpiHang

In that case the real eval_batch_size will be 1. You can check it in our code: https://github.com/RUCAIBox/RecBole/blob/00c018ed4458c20edf1d62ffc7f5f956ea5d3d42/recbole/data/dataloader/general_dataloader.py#L244
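Because of this lower bound, any eval_batch_size smaller than the item count evaluates one user at a time. A minimal sketch for choosing a value that yields a desired number of users per batch, assuming the floor-division rule described in this thread (`users_per_batch` and `min_eval_batch_size` are hypothetical helpers, not RecBole functions):

```python
def users_per_batch(eval_batch_size: int, item_num: int) -> int:
    """Users scored per evaluation batch under the rule from this thread."""
    # Floor division, clamped so that at least one user is always evaluated.
    return max(eval_batch_size // item_num, 1)


def min_eval_batch_size(target_users: int, item_num: int) -> int:
    """Smallest eval_batch_size that fits target_users users in one batch."""
    # Each user needs item_num rows, one per candidate item.
    return target_users * item_num


item_num = 1000  # illustrative value taken from the "128 // 1000" example

print(users_per_batch(128, item_num))     # 1 (128 // 1000 is 0, clamped to 1)
print(min_eval_batch_size(64, item_num))  # 64000
print(users_per_batch(64000, item_num))   # 64
```

In other words, under this rule eval_batch_size is best thought of as a budget of user-item score rows, not as a number of users.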

Oct 09 '23 09:10 TayTroye

https://github.com/RUCAIBox/RecBole/blob/00c018ed4458c20edf1d62ffc7f5f956ea5d3d42/recbole/data/dataloader/general_dataloader.py#L244-L253 @TayTroye Thank you for your reply.

I use SASRec, a sequential recommender, so self.is_sequential == True.

I'm very confused by the following:

Dataset: Steam, downloaded from the processed datasets: https://drive.google.com/drive/folders/1ahiLmzU7cGRPXf5qGMqtAChte2eYp9gI

09 Oct 16:59    INFO  steam
The number of users: 2567539
Average actions of users: 1.1852385436943873
The number of items: 14431
Average actions of items: 210.8901593901594
The number of inters: 3043145
The sparsity of the dataset: 99.99178686104865%
Remain Fields: ['user_id', 'product_id', 'timestamp']

3 cases:

eval_args.mode: pop100
train_batch_size: 128
eval_batch_size: 128
==>
Train        : 100%|█████████████████████| 2972/2972 
Evaluate   : 100%|███████████████████| 36586/36586

eval_args.mode: pop100
train_batch_size: 128
eval_batch_size: 1280
==>
Train        : 100%|█████████████████████| 2972/2972
Evaluate   : 100%|█████████████████████| 3049/3049

eval_args.mode: pop100
train_batch_size: 128
eval_batch_size: 4096
==> 
Train       : 100%|█████████████████████| 2972/2972 
Evaluate   : 100%|███████████████████████| 915/915

I don't know how the number of Evaluate iterations (e.g. 36586, 3049, 915) is computed.

I really need your help, thank you!

Oct 09 '23 10:10 KpiHang

When you set mode: pop100, the batch_size and step are set like this: https://github.com/RUCAIBox/RecBole/blob/00c018ed4458c20edf1d62ffc7f5f956ea5d3d42/recbole/data/dataloader/general_dataloader.py#L120
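One way to read the linked lines, offered as an assumption and not as a copy of RecBole's code, is that whole users are packed greedily into a batch until the row budget eval_batch_size would be exceeded, with at least one user always included. The sketch below uses the hypothetical name `neg_sample_step`. It further assumes 36586 evaluated users with 101 rows each (one positive plus 100 sampled negatives); both values are inferred from the progress bars and the pop100 setting and are not confirmed by the maintainers.

```python
import math


def neg_sample_step(rows_per_user, eval_batch_size):
    """Greedy estimate of how many users fit into one evaluation batch.

    Hypothetical reconstruction for illustration; check the linked source
    for the actual implementation.
    """
    # Budget for the worst case by considering the largest users first.
    sizes = sorted(rows_per_user, reverse=True)
    users, rows = 1, sizes[0]  # always take at least one user
    for size in sizes[1:]:
        if rows + size > eval_batch_size:
            break
        users += 1
        rows += size
    return users


# Assumption: 36586 evaluated users, each contributing 101 rows.
rows_per_user = [101] * 36586

for eval_batch_size in (128, 1280, 4096):
    step = neg_sample_step(rows_per_user, eval_batch_size)
    n_batches = math.ceil(len(rows_per_user) / step)
    print(eval_batch_size, step, n_batches)
```

Under these assumptions the sketch yields 1, 12 and 40 users per batch, which corresponds to 36586, 3049 and 915 evaluation iterations, matching the three cases reported above.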

Oct 09 '23 15:10 TayTroye

@TayTroye thx

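Hello, the code you pointed to does explain my observation, but I still have some questions.

Why does the code use self.uid2items_num * self.times here? I looked at where self.times comes from (e.g. with pop100): it is 101 for CE and 100 for BPR.

uid2items_num is an array of length user_num. What is the logic behind this code? I don't quite understand it.

Also, once I have set eval_batch_size, why isn't the number of batches simply the total number of validation samples / eval_batch_size?

It seems there is no convenient way to control the real eval_batch_size.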

Oct 10 '23 02:10 KpiHang