RecBole

[🐛BUG] Batch evaluation is not performed

Open: SergeyPetrakov opened this issue 2 years ago • 8 comments

Describe the bug
While training a model, I found that the effective evaluation batch size is 1, regardless of the eval_batch_size I set.

To Reproduce
Steps to reproduce the behavior:
python3 run_recbole.py --model=BPR --dataset=MY_CUSTOM_DATA --config_files=config_for_test_general_BPR.yaml

What I received

...

Training Hyper Parameters:
epochs = 100
train_batch_size = 16384
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 10
stopping_step = 3
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

...

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'TO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
topk = [10]
valid_metric = NDCG@10
valid_metric_bigger = True
eval_batch_size = 16384
metric_decimal_place = 4

...

The number of users: 95882
Average actions of users: 36.13249757511916
The number of items: 50149
Average actions of items: 69.08391162160007
The number of inters: 3464420
...

Train     0: 100%|████████████████████| 171/171 [00:15<00:00, 11.13it/s]
 ....
Evaluate   : 100%|████████████████████| 95881/95881 [04:19<00:00, 369.70it/s]

We see that the number of evaluation steps (95881) is essentially equal to 'The number of users' (95882), whereas during training the batch is clearly large (16384, giving only 171 steps per epoch).

Desktop (please complete the following information):

  • OS: Linux
  • RecBole Version 1.1.1
  • Python Version 3.10.6
  • PyTorch Version 2.0.1

SergeyPetrakov, Sep 25 '23 15:09

I see in some answers to other issues (for example https://github.com/RUCAIBox/RecBole/issues/1866) that you recommend increasing eval_batch_size significantly (by a factor of about 1000). I tried this and received:

train_batch_size = 16384
eval_batch_size = 163840000
The number of users: 173932
Average actions of users: 39.204006186361255
The number of items: 75004
Average actions of items: 90.91359012306174
The number of inters: 6818792

and for these parameters:

Train     0: 100%|████████████████████| 337/337 [01:15<00:00, 4.47it/s]
Evaluate   : 100%|████████████████████| 80/80 [01:01<00:00, 1.31it/s]

I observe that the evaluation stage now has only 80 batches. Could you please clarify why? I do not understand where this number (80) comes from.

SergeyPetrakov, Sep 26 '23 09:09

@SergeyPetrakov Hello! Actually, the real eval_batch_size (counted in users) = eval_batch_size (the value you set) // item_number, because each test case needs to be scored against all items, so each user accounts for item_number rows. Therefore, when you set eval_batch_size = 163840000, the real eval_batch_size = 163840000 // 75004 = 2184 users, and the number of evaluation batches is ceil(173932 / 2184) = 80. That is where the number 80 comes from.
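The arithmetic in this answer can be checked with a minimal sketch. It is not RecBole code: the function name `full_sort_eval_steps` is hypothetical, and the clamp to at least one user per batch is assumed from the behaviour reported in this thread (eval_batch_size smaller than the item count still evaluates one user at a time).

```python
import math


def full_sort_eval_steps(eval_batch_size: int, item_num: int, user_num: int):
    """Return (users per batch, number of evaluation batches) for full ranking.

    Hypothetical helper: each user is scored against all items, so one user
    consumes item_num rows of the configured eval_batch_size.
    """
    # Assumption: never fewer than one user per batch.
    users_per_batch = max(eval_batch_size // item_num, 1)
    # The last batch may be partially filled, hence the ceiling.
    steps = math.ceil(user_num / users_per_batch)
    return users_per_batch, steps


# Second run in this thread: 75004 items, 173932 users.
print(full_sort_eval_steps(163840000, 75004, 173932))  # (2184, 80)

# First run: 50149 items, 95881 evaluation steps with eval_batch_size = 16384.
print(full_sort_eval_steps(16384, 50149, 95881))  # (1, 95881)
```

Both reported progress bars (80/80 and 95881/95881) are consistent with this rule.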

TayTroye, Sep 26 '23 13:09

Great! Thank you for the answer!

SergeyPetrakov, Sep 27 '23 10:09

@TayTroye If the real eval_batch_size = eval_batch_size (the value you set) // item_number = 0 (e.g. 128 // 1000), what is the value of the real eval_batch_size in that case?

Thanks!

KpiHang, Oct 09 '23 08:10

In that case the real eval_batch_size will be 1. You can check it in our code: https://github.com/RUCAIBox/RecBole/blob/00c018ed4458c20edf1d62ffc7f5f956ea5d3d42/recbole/data/dataloader/general_dataloader.py#L244
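One practical consequence, sketched under the rule described in this thread (real batch size in users = eval_batch_size // item_number, never below 1): to evaluate a chosen number of users per batch in full-ranking mode, eval_batch_size has to be a multiple of the item count. The helper names below are hypothetical and are not part of RecBole.

```python
def users_per_eval_batch(eval_batch_size: int, item_num: int) -> int:
    """Users scored per batch; assumed to be clamped to at least 1."""
    return max(eval_batch_size // item_num, 1)


def eval_batch_size_for(target_users: int, item_num: int) -> int:
    """Smallest eval_batch_size that yields target_users users per batch."""
    return target_users * item_num


# The case asked about: 128 // 1000 == 0, so one user per batch.
print(users_per_eval_batch(128, 1000))  # 1

# With 75004 items, 256 users per batch needs a much larger setting.
size = eval_batch_size_for(256, 75004)
print(size)  # 19201024
print(users_per_eval_batch(size, 75004))  # 256
```

Memory use grows with target_users * item_num scores per batch, so the target should be chosen with the available GPU memory in mind.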

TayTroye, Oct 09 '23 09:10

https://github.com/RUCAIBox/RecBole/blob/00c018ed4458c20edf1d62ffc7f5f956ea5d3d42/recbole/data/dataloader/general_dataloader.py#L244-L253

@TayTroye Thank you for your reply.

I use SASRec, a sequential recommender, so self.is_sequential == True.

I'm very confused. The details are as follows:

Dataset: Steam, downloaded from the processed datasets: https://drive.google.com/drive/folders/1ahiLmzU7cGRPXf5qGMqtAChte2eYp9gI

09 Oct 16:59    INFO  steam
The number of users: 2567539
Average actions of users: 1.1852385436943873
The number of items: 14431
Average actions of items: 210.8901593901594
The number of inters: 3043145
The sparsity of the dataset: 99.99178686104865%
Remain Fields: ['user_id', 'product_id', 'timestamp']

3 cases:

eval_args.mode: pop100
train_batch_size: 128
eval_batch_size: 128
==>
Train      : 100%|████████████████████| 2972/2972
Evaluate   : 100%|████████████████████| 36586/36586

eval_args.mode: pop100
train_batch_size: 128
eval_batch_size: 1280
==>
Train      : 100%|████████████████████| 2972/2972
Evaluate   : 100%|████████████████████| 3049/3049

eval_args.mode: pop100
train_batch_size: 128
eval_batch_size: 4096
==>
Train      : 100%|████████████████████| 2972/2972
Evaluate   : 100%|████████████████████| 915/915

I don't know how the numbers of evaluation steps are computed, e.g. 36586, 3049, 915.

I really need your help, thank you!

KpiHang, Oct 09 '23 10:10

When you set mode: pop100, the batch size and step are set like this: https://github.com/RUCAIBox/RecBole/blob/00c018ed4458c20edf1d62ffc7f5f956ea5d3d42/recbole/data/dataloader/general_dataloader.py#L120
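The three step counts reported for pop100 are consistent with a greedy packing rule: sort users by the number of candidate rows they contribute (positive items multiplied by self.times), then keep adding users to a batch until eval_batch_size would be exceeded. The following is a minimal sketch of that idea and not the library's code. It assumes 36586 evaluated users with one positive item each and self.times = 101 (the CE value mentioned in this thread); the function name `neg_sample_eval_steps` is hypothetical.

```python
def neg_sample_eval_steps(rows_per_user, eval_batch_size):
    """Return (users per batch, number of evaluation batches).

    rows_per_user: candidate rows per evaluated user, i.e. positives * times.
    Hypothetical helper illustrating greedy packing of users into a batch.
    """
    rows = sorted(rows_per_user, reverse=True)
    users_per_batch = 1          # always take at least the largest user
    batch_rows = rows[0]
    for i in range(1, len(rows)):
        if batch_rows + rows[i] > eval_batch_size:
            break                # next user would overflow the batch
        users_per_batch = i + 1
        batch_rows += rows[i]
    # Ceiling division: the last batch may be partially filled.
    steps = -(-len(rows) // users_per_batch)
    return users_per_batch, steps


# Assumption: 36586 users, each with 1 positive + 100 sampled negatives.
rows = [101] * 36586
for eval_batch_size in (128, 1280, 4096):
    print(eval_batch_size, neg_sample_eval_steps(rows, eval_batch_size))
# 128 (1, 36586)
# 1280 (12, 3049)
# 4096 (40, 915)
```

Under these assumptions, eval_batch_size acts as a budget of candidate rows rather than a count of users, which is why 1280 yields 12 users per batch (12 * 101 = 1212 rows) and not 1280.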

TayTroye, Oct 09 '23 15:10

@TayTroye thx

Hello, the code you pointed to does explain my question, but I still have a few more.

Why is self.uid2items_num * self.times used here? I looked at where self.times comes from (e.g. for pop100): it is 101 for CE and 100 for BPR.

uid2items_num is defined above as an array of length user_num. What is the logic behind this piece of code? I don't quite understand it.

Also: after I set eval_batch_size, why isn't the number of steps simply the total number of validation samples / eval_batch_size?

It seems there is no convenient way to control the real eval_batch_size.

KpiHang, Oct 10 '23 02:10