KDD_CUP_2020_Debiasing_Rush
great job
Hi, well done! I will try to reproduce the repo. Btw, are there any metrics for the Recall term?
Thanks.
Hi, the Tianchi forum provides the official evaluation scripts; you can refer to https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6c3f29e8qbwsHt&postId=102089.
But our repo hasn't provided the offline evaluation code so far. We will provide it as soon as possible.
Hi, you can now reproduce the repo via the 'offline' branch. Just read the 'Evaluation' part of the updated README.md file.
Hi, is this the online version or the offline one? How do I deploy it?
Hi, please refer to the README file; it describes the environment and how to run everything in detail.
Could you explain what session-id means in the trained models?
Does each user_id have one session_id? Also, what does phase mean? The models are stored by phase as well. And how is the P value computed? I see that sr-gnn computes a P value. Can AUC be computed at the recall stage? Thanks!
Hello, how should the sr-gnn recall results be evaluated?
Following the README:
You can reproduce these results by checking out the 'offline' branch (git checkout offline) and running python3 code/sr_gnn_main.py and python3 code/recall_main.py in sequence.
I ran sr_gnn_main.py first, then recall_main.py.
recall_main.py hasn't finished yet; the output so far is:
train/validate split done...
create offline eval answer done...
begin read item df...
begin compute similarity using faiss...
108916
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
(2643000, 4)
(1223242, 4)
using multi_processing
phase: 7
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test, target_phase=7
user-cf user-sim begin
bi-graph item-sim begin
item-cf item-sim begin
swing item-sim begin
100%|██████████████████████████████████████████████████████████████████████| 18004/18004 [00:00<00:00, 239463.23it/s]
100%|███████████████████████████████████████████████████████████████████████| 45190/45190 [00:02<00:00, 18915.69it/s]
100%|███████████████████████████████████████████████████████████████████████| 18004/18004 [00:00<00:00, 18381.86it/s]
user-cf user-sim-pair done, pair_num=18004
100%|████████████████████████████████████████████████████████████████████████| 45190/45190 [00:07<00:00, 6030.94it/s]
27%|███████████████████▋ | 12385/45190 [00:08<00:21, 1540.94it/s]swing item-sim-pair done, pair_num=45060
100%|████████████████████████████████████████████████████████████████████████| 45190/45190 [00:30<00:00, 1458.91it/s]
bi-graph item-sim-pair done, pair_num=45190
100%|█████████████████████████████████████████████████████████████████████████| 18004/18004 [01:03<00:00, 281.99it/s]
100%|███████████████████████████████████████████████████████████████████████| 45190/45190 [00:04<00:00, 10404.96it/s]
item-cf item-sim-pair done, pair_num=45190
current_len=0
current_len=1
current_len=2
current_len=3
drop duplicates...
recall-source-num=4
do recall for swing
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
do recall for user-cf
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
do recall for bi-graph
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
do recall for item-cf
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
Is this normal? Also, I don't see anything like NDCG in the output. Thanks.
session_id is just a prefix for the saved model checkpoint name; it is specified in sr_gnn_main.py.
This is normal. The final round has three phases in total: 7, 8, and 9; you are currently running phase 7. After all three phases finish, the official evaluation code is run to compute NDCG and hitrate.
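For reference, a minimal sketch of what hitrate@50 and NDCG@50 compute when each user has a single held-out ground-truth click; the dict layout here is an illustrative assumption, not the official script's API:

import math

def evaluate_topk(rec_dict, answer_dict, k=50):
    # rec_dict:    user_id -> ranked list of recommended item_ids
    # answer_dict: user_id -> the single held-out ground-truth item_id
    hit, ndcg = 0.0, 0.0
    for user_id, truth in answer_dict.items():
        rec_items = rec_dict.get(user_id, [])[:k]
        if truth in rec_items:
            hit += 1.0
            rank = rec_items.index(truth)        # 0-based rank of the hit
            ndcg += 1.0 / math.log2(rank + 2.0)  # DCG of one relevant item
    n = len(answer_dict)
    return hit / n, ndcg / n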
To evaluate sr-gnn alone, just change cf_methods = {'item-cf', 'bi-graph', 'swing', 'user-cf'} to cf_methods = {}; then only the sr-gnn results are read.
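A note on that change (assuming the surrounding code simply iterates over cf_methods): in Python, {} is an empty dict literal, so set() is the more explicit empty set, though either iterates as empty:

cf_methods = set()  # was {'item-cf', 'bi-graph', 'swing', 'user-cf'}; disables all CF recall sources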
How should the train_click.csv data be interpreted? Could you explain it? A sample follows:
2255,18,0.984280312637961
18349,35,0.9841110719125039
4489,35,0.9842627512424618
16846,66,0.984069476211683
1888,66,0.9842584372310226
21919,80,0.9842570189950302
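For what it's worth, a minimal loading sketch, assuming the three columns are user_id, item_id, and a normalized timestamp (matching the names=['user_id', 'item_id', 'time'] used by the loading code quoted later in this thread; the per-phase file name is illustrative):

import pandas as pd

# each row is one click event: which user clicked which item at what
# (normalized) time; the CSV has no header row
click_df = pd.read_csv('underexpose_train_click-0.csv', header=None,
                       names=['user_id', 'item_id', 'time'])
print(click_df.head())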
Why does running recall_main.py produce the following error?
train/validate split done...
create offline eval answer done...
begin read item df...
108916
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
(2643000, 4)
(1223242, 4)
using multi_processing
phase: 7
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test, target_phase=7
drop duplicates...
recall-source-num=0
0
read sr-gnn results....
sr-gnn begin...
sr-gnn rec path=user_data/sr-gnn/offline/7/data/standard_rec.txt
Traceback (most recent call last):
  File "my_sr_gnn_eval2.py", line 62, in <module>
    recall_methods={'sr-gnn'})
  File "/data1/xulm1/debiasing_rush/code/recall/do_recall_multi_processing.py", line 115, in do_multi_recall_results_multi_processing
    standard_sr_gnn_recall_item_dict = read_sr_gnn_results(phase, prefix='standard', adjust_type=adjust_type)
  File "/data1/xulm1/debiasing_rush/code/recall/sr_gnn/read_sr_gnn_results.py", line 54, in read_sr_gnn_results
    with open(sr_gnn_rec_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'user_data/sr-gnn/offline/7/data/standard_rec.txt'
I only used the v1 version of sr-gnn. Looking at what runs (partial code shown):
def sr_nn_version_1(phase, item_cnt):
    model_path = './models/v1/{}/{}'.format(mode, phase)
    file_path = '{}/{}/data'.format(sr_gnn_root_dir, phase)
    sr_gnn_lib_path = 'code/recall/sr_gnn/lib'
    if os.path.exists(model_path):
        print('model_path={} exists, delete'.format(model_path))
        shutil.rmtree(model_path)
    if not os.path.exists(model_path):
        os.makedirs(model_path)
    os.system("python3 {sr_gnn_lib_path}/my_main_.py --task train --node_count {item_cnt} "
              "--checkpoint_path {model_path}/session_id --train_input {file_path}/train_item_seq_enhanced.txt "
              "--test_input {file_path}/test_item_seq.txt --gru_step 2 --epochs 10 "
              "--lr 0.001 --lr_dc 2 --dc_rate 0.1 --early_stop_epoch 3 "
              "--hidden_size 256 --batch_size 256 --max_len 20 --has_uid True "
              "--feature_init {file_path}/item_embed_mat.npy --sigma 8 ".format(sr_gnn_lib_path=sr_gnn_lib_path,
                                                                                item_cnt=item_cnt,
                                                                                model_path=model_path,
                                                                                file_path=file_path))
    # generate rec
    checkpoint_path = find_checkpoint_path(phase, version='v1')
    prefix = 'standard_'
    rec_path = '{}/{}rec.txt'.format(file_path, prefix)
    print("WOC" * 20)
    print(rec_path)
    os.system("python3 {sr_gnn_lib_path}/my_main_.py --task recommend --node_count {item_cnt} "
              "--checkpoint_path {checkpoint_path} --item_lookup {file_path}/item_lookup.txt "
              "--recommend_output {rec_path} --session_input {file_path}/test_user_sess.txt "
              "--gru_step 2 --hidden_size 256 --batch_size 256 --rec_extra_count 50 --has_uid True "
              "--feature_init {file_path}/item_embed_mat.npy "
              "--max_len 10 --sigma 8".format(sr_gnn_lib_path=sr_gnn_lib_path,
                                              item_cnt=item_cnt, checkpoint_path=checkpoint_path,
                                              file_path=file_path, rec_path=rec_path))

for phase in range(start_phase, now_phase + 1):
    print('phase={}'.format(phase))
    sr_nn_version_1(phase, phase_item_cnt_dict[phase])
The rec_path in it is the online one, while eval reads the offline one:
user_data/sr-gnn/online/7/data/standard_rec.txt
So does something need to be changed somewhere? Here?
is_use_whole_click = True if mode == 'online' else False # True if online
When running sr_gnn_main.py, set mode to offline in the conf.
After changing to offline, the relevant code and results are as follows:
mode = 'offline' # offline/online: offline validation or online submission
start_phase = 7
now_phase = 9
2020-07-11 10:28:15,286 main:INFO:The passed save_path is not a valid checkpoint: ./models/v1/offline/7/session_id
2020-07-11 10:28:15,528 main:INFO:Total Batch: 852
2020-07-11 10:28:16.024855: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-11 10:28:16,340 main:INFO:Batch 0, Loss: 10.65137
2020-07-11 10:28:22,253 main:INFO:Batch 200, Loss: 10.16564
2020-07-11 10:28:28,126 main:INFO:Batch 400, Loss: 10.05143
2020-07-11 10:28:34,057 main:INFO:Batch 600, Loss: 9.96646
2020-07-11 10:28:39,952 main:INFO:Batch 800, Loss: 9.89508
2020-07-11 10:28:42,776 main:INFO:Test Loss: 9.6436 @50, Recall: 0.1406 MRR: 0.0150
2020-07-11 10:28:43,895 main:INFO:Test Loss: 9.6414 @50, Recall: 0.1562 MRR: 0.0261
2020-07-11 10:28:44,964 main:INFO:Test Loss: 9.5721 @50, Recall: 0.1211 MRR: 0.0142
2020-07-11 10:28:46,034 main:INFO:Test Loss: 9.5152 @50, Recall: 0.0938 MRR: 0.0220
2020-07-11 10:28:47,104 main:INFO:Test Loss: 9.5149 @50, Recall: 0.1250 MRR: 0.0227
2020-07-11 10:28:48,171 main:INFO:Test Loss: 9.4880 @50, Recall: 0.1406 MRR: 0.0246
2020-07-11 10:28:48,404 main:INFO:Test Loss: 9.6964 @50, Recall: 0.1273 MRR: 0.0224
2020-07-11 10:28:48,405 main:INFO:Epoch: 0 Train Loss: 9.8782 Test Loss: 9.5816 Recall: 0.1295 MRR: 0.0208
2020-07-11 10:28:48,405 main:INFO:Best Recall and MRR: 0.1295, 0.0208 Epoch: 0, 0
2020-07-11 10:28:49,054 main:INFO:Total Batch: 852
2020-07-11 10:28:49,084 main:INFO:Batch 0, Loss: 8.70335
2020-07-11 10:28:55,017 main:INFO:Batch 200, Loss: 8.60159
2020-07-11 10:29:00,918 main:INFO:Batch 400, Loss: 8.63617
2020-07-11 10:29:06,821 main:INFO:Batch 600, Loss: 8.67724
2020-07-11 10:29:12,725 main:INFO:Batch 800, Loss: 8.70898
2020-07-11 10:29:15,292 main:INFO:Test Loss: 9.5713 @50, Recall: 0.1406 MRR: 0.0197
2020-07-11 10:29:16,364 main:INFO:Test Loss: 9.5316 @50, Recall: 0.1680 MRR: 0.0376
2020-07-11 10:29:17,434 main:INFO:Test Loss: 9.4835 @50, Recall: 0.1367 MRR: 0.0195
2020-07-11 10:29:18,505 main:INFO:Test Loss: 9.4644 @50, Recall: 0.1172 MRR: 0.0227
2020-07-11 10:29:19,573 main:INFO:Test Loss: 9.4060 @50, Recall: 0.1445 MRR: 0.0252
2020-07-11 10:29:20,643 main:INFO:Test Loss: 9.3890 @50, Recall: 0.1719 MRR: 0.0272
2020-07-11 10:29:20,876 main:INFO:Test Loss: 9.7295 @50, Recall: 0.1273 MRR: 0.0285
2020-07-11 10:29:20,876 main:INFO:Epoch: 1 Train Loss: 8.7163 Test Loss: 9.5107 Recall: 0.1458 MRR: 0.0254
2020-07-11 10:29:20,876 main:INFO:Best Recall and MRR: 0.1458, 0.0254 Epoch: 1, 1
2020-07-11 10:29:21,360 main:INFO:Total Batch: 852
2020-07-11 10:29:21,391 main:INFO:Batch 0, Loss: 7.77480
2020-07-11 10:29:27,353 main:INFO:Batch 200, Loss: 7.65744
2020-07-11 10:29:33,264 main:INFO:Batch 400, Loss: 7.64605
2020-07-11 10:29:39,178 main:INFO:Batch 600, Loss: 7.63783
2020-07-11 10:29:45,073 main:INFO:Batch 800, Loss: 7.63648
2020-07-11 10:29:47,642 main:INFO:Test Loss: 9.6038 @50, Recall: 0.1328 MRR: 0.0203
2020-07-11 10:29:48,716 main:INFO:Test Loss: 9.5496 @50, Recall: 0.1680 MRR: 0.0392
2020-07-11 10:29:49,792 main:INFO:Test Loss: 9.5350 @50, Recall: 0.1367 MRR: 0.0207
2020-07-11 10:29:50,866 main:INFO:Test Loss: 9.5222 @50, Recall: 0.1172 MRR: 0.0261
2020-07-11 10:29:51,935 main:INFO:Test Loss: 9.4398 @50, Recall: 0.1445 MRR: 0.0253
2020-07-11 10:29:53,007 main:INFO:Test Loss: 9.4178 @50, Recall: 0.1719 MRR: 0.0300
2020-07-11 10:29:53,237 main:INFO:Test Loss: 9.8248 @50, Recall: 0.1273 MRR: 0.0292
2020-07-11 10:29:53,238 main:INFO:Epoch: 2 Train Loss: 7.6361 Test Loss: 9.5561 Recall: 0.1446 MRR: 0.0270
2020-07-11 10:29:53,238 main:INFO:Best Recall and MRR: 0.1458, 0.0270 Epoch: 1, 2
2020-07-11 10:29:53,725 main:INFO:Total Batch: 852
2020-07-11 10:29:53,756 main:INFO:Batch 0, Loss: 7.37897
2020-07-11 10:29:59,665 main:INFO:Batch 200, Loss: 7.46712
2020-07-11 10:30:05,581 main:INFO:Batch 400, Loss: 7.47876
2020-07-11 10:30:11,496 main:INFO:Batch 600, Loss: 7.48664
2020-07-11 10:30:17,376 main:INFO:Batch 800, Loss: 7.49818
2020-07-11 10:30:19,940 main:INFO:Test Loss: 9.6434 @50, Recall: 0.1367 MRR: 0.0196
2020-07-11 10:30:21,010 main:INFO:Test Loss: 9.5648 @50, Recall: 0.1602 MRR: 0.0399
2020-07-11 10:30:22,100 main:INFO:Test Loss: 9.5620 @50, Recall: 0.1367 MRR: 0.0205
2020-07-11 10:30:23,174 main:INFO:Test Loss: 9.5489 @50, Recall: 0.1172 MRR: 0.0260
2020-07-11 10:30:24,245 main:INFO:Test Loss: 9.4694 @50, Recall: 0.1406 MRR: 0.0253
2020-07-11 10:30:25,320 main:INFO:Test Loss: 9.4457 @50, Recall: 0.1719 MRR: 0.0299
2020-07-11 10:30:25,551 main:INFO:Test Loss: 9.8384 @50, Recall: 0.1273 MRR: 0.0286
2020-07-11 10:30:25,551 main:INFO:Epoch: 3 Train Loss: 7.5000 Test Loss: 9.5818 Recall: 0.1433 MRR: 0.0269
2020-07-11 10:30:25,552 main:INFO:Best Recall and MRR: 0.1458, 0.0270 Epoch: 1, 2
2020-07-11 10:30:25,579 main:INFO:Total Batch: 852
2020-07-11 10:30:25,610 main:INFO:Batch 0, Loss: 7.22226
2020-07-11 10:30:31,536 main:INFO:Batch 200, Loss: 7.33525
2020-07-11 10:30:37,459 main:INFO:Batch 400, Loss: 7.34088
2020-07-11 10:30:43,366 main:INFO:Batch 600, Loss: 7.34335
2020-07-11 10:30:49,260 main:INFO:Batch 800, Loss: 7.34473
2020-07-11 10:30:51,829 main:INFO:Test Loss: 9.6634 @50, Recall: 0.1328 MRR: 0.0196
2020-07-11 10:30:52,900 main:INFO:Test Loss: 9.5799 @50, Recall: 0.1562 MRR: 0.0396
2020-07-11 10:30:53,970 main:INFO:Test Loss: 9.5829 @50, Recall: 0.1250 MRR: 0.0204
2020-07-11 10:30:55,039 main:INFO:Test Loss: 9.5698 @50, Recall: 0.1133 MRR: 0.0265
2020-07-11 10:30:56,108 main:INFO:Test Loss: 9.4902 @50, Recall: 0.1406 MRR: 0.0250
2020-07-11 10:30:57,176 main:INFO:Test Loss: 9.4646 @50, Recall: 0.1641 MRR: 0.0284
2020-07-11 10:30:57,408 main:INFO:Test Loss: 9.8624 @50, Recall: 0.1273 MRR: 0.0289
2020-07-11 10:30:57,408 main:INFO:Epoch: 4 Train Loss: 7.3447 Test Loss: 9.6019 Recall: 0.1383 MRR: 0.0266
2020-07-11 10:30:57,409 main:INFO:Best Recall and MRR: 0.1458, 0.0270 Epoch: 1, 2
2020-07-11 10:30:57,435 main:INFO:Total Batch: 852
2020-07-11 10:30:57,466 main:INFO:Batch 0, Loss: 7.36846
2020-07-11 10:31:03,410 main:INFO:Batch 200, Loss: 7.32874
2020-07-11 10:31:09,330 main:INFO:Batch 400, Loss: 7.33265
2020-07-11 10:31:15,241 main:INFO:Batch 600, Loss: 7.32835
2020-07-11 10:31:21,188 main:INFO:Batch 800, Loss: 7.33017
2020-07-11 10:31:23,767 main:INFO:Test Loss: 9.6705 @50, Recall: 0.1328 MRR: 0.0195
2020-07-11 10:31:24,840 main:INFO:Test Loss: 9.5850 @50, Recall: 0.1562 MRR: 0.0393
2020-07-11 10:31:25,911 main:INFO:Test Loss: 9.5887 @50, Recall: 0.1250 MRR: 0.0210
2020-07-11 10:31:26,980 main:INFO:Test Loss: 9.5768 @50, Recall: 0.1133 MRR: 0.0263
2020-07-11 10:31:28,049 main:INFO:Test Loss: 9.4961 @50, Recall: 0.1406 MRR: 0.0251
2020-07-11 10:31:29,119 main:INFO:Test Loss: 9.4713 @50, Recall: 0.1641 MRR: 0.0289
2020-07-11 10:31:29,349 main:INFO:Test Loss: 9.8707 @50, Recall: 0.1273 MRR: 0.0287
2020-07-11 10:31:29,349 main:INFO:Epoch: 5 Train Loss: 7.3303 Test Loss: 9.6085 Recall: 0.1383 MRR: 0.0268
2020-07-11 10:31:29,349 main:INFO:Best Recall and MRR: 0.1458, 0.0270 Epoch: 1, 2
2020-07-11 10:31:29,350 main:INFO:After 3 epochs not improve, early stop
2020-07-11 10:31:29,350 main:INFO:Best Recall and MRR: 0.1458, 0.0270 Epoch: 1, 2
CheckPoint: ./models/v1/offline/7/session_id-2556
The eval stage still has a problem:
train/validate split done...
create offline eval answer done...
begin read item df...
108916
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
(2643000, 4)
(1223242, 4)
using multi_processing
phase: 7
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test, target_phase=7
drop duplicates...
recall-source-num=0
0
read sr-gnn results....
sr-gnn begin...
sr-gnn rec path=user_data/sr-gnn/offline/7/data/standard_rec.txt
read sr-gnn done, num=1600
160000
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test, target_phase=7
train_path=user_data/offline_underexpose_train, test_path=user_data/offline_underexpose_test
(2643000, 4)
(1223242, 4)
user_id item_id time phase
2847 1 47611 0.983887 0
17907 1 76240 0.983770 0
18017 1 78142 0.983742 0
18604 1 89568 0.983763 0
19045 1 97795 0.983877 0
group done
num=159301, filter_num=699
read standard sr-gnn results done....
sr-gnn begin...
sr-gnn rec path=user_data/sr-gnn/offline/7/data/pos_node_weight_rec.txt
Traceback (most recent call last):
  File "my_sr_gnn_eval2.py", line 62, in <module>
    recall_methods={'sr-gnn'})
  File "/data1/xulm1/debiasing_rush/code/recall/do_recall_multi_processing.py", line 119, in do_multi_recall_results_multi_processing
    adjust_type=adjust_type)
  File "/data1/xulm1/debiasing_rush/code/recall/sr_gnn/read_sr_gnn_results.py", line 54, in read_sr_gnn_results
    with open(sr_gnn_rec_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'user_data/sr-gnn/offline/7/data/pos_node_weight_rec.txt'
Why is that?
The v1 results were already read successfully during eval:
sr-gnn rec path=user_data/sr-gnn/offline/7/data/standard_rec.txt
read sr-gnn done, num=1600
What failed is reading the v2 results, user_data/sr-gnn/offline/7/data/pos_node_weight_rec.txt. You didn't run the v2 sr-gnn, so of course that fails.
The v1 metrics don't seem to be displayed, though. Is something wrong?
Please read the code carefully. Evaluation only happens after the results from the multiple recall sources are merged; each source is not evaluated on its own. If you want to evaluate each source separately, you can modify the code.
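A minimal sketch of that merge step, assuming each recall source yields scored (item, score) lists per user; the weighted-sum combination is illustrative, not the repo's exact do_multi_recall_results_multi_processing logic:

def merge_recall_results(recall_results, weights, topk=500):
    # recall_results: source_name -> {user_id: [(item_id, score), ...]}
    # weights:        source_name -> relative weight of that source
    merged = {}
    for source, user_items in recall_results.items():
        w = weights.get(source, 1.0)
        for user_id, item_scores in user_items.items():
            user_merged = merged.setdefault(user_id, {})
            for item_id, score in item_scores:
                # sum weighted scores when several sources recall the same item
                user_merged[item_id] = user_merged.get(item_id, 0.0) + w * score
    # keep the topk highest-scoring candidates per user before evaluation
    return {u: sorted(s.items(), key=lambda x: x[1], reverse=True)[:topk]
            for u, s in merged.items()}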
A question: when training with mode set to offline, there seems to be much less data, only 800-odd batches, while online gives several thousand total batches. What is the reason for this? Thanks!
Offline runs train on a single phase's data, while online runs use all the data (the gap between the two is fairly stable). Please read the Evaluation section of README.md carefully.
Hello, is there a detailed explanation of the dataset somewhere? Could you share a link? Thanks.
The README already links to the official competition page.
It seems your code doesn't use underexpose_user_feat.csv? Did the organizers not provide this file?
It includes another file named underexpose_user_feat.csv, the columns of which are: user_id, user_age_level, user_gender, user_city_level
A question: with mode set to offline, when getting online_topk:
# concatenate the online train clicks with each phase's test clicks
online_total_click = pd.DataFrame()
for c in range(now_phase + 1):
    print('phase:', c)
    click_train = pd.read_csv('{}/{}-{}.csv'.format(online_train_path, train_file_prefix, c), header=None,
                              names=['user_id', 'item_id', 'time'])
    phase_test_path = "{}/{}-{}".format(test_path, test_file_prefix, c)
    click_test = pd.read_csv('{}/{}-{}.csv'.format(phase_test_path, test_file_prefix, c), header=None,
                             names=['user_id', 'item_id', 'time'])
    all_click = click_train.append(click_test)
    all_click['phase'] = c
    online_total_click = online_total_click.append(all_click)
print(online_total_click.shape)
online_total_click = online_total_click.drop_duplicates(['user_id', 'item_id', 'time'])
print(online_total_click.shape)
# the 50 globally most-clicked items, joined as a comma-separated string
online_top50_click_np = online_total_click['item_id'].value_counts().index[:50].values
online_top50_click = ','.join([str(i) for i in online_top50_click_np])
This yields the online train data merged with the offline test data. Is it OK to do that?
Hi, I'm confused about the norm:
def process_item_feat(item_feat_df):
    processed_item_feat_df = item_feat_df.copy()
    # norm
    txt_item_feat_np = processed_item_feat_df[txt_dense_feat].values
    img_item_feat_np = processed_item_feat_df[img_dense_feat].values
    txt_item_feat_np = txt_item_feat_np / np.linalg.norm(txt_item_feat_np, axis=1, keepdims=True)
    img_item_feat_np = img_item_feat_np / np.linalg.norm(img_item_feat_np, axis=1, keepdims=True)
    processed_item_feat_df[txt_dense_feat] = pd.DataFrame(txt_item_feat_np, columns=txt_dense_feat)
    processed_item_feat_df[img_dense_feat] = pd.DataFrame(img_item_feat_np, columns=img_dense_feat)
    return processed_item_feat_df
The normalization is done along rows (axis=1), but each column is a feature. Why normalize along rows? For example:
>>> xx=np.random.randn(3,4)
>>> xx
array([[ 0.18874834, 0.37971162, 0.8287003 , -0.95896989],
[-0.07977954, 0.04206023, -0.23647192, -0.36731412],
[ 1.77722951, 0.68746666, -1.77812892, 0.54136854]])
>>> np.linalg.norm(xx, axis=1, keepdims=True)
array([[1.33647832],
[0.4460633 ],
[2.66194994]])
This shows that a 2-norm is computed per row.
Look at what the features mean: a 128-dimensional image vector and a 128-dimensional text vector. The image vectors and the text vectors are each normalized separately.
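In other words, each row of the matrix is one item's 128-dimensional embedding and each column is one embedding dimension, so axis=1 scales every item vector to unit length; the inner-product search done with faiss then amounts to cosine similarity between items. A small sketch (shapes and names are illustrative):

import numpy as np

def l2_normalize_rows(mat):
    # make every row (one item's embedding) unit length
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # guard against all-zero rows
    return mat / norms

txt_vecs = l2_normalize_rows(np.random.randn(5, 128))  # stand-in for txt_dense_feat
img_vecs = l2_normalize_rows(np.random.randn(5, 128))  # stand-in for img_dense_feat
# after row normalization, txt_vecs @ txt_vecs.T is pairwise cosine similarity
print(np.allclose(np.linalg.norm(txt_vecs, axis=1), 1.0))  # True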
I did see that. My point is that your code normalizes along rows (axis=1) while each column is a feature; why not normalize along columns (axis=0)? Thanks.
Could you explain what this code means?
def cal_occ(sentence):
    # accumulate position-weighted co-occurrence counts within a sliding window
    for i, word in enumerate(sentence):
        hist_len = len(sentence)
        co_occur_dict.setdefault(word, {})
        # look at the neighbors of position i within `window` steps
        for j in range(max(i - window, 0), min(i + window, hist_len)):
            if j == i or word == sentence[j]:
                continue
            # decay the weight by 0.9 per step of distance between the two items
            loc_weight = (0.9 ** abs(i - j))
            co_occur_dict[word].setdefault(sentence[j], 0)
            co_occur_dict[word][sentence[j]] += loc_weight
In particular, this part:
for j in range(max(i - window, 0), min(i + window, hist_len)):
    if j == i or word == sentence[j]:
        continue
    loc_weight = (0.9 ** abs(i - j))
    co_occur_dict[word].setdefault(sentence[j], 0)
    co_occur_dict[word][sentence[j]] += loc_weight
Please help take a look, thanks.
Hello, does this function fill in features for items that have no txt/img features?
def fill_item_feat(processed_item_feat_df, item_content_vec_dict):
If every item already has these features, is the filling unnecessary?
Hello, can I put the phase 7, 8, and 9 data together for prediction? That is, no longer distinguishing phases, and recommending items to all users directly from the combined training set. Is that feasible?
Hi, faiss is never imported, so why is there no error? Very strange. How is that achieved? This is in the notebook file Rush_0615.ipynb.